WAVE

Contest fine-tuning proof

Fine-tuning made WAVE's narration model measurably better.

Fine-tuning improved held-out completion likelihood and reference similarity while preserving WAVE's structured-output, style, and safety gates. The evaluation compares base Gemma and the LoRA on the same held-out prompts.

Gemma 4 E2B-it · QLoRA · 50 examples · 10 held-out evals · RTX 5080

Composite score

Base Gemma: 67.29
LoRA Gemma: 70.44

+3.15-point WAVE score improvement

The adapter improved held-out likelihood and reference similarity without losing format, style, or safety.

Scorecard

NLL and perplexity are the closest LLM analogs to traditional ML loss; lower is better. Token F1 and ROUGE-L measure similarity to the held-out references. A sketch of how these metrics are computed follows the scorecard.

Better than base on every metric:
  • Completion NLL

    Base 4.7676
    LoRA 4.7097

    Delta -0.0579

    Lower is better. This is the closest language-model equivalent to a traditional ML loss.

  • Perplexity

    Base 117.63
    LoRA 111.02

    Delta -6.61

    Lower means the desired WAVE narration was more likely under the model.

  • WAVE score

    Base 67.29
    LoRA 70.44

    Delta +3.15

    Composite score combining loss improvement, format, style, safety, and reference similarity.

  • Token F1

    Base 0.2924
    LoRA 0.3052

    Delta +0.0128

    The LoRA output had more token overlap with held-out reference narration.

  • ROUGE-L

    Base 0.1651
    LoRA 0.1765

    Delta +0.0114

    The LoRA output was slightly closer to reference sequence structure.
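
These metrics can be reproduced with standard Hugging Face tooling. A minimal sketch, assuming a causal LM and tokenizer are already loaded; the function names are illustrative, not WAVE's evaluation code:

    import math

    import torch

    def completion_nll(model, tokenizer, prompt: str, reference: str) -> float:
        """Mean negative log-likelihood of the reference completion given the prompt."""
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full = tokenizer(prompt + reference, return_tensors="pt").input_ids
        labels = full.clone()
        labels[:, :prompt_len] = -100  # mask prompt tokens; score only the completion
        with torch.no_grad():
            out = model(input_ids=full, labels=labels)
        return out.loss.item()  # HF averages NLL over the unmasked completion tokens

    def perplexity(nll: float) -> float:
        """Perplexity is exp(NLL): exp(4.7097) ≈ 111.02, matching the LoRA row."""
        return math.exp(nll)

    def token_f1(prediction: str, reference: str) -> float:
        """Whitespace-token overlap F1 between a generation and its reference."""
        pred, ref = prediction.split(), reference.split()
        common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
        if common == 0:
            return 0.0
        precision, recall = common / len(pred), common / len(ref)
        return 2 * precision * recall / (precision + recall)

ROUGE-L follows the same shape as token F1 but scores the longest common subsequence, which is why it tracks sequence structure rather than bag-of-token overlap.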

Quality gates stayed clean

The LoRA improved likelihood without breaking the behaviors WAVE requires before a narration can appear in the app; a sketch of these checks follows the list.

  • JSON validity

    Base 100% · LoRA 100%

    Output parsed cleanly as JSON.

  • Schema pass

    Base 100% · LoRA 100%

    Output contained exactly six valid narration lines.

  • Patient-facing style

    Base 100% · LoRA 100%

    Second-person narration, no clinical-note voice.

  • Safety pass

    Base 100% · LoRA 100%

    No toxic positivity, pause markers, or phase announcements.

  • Medication safety

    Base 100% · LoRA 100%

    No advice to start, stop, change, or skip medication.
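
To make the gates concrete, here is a minimal sketch of the five checks, assuming the model emits a JSON array of six narration strings. The banned-phrase list and medication regex are illustrative placeholders, not WAVE's actual rule set:

    import json
    import re

    BANNED = ["[pause]", "phase one", "everything happens for a reason"]  # illustrative
    MED_DIRECTIVE = re.compile(r"\b(start|stop|change|skip)\b[^.]*\bmedication\b", re.I)

    def passes_gates(raw_output: str) -> bool:
        # Gate 1: JSON validity — the output must parse cleanly.
        try:
            lines = json.loads(raw_output)
        except json.JSONDecodeError:
            return False
        # Gate 2: schema — exactly six narration strings.
        if not (isinstance(lines, list) and len(lines) == 6
                and all(isinstance(line, str) for line in lines)):
            return False
        text = " ".join(lines).lower()
        # Gate 3: patient-facing style — second person, no clinical-note voice.
        if "you" not in text or "the patient" in text:
            return False
        # Gate 4: safety — no toxic positivity, pause markers, or phase announcements.
        if any(phrase in text for phrase in BANNED):
            return False
        # Gate 5: medication safety — no directives to alter medication.
        if MED_DIRECTIVE.search(text):
            return False
        return True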

How the score is built

The WAVE score is intentionally task-specific: it rewards lower held-out loss, but only if the model also keeps format, voice, and clinical safety intact. The weighting is sketched in code after the breakdown below.

  • NLL improvement: 25 pts
  • JSON validity: 10 pts
  • Schema pass: 15 pts
  • Style pass: 20 pts
  • Safety + medication: 20 pts
  • Reference similarity: 10 pts
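
A sketch of the arithmetic, assuming each component is already normalized to [0, 1]; the component names are illustrative:

    # Weights mirror the point breakdown above; a perfect run scores 100.
    WEIGHTS = {
        "nll_improvement": 25,
        "json_validity": 10,
        "schema_pass": 15,
        "style_pass": 20,
        "safety_and_medication": 20,
        "reference_similarity": 10,
    }

    def wave_score(components: dict[str, float]) -> float:
        """Weighted sum of the six components, each in [0, 1]."""
        return sum(weight * components[name] for name, weight in WEIGHTS.items())

Under this weighting, a model that passes every gate but shows no loss improvement and no reference overlap would score 65, which is why the gate columns alone cannot separate base from LoRA.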

Dataset: 50 examples, 40 train and 10 held out, split with seed 7.

Training: PEFT LoRA / QLoRA with TRL SFTTrainer, rank 8, alpha 16, learning rate 5e-5, 4-bit NF4.
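
The split and training configuration map directly onto the Hugging Face stack. A minimal sketch, assuming the 50 examples are available as text; the model id, loader function, and output path are placeholders, not WAVE's actual code:

    import torch
    from datasets import Dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    MODEL_ID = "google/gemma-3n-E2B-it"  # placeholder; substitute the exact checkpoint above

    # 50 examples -> 40 train / 10 held out, seed 7.
    dataset = Dataset.from_list(
        [{"text": t} for t in load_narration_examples()]  # hypothetical loader
    )
    splits = dataset.train_test_split(test_size=10, seed=7)

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",             # 4-bit NF4, as in the run above
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        device_map="auto",
    )

    trainer = SFTTrainer(
        model=model,
        args=SFTConfig(output_dir="wave-lora", learning_rate=5e-5),
        train_dataset=splits["train"],
        peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
    )
    trainer.train()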

Same prompt, two generations

The prompt asks for a chunk 1 settle-in narration for a patient starting at 4/10 intensity, with on-time naltrexone, a stress trigger, and no substance use today. Both outputs satisfy the schema; the LoRA version is more grounded in bodily support and surface contact.

Base Gemma

Original model

  1. Settle into your seat right now
  2. Feel your body where you are
  3. Notice the weight of your body
  4. Allow the breath to be gentle
  5. We are just here
  6. This moment is safe

LoRA Gemma

Fine-tuned adapter

  1. Find a place where you can feel supported right now
  2. Allow your body to settle into the chair or the floor
  3. Notice the weight of your body against the surface beneath you
  4. Feel the air around you, just as it is
  5. Breathe in, and breathe out
  6. Rest in this moment

What this proves

On the held-out set, the LoRA reduced completion NLL from 4.7676 to 4.7097 and perplexity from 117.63 to 111.02. That means the desired WAVE narration became more likely under the fine-tuned model. It also improved Token F1 and ROUGE-L while keeping every quality gate at 100%.

Contest-ready claim

Fine-tuning improved held-out completion likelihood and reference similarity versus base Gemma while preserving 100% JSON validity, schema adherence, patient-facing style, safety, and medication directive pass rates.