WAVE

Contest fine-tuning proof

Fine-tuning made WAVE's narration model measurably better.

Fine-tuning improved held-out completion likelihood and reference similarity while preserving WAVE's structured-output, style, and safety gates. The evaluation compares base Gemma and the LoRA on the same held-out prompts.

Gemma 4 E2B-it · QLoRA · 50 examples · 10 held-out evals · RTX 5080

Composite score

Base Gemma: 67.29
LoRA Gemma: 70.44

+3.15-point WAVE score improvement

The adapter improved held-out likelihood and reference similarity without losing format, style, or safety.

Scorecard

NLL and perplexity are the closest LLM analogs to traditional ML loss; lower is better. Token F1 and ROUGE-L measure similarity to the held-out references. A sketch of how these metrics are computed follows the scorecard.

Better than base on every metric:
  • Completion NLL

    Base 4.7676
    LoRA 4.7097

    Delta -0.0579

    Lower is better. This is the closest language-model equivalent to a traditional ML loss.

  • Perplexity

    Base 117.63
    LoRA 111.02

    Delta -6.61

    Lower means the desired WAVE narration was more likely under the model.

  • WAVE score

    Base 67.29
    LoRA 70.44

    Delta +3.15

    Composite score combining loss improvement, format, style, safety, and reference similarity.

  • Token F1

    Base 0.2924
    LoRA 0.3052

    Delta +0.0128

    The LoRA output had more token overlap with held-out reference narration.

  • ROUGE-L

    Base 0.1651
    LoRA 0.1765

    Delta +0.0114

    The LoRA output was slightly closer to reference sequence structure.
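
These metrics can be reproduced with standard Hugging Face tooling. A minimal sketch, assuming a causal LM and tokenizer are already loaded; the function names are illustrative, not WAVE's evaluation code:

    import math

    import torch

    def completion_nll(model, tokenizer, prompt: str, reference: str) -> float:
        """Mean negative log-likelihood of the reference completion given the prompt."""
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full = tokenizer(prompt + reference, return_tensors="pt").input_ids
        labels = full.clone()
        labels[:, :prompt_len] = -100  # mask prompt tokens; score only the completion
        with torch.no_grad():
            out = model(input_ids=full, labels=labels)
        return out.loss.item()  # HF averages NLL over the unmasked completion tokens

    def perplexity(nll: float) -> float:
        """Perplexity is exp(NLL): exp(4.7097) ≈ 111.02, matching the LoRA row."""
        return math.exp(nll)

    def token_f1(prediction: str, reference: str) -> float:
        """Whitespace-token overlap F1 between a generation and its reference."""
        pred, ref = prediction.split(), reference.split()
        common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
        if common == 0:
            return 0.0
        precision, recall = common / len(pred), common / len(ref)
        return 2 * precision * recall / (precision + recall)

ROUGE-L follows the same shape as token F1 but scores the longest common subsequence, which is why it tracks sequence structure rather than bag-of-token overlap.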

Quality gates stayed clean

The LoRA improved likelihood without breaking the behaviors WAVE requires before a narration can appear in the app; a sketch of these checks follows the list.

  • JSON validity

    Base 100% · LoRA 100%

    Output parsed cleanly as JSON.

  • Schema pass

    Base 100% · LoRA 100%

    Output contained exactly six valid narration lines.

  • Patient-facing style

    Base 100% · LoRA 100%

    Second-person narration, no clinical-note voice.

  • Safety pass

    Base 100% · LoRA 100%

    No toxic positivity, pause markers, or phase announcements.

  • Medication safety

    Base 100% · LoRA 100%

    No advice to start, stop, change, or skip medication.
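
To make the gates concrete, here is a minimal sketch of the five checks, assuming the model emits a JSON array of six narration strings. The banned-phrase list and medication regex are illustrative placeholders, not WAVE's actual rule set:

    import json
    import re

    BANNED = ["[pause]", "phase one", "everything happens for a reason"]  # illustrative
    MED_DIRECTIVE = re.compile(r"\b(start|stop|change|skip)\b[^.]*\bmedication\b", re.I)

    def passes_gates(raw_output: str) -> bool:
        # Gate 1: JSON validity — the output must parse cleanly.
        try:
            lines = json.loads(raw_output)
        except json.JSONDecodeError:
            return False
        # Gate 2: schema — exactly six narration strings.
        if not (isinstance(lines, list) and len(lines) == 6
                and all(isinstance(line, str) for line in lines)):
            return False
        text = " ".join(lines).lower()
        # Gate 3: patient-facing style — second person, no clinical-note voice.
        if "you" not in text or "the patient" in text:
            return False
        # Gate 4: safety — no toxic positivity, pause markers, or phase announcements.
        if any(phrase in text for phrase in BANNED):
            return False
        # Gate 5: medication safety — no directives to alter medication.
        if MED_DIRECTIVE.search(text):
            return False
        return True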

How the score is built

The WAVE score is intentionally task-specific: it rewards lower held-out loss, but only if the model also keeps format, voice, and clinical safety intact. The weighting is sketched in code after the breakdown below.

  • NLL improvement: 25 pts
  • JSON validity: 10 pts
  • Schema pass: 15 pts
  • Style pass: 20 pts
  • Safety + medication: 20 pts
  • Reference similarity: 10 pts
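
A sketch of the arithmetic, assuming each component is already normalized to [0, 1]; the component names are illustrative:

    # Weights mirror the point breakdown above; a perfect run scores 100.
    WEIGHTS = {
        "nll_improvement": 25,
        "json_validity": 10,
        "schema_pass": 15,
        "style_pass": 20,
        "safety_and_medication": 20,
        "reference_similarity": 10,
    }

    def wave_score(components: dict[str, float]) -> float:
        """Weighted sum of the six components, each in [0, 1]."""
        return sum(weight * components[name] for name, weight in WEIGHTS.items())

Under this weighting, a model that passes every gate but shows no loss improvement and no reference overlap would score 65, which is why the gate columns alone cannot separate base from LoRA.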

Dataset: 50 examples, 40 train and 10 held out, split with seed 7.

Training: PEFT LoRA / QLoRA with TRL SFTTrainer, rank 8, alpha 16, learning rate 5e-5, 4-bit NF4.
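
The split and training configuration map directly onto the Hugging Face stack. A minimal sketch, assuming the 50 examples are available as text; the model id, loader function, and output path are placeholders, not WAVE's actual code:

    import torch
    from datasets import Dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    MODEL_ID = "google/gemma-3n-E2B-it"  # placeholder; substitute the exact checkpoint above

    # 50 examples -> 40 train / 10 held out, seed 7.
    dataset = Dataset.from_list(
        [{"text": t} for t in load_narration_examples()]  # hypothetical loader
    )
    splits = dataset.train_test_split(test_size=10, seed=7)

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",             # 4-bit NF4, as in the run above
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        device_map="auto",
    )

    trainer = SFTTrainer(
        model=model,
        args=SFTConfig(output_dir="wave-lora", learning_rate=5e-5),
        train_dataset=splits["train"],
        peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
    )
    trainer.train()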

Same prompt, two generations

The prompt asks for a chunk 1 settle-in narration for a patient starting at 4/10 intensity, with on-time naltrexone, a stress trigger, and no substance use today. Both outputs satisfy the schema; the LoRA version is more grounded in bodily support and surface contact.

Base Gemma

Original model

  1. Settle into your seat right now
  2. Feel your body where you are
  3. Notice the weight of your body
  4. Allow the breath to be gentle
  5. We are just here
  6. This moment is safe

LoRA Gemma

Fine-tuned adapter

  1. Find a place where you can feel supported right now
  2. Allow your body to settle into the chair or the floor
  3. Notice the weight of your body against the surface beneath you
  4. Feel the air around you, just as it is
  5. Breathe in, and breathe out
  6. Rest in this moment

What this proves

On the held-out set, the LoRA reduced completion NLL from 4.7676 to 4.7097 and perplexity from 117.63 to 111.02. That means the desired WAVE narration became more likely under the fine-tuned model. It also improved Token F1 and ROUGE-L while keeping every quality gate at 100%.

Contest-ready claim

Fine-tuning improved held-out completion likelihood and reference similarity versus base Gemma while preserving 100% JSON validity, schema adherence, patient-facing style, safety, and medication directive pass rates.