Contest fine-tuning proof
Fine-tuning improved held-out completion likelihood and reference similarity while preserving WAVE's structured-output, style, and safety gates. The evaluation compares base Gemma and the LoRA on the same held-out prompts.
Composite score
Base Gemma
67.29
LoRA Gemma
70.44
+3.15 point WAVE score improvement
The adapter improved held-out likelihood and reference similarity without losing format, style, or safety.
Negative log-likelihood (NLL) and perplexity are the closest language-model analogs to a traditional ML training loss. Lower is better for both. Token F1 and ROUGE-L measure similarity to the reference narration.
Completion NLL
Delta -0.0579
Lower is better. This is the closest language-model equivalent to a traditional ML loss.
Perplexity
Delta -6.61
Lower means the desired WAVE narration was more likely under the model.
WAVE score
Delta +3.15
Composite score combining loss improvement, format, style, safety, and reference similarity.
Token F1
Delta +0.0128
The LoRA output had more token overlap with held-out reference narration.
ROUGE-L
Delta +0.0114
The LoRA output was slightly closer to reference sequence structure.
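The likelihood metrics above are mechanical to reproduce. A minimal sketch, assuming the mean per-token NLL is already computed and using whitespace tokenization for token F1 (perplexity is the exponential of the mean NLL; token F1 is the harmonic mean of bag-of-token precision and recall):

```python
import math
from collections import Counter

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of bag-of-token precision and recall over whitespace tokens."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# The reported NLLs imply the reported perplexities (up to rounding):
print(f"{perplexity(4.7676):.1f}")  # base Gemma, ~117.6
print(f"{perplexity(4.7097):.1f}")  # LoRA Gemma, ~111.0
```

Because perplexity is just exp(NLL), the NLL delta of -0.0579 and the perplexity delta of -6.61 are two views of the same improvement.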
The LoRA improved likelihood without breaking the behaviors WAVE needs before a narration can appear in the app.
JSON validity
Output parsed cleanly as JSON.
Schema pass
Output contained exactly six valid narration lines.
Patient-facing style
Second-person narration, no clinical-note voice.
Safety pass
No toxic positivity, pause markers, or phase announcements.
Medication safety
No advice to start, stop, change, or skip medication.
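A simplified sketch of how gates like these could be checked, assuming the model emits a JSON object with a "lines" array. The field name, the six-line rule, and the keyword patterns are illustrative assumptions, not WAVE's actual implementation:

```python
import json
import re

# Assumed directive pattern: "start/stop/change/skip ... medication".
FORBIDDEN = re.compile(r"\b(start|stop|change|skip)\b.*\bmedication\b", re.IGNORECASE)

def gate_report(raw_output: str) -> dict:
    """Run simplified format/style/safety gates over one model output."""
    report = {"json_valid": False, "schema_pass": False,
              "second_person": False, "medication_safe": False}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return report
    report["json_valid"] = True
    lines = parsed.get("lines")
    if isinstance(lines, list) and len(lines) == 6 and all(
        isinstance(line, str) and line.strip() for line in lines
    ):
        report["schema_pass"] = True
        text = " ".join(lines)
        # Patient-facing style: expect second-person address somewhere.
        report["second_person"] = bool(re.search(r"\byou(r)?\b", text, re.IGNORECASE))
        # Medication safety: no directive to start/stop/change/skip medication.
        report["medication_safe"] = FORBIDDEN.search(text) is None
    return report
```

Each gate is binary, so a single failed check can disqualify a narration before it reaches the app, regardless of how likely the model found it.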
The WAVE score is intentionally task-specific: it rewards lower held-out loss, but only if the model also keeps format, voice, and clinical safety intact.
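The exact WAVE weighting is not shown here, but the shape of such a composite is a weighted sum of normalized components with the gates acting as hard zeroes. A hypothetical sketch; the weights and component names are assumptions, not the real formula:

```python
def composite_score(loss_improvement: float, similarity: float,
                    gates_passed: bool,
                    w_loss: float = 0.6, w_sim: float = 0.4) -> float:
    """Hypothetical WAVE-style composite: reward normalized loss improvement
    and reference similarity, but only when every gate passes."""
    if not gates_passed:
        return 0.0  # a single gate failure disqualifies the output
    return 100.0 * (w_loss * loss_improvement + w_sim * similarity)
```

The key design property, which the sketch preserves, is that likelihood gains cannot buy back a format, style, or safety failure.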
Dataset: 50 examples, 40 train and 10 held out, split with seed 7.
Training: PEFT LoRA / QLoRA with TRL SFTTrainer, rank 8, alpha 16, learning rate 5e-5, 4-bit NF4.
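The 40/10 split is reproducible from the seed. A minimal sketch, assuming the 50 examples live in a list and that a seeded shuffle-then-slice matches the author's split method (random.Random(7) gives a deterministic shuffle independent of global state):

```python
import random

def split_dataset(examples: list, n_train: int = 40, seed: int = 7):
    """Deterministically shuffle with the given seed, then split."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

train, held_out = split_dataset(list(range(50)))
print(len(train), len(held_out))  # 40 10
```

Using a local random.Random instance rather than the module-level functions keeps the split stable even if other code seeds or consumes the global generator.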
Example: chunk 1 settle-in narration for a patient starting at 4/10 intensity, with on-time naltrexone, a stress trigger, and no substance use today. Both outputs satisfy the schema; the LoRA version is more grounded in bodily support and surface contact.
Original model
Fine-tuned adapter
On the held-out set, the LoRA reduced completion NLL from 4.7676 to 4.7097 and perplexity from 117.63 to 111.02. That means the desired WAVE narration became more likely under the fine-tuned model. It also improved Token F1 and ROUGE-L while keeping every quality gate at 100%.
Contest-ready claim
Fine-tuning improved held-out completion likelihood and reference similarity versus base Gemma while preserving 100% JSON validity, schema adherence, patient-facing style, safety, and medication directive pass rates.