On a harder benchmark, the picture sharpens
Across Reports #1–#3 a suspicion grew: our benchmark was too easy, and the differences we were measuring were mostly noise. So we made it harder by adding multi-step tasks that demand real logic (bracket matching, interval merging, Roman numerals, run-length decoding, a Caesar cipher), then re-ran the three models. Harder tasks leave more room for the models to separate, so real differences are easier to tell from noise, and the result is the cleanest statement of the whole investigation.
The result (24-task go_dev_bench)
| Model | Type | pass@1 |
|---|---|---|
| Qwen2.5-Coder-7B | code base (untuned) | 19/24 · 79% |
| go-dev (our SFT) | code base + Go SFT | 18/24 · 75% |
| Qwen2.5-7B-Instruct | general (non-coder) | 15/24 · 63% |
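To make the arithmetic behind the table explicit, here is a minimal sketch that recomputes the percentages and the gap to the general model. It assumes one sample per task, so that pass@1 reduces to passed/total; the `result` type and `passPercent` helper are illustrative names, not part of go_dev_bench.

```go
package main

import (
	"fmt"
	"math"
)

// result holds one row of the table above (hypothetical type for illustration).
type result struct {
	model  string
	passed int
}

// passPercent returns passed/total as a whole-number percentage.
// Assumption: one sample per task, so pass@1 is simply the pass fraction.
func passPercent(passed, total int) int {
	return int(math.Round(100 * float64(passed) / float64(total)))
}

func main() {
	const total = 24
	rows := []result{
		{"Qwen2.5-Coder-7B", 19},
		{"go-dev (our SFT)", 18},
		{"Qwen2.5-7B-Instruct", 15},
	}
	// The general (non-coder) model is the baseline for the task deltas.
	baseline := rows[2].passed
	for _, r := range rows {
		fmt.Printf("%-20s %2d/%d  %3d%%  %+d tasks vs general\n",
			r.model, r.passed, total, passPercent(r.passed, total), r.passed-baseline)
	}
}
```

With only 24 tasks, each task is worth a little over 4 percentage points, which is why the gaps are easier to read as task counts than as percentages.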
Two clean conclusions
1. Code-specialized > general — robustly
Both code models clear the general one by a wide margin (+4 and +3 tasks out of 24) on the harder set, where the general model stumbles on the multi-step tasks (counting, digit summing, bracket balancing). The GuildLM thesis, that a small model trained for the domain beats a general model of the same size, holds, and holds more convincingly the harder the task gets.
2. The win is the base, not our fine-tuning
On the hard tasks, the untuned base (19/24) edges our fine-tuned model (18/24) by a single task. Our ~200-example SFT, drawn from simpler functions, slightly degrades the base on the hardest problems (Roman numerals, RLE decoding), where the capability the base acquired in pretraining matters most.
This is the central, hard-won finding of the whole arc, and it is not the one we set out to prove. For Go code generation, the lever that matters is choosing a code-specialized base over a general one. A few hundred fine-tuning examples do not improve an already-strong coder; they helped only on the one task where the base was actually weak: writing tests (Report #2).
So where does the real gain come from?
Two places, both bigger than per-role fine-tuning:
- Base choice: a code base over a general one is a free, deterministic +3 to +4 tasks (out of 24) on this benchmark. Start there.
- The agent algorithm: wrapping the model in a compile-and-test feedback loop (generate → build → vet → test → fix), with deterministic quality gates (no third-party imports, no cross-file redeclaration, gofmt-clean, tests that actually assert). This is where a small model is made reliable, and it's noise-free because the Go toolchain is the judge.
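The feedback loop and two of the gates can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual agent: `gate`, `stdlibOnly`, `gofmtClean`, `firstFailure`, and `fixLoop` are hypothetical names, the model call is replaced by a canned `generate` function, and the "dot in the first path element" test for third-party imports is a common heuristic rather than an official rule. The build, vet, and test stages are omitted; they would slot in as further gates that run the Go toolchain on the candidate.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"go/format"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// gate inspects one candidate source file and returns an error describing
// the problem it found, or nil if the candidate passes. (Hypothetical type.)
type gate func(src []byte) error

// stdlibOnly rejects imports whose first path element contains a dot,
// e.g. "github.com/...". Assumption: this heuristic is good enough here.
func stdlibOnly(src []byte) error {
	f, err := parser.ParseFile(token.NewFileSet(), "candidate.go", src, parser.ImportsOnly)
	if err != nil {
		return fmt.Errorf("parse: %w", err)
	}
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return fmt.Errorf("import path %s: %w", imp.Path.Value, err)
		}
		first, _, _ := strings.Cut(path, "/")
		if strings.Contains(first, ".") {
			return fmt.Errorf("third-party import %q", path)
		}
	}
	return nil
}

// gofmtClean rejects source that go/format would rewrite.
func gofmtClean(src []byte) error {
	formatted, err := format.Source(src)
	if err != nil {
		return fmt.Errorf("format: %w", err)
	}
	if !bytes.Equal(formatted, src) {
		return errors.New("not gofmt-clean")
	}
	return nil
}

// firstFailure runs the gates in order and returns the first error, if any.
func firstFailure(src []byte, gates []gate) error {
	for _, g := range gates {
		if err := g(src); err != nil {
			return err
		}
	}
	return nil
}

// fixLoop asks generate for a candidate, checks it against the gates, and
// feeds the failure message back until one passes or attempts run out.
func fixLoop(generate func(feedback string) []byte, gates []gate, maxAttempts int) ([]byte, error) {
	feedback := ""
	for i := 0; i < maxAttempts; i++ {
		src := generate(feedback)
		err := firstFailure(src, gates)
		if err == nil {
			return src, nil
		}
		feedback = err.Error()
	}
	return nil, fmt.Errorf("no passing candidate after %d attempts; last failure: %s", maxAttempts, feedback)
}

func main() {
	// Stand-in for a model call: the first draft is unindented, the retry is clean.
	drafts := [][]byte{
		[]byte("package p\n\nfunc F() int {\nreturn 1\n}\n"),
		[]byte("package p\n\nfunc F() int {\n\treturn 1\n}\n"),
	}
	calls := 0
	generate := func(feedback string) []byte {
		d := drafts[calls%len(drafts)]
		calls++
		return d
	}
	_, err := fixLoop(generate, []gate{stdlibOnly, gofmtClean}, 3)
	fmt.Println("attempts:", calls, "error:", err)
}
```

Every gate here is a pure function of the source bytes, so the same candidate always gets the same verdict. That is the property the loop relies on: the judge is deterministic even when the generator is not.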
We chased per-role fine-tuning hard, measured it honestly, and it pointed us somewhere better. That's the point of keeping a public log: the data redirects the work. Next, we double down on the agent loop and on benchmarks harder still.