Report #4 · 2026-06-28

On a harder benchmark, the picture sharpens

Across Reports #1–#3 a suspicion grew: our benchmark was too easy, and the differences we were measuring were mostly noise. So we made it harder — added multi-step tasks that demand real logic (bracket matching, interval merging, Roman numerals, run-length decoding, a Caesar cipher) — and re-ran the three models. A harder test cuts noise, and the result is the cleanest statement of the whole investigation.

The result (24-task go_dev_bench)

Model	Type	pass@1
Qwen2.5-Coder-7B	code base (untuned)	19/24 · 79%
go-dev (our SFT)	code base + Go SFT	18/24 · 75%
Qwen2.5-7B-Instruct	general (non-coder)	15/24 · 63%
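
The percentages and gaps follow directly from the pass counts in the table. The snippet below is a minimal sketch of that arithmetic, assuming pass@1 here means one attempt per task, so the score is simply tasks passed over tasks attempted. The names `passed` and `pass_at_1` are illustrative and not taken from the benchmark code.

```python
# Pass counts copied from the table above (24-task go_dev_bench).
TOTAL_TASKS = 24
passed = {
    "Qwen2.5-Coder-7B": 19,
    "go-dev (our SFT)": 18,
    "Qwen2.5-7B-Instruct": 15,
}
BASELINE = "Qwen2.5-7B-Instruct"  # the general (non-coder) model


def pass_at_1(n_passed: int, total: int = TOTAL_TASKS) -> float:
    """Fraction of tasks solved, assuming a single attempt per task."""
    return n_passed / total


for model, n in passed.items():
    gap = n - passed[BASELINE]  # gap to the general model, in tasks
    print(f"{model}: {n}/{TOTAL_TASKS} = {pass_at_1(n):.1%} ({gap:+d} tasks vs general)")
```

At one decimal place the general model scores 62.5%, which the table rounds up to 63%.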

Two clean conclusions

1. Code-specialized > general — robustly

Both code models clear the general one by a wide margin (+4 and +3 tasks out of 24) on the harder set, where the general model stumbles on the multi-step tasks (counting, digit summing, bracket balancing). The GuildLM thesis, that a small model trained for the domain beats a general model of the same size, holds, and holds more convincingly the harder the task gets.

2. The win is the base, not our fine-tuning

On the hard tasks, the untuned base (19/24) edges our fine-tuned model (18/24) by one task. Our ~200-example SFT, drawn from simpler functions, slightly degrades the base on the hardest problems (Roman numerals, RLE decoding), where the base's raw capability from pretraining matters most.

This is the central, hard-won finding of the whole arc, and it is not the one we set out to prove. For Go code generation, the lever that matters is choosing a code-specialized base over a general one. A few hundred fine-tuning examples did not improve an already-strong coder; they helped only on the one task the base was actually weak at: writing tests (Report #2).

So where does the real gain come from?

Two places, both bigger than per-role fine-tuning:

Base choice: a code base over a general one is a free, deterministic +3 to +4 tasks out of 24 on this benchmark. Start there.
The agent algorithm: wrapping the model in a compile-and-test feedback loop (generate → build → vet → test → fix), with deterministic quality gates (no third-party imports, no cross-file redeclaration, gofmt-clean, tests that actually assert). This is where a small model is made reliable, and it's noise-free because the Go toolchain is the judge.
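
The generate → build → vet → test → fix loop can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the project's actual implementation. `generate` is a hypothetical callable that wraps the model and returns a mapping of file name to Go source, `workdir` is assumed to be an existing Go module, and the import check is a regex heuristic (a first path element containing a dot is treated as third-party). Only `go build`, `go vet`, `go test` and `gofmt -l` are real commands; every other name is made up for the example.

```python
import os
import re
import subprocess
from typing import Callable, Dict, List, Optional, Tuple

# Real Go toolchain commands, in the build -> vet -> test order from the text.
TOOLCHAIN_STEPS: List[List[str]] = [
    ["go", "build", "./..."],
    ["go", "vet", "./..."],
    ["go", "test", "./..."],
]

Runner = Callable[[List[str], str], Tuple[int, str]]

# Heuristic: one import spec per line, with an optional alias. Not a Go parser.
IMPORT_RE = re.compile(
    r'^[ \t]*(?:import[ \t]+)?(?:[\w.]+[ \t]+)?"([^"]+)"[ \t]*(?://.*)?$',
    re.MULTILINE,
)


def run_cmd(cmd: List[str], workdir: str) -> Tuple[int, str]:
    """Run one command inside workdir; return (exit code, combined output)."""
    proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def third_party_imports(source: str) -> List[str]:
    """Import paths whose first element contains a dot (e.g. github.com/...)."""
    return [p for p in IMPORT_RE.findall(source) if "." in p.split("/")[0]]


def check(files: Dict[str, str], workdir: str, run: Runner = run_cmd) -> Optional[str]:
    """Return None if every gate passes, else a message to feed back to the model."""
    for name, source in files.items():
        bad = third_party_imports(source)
        if bad:
            return f"{name}: third-party imports are not allowed: {', '.join(bad)}"
    for name, source in files.items():
        with open(os.path.join(workdir, name), "w") as f:
            f.write(source)
    # gofmt -l lists files whose formatting differs; empty output means clean.
    code, output = run(["gofmt", "-l", "."], workdir)
    if code != 0 or output.strip():
        return f"gofmt gate failed:\n{output}"
    for cmd in TOOLCHAIN_STEPS:
        code, output = run(cmd, workdir)
        if code != 0:
            return f"{' '.join(cmd)} failed:\n{output}"
    return None


def agent_loop(
    generate: Callable[[Optional[str]], Dict[str, str]],  # hypothetical model wrapper
    workdir: str,
    run: Runner = run_cmd,
    max_rounds: int = 3,  # illustrative budget, not a value from the report
) -> Optional[int]:
    """Regenerate until all gates pass; return the passing round, or None."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        files = generate(feedback)  # the "fix" step: the model sees the last failure
        feedback = check(files, workdir, run)
        if feedback is None:
            return round_no
    return None
```

The command runner is passed in as a parameter so the loop can be exercised with a stub instead of a real Go installation. The sketch leaves out the "tests that actually assert" gate, and it relies on `go build` to reject redeclared identifiers within a package.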

We chased per-role fine-tuning hard, measured it honestly, and it pointed us somewhere better. That's the point of keeping a public log: the data redirects the work. Next, we double down on the agent loop and on benchmarks harder still.

All three models measured locally on an M1 Max with Apple MLX — total cloud spend: $0. Code, datasets and benchmarks: github.com/guildlm/guild-code.