Report #4 · 2026-06-28

On a harder benchmark, the picture sharpens

Across Reports #1–#3 a suspicion grew: our benchmark was too easy, and the differences we were measuring were mostly noise. So we made it harder — added multi-step tasks that demand real logic (bracket matching, interval merging, Roman numerals, run-length decoding, a Caesar cipher) — and re-ran the three models. A harder test cuts noise, and the result is the cleanest statement of the whole investigation.
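To give a sense of what a multi-step task of this kind involves, here is a minimal sketch of a bracket-matching task in Go. It is an illustration only: the names balanced and passes and the test cases are hypothetical and are not taken from go_dev_bench. The sketch assumes that a task counts as solved under pass@1 only if the model's single attempt gets every case right.

```go
package main

import "fmt"

// balanced reports whether every bracket in s is closed by the matching
// bracket in the right order. Non-bracket characters are ignored.
// (Hypothetical reference solution, not the benchmark's own.)
func balanced(s string) bool {
	pairs := map[rune]rune{')': '(', ']': '[', '}': '{'}
	var stack []rune
	for _, r := range s {
		switch r {
		case '(', '[', '{':
			stack = append(stack, r)
		case ')', ']', '}':
			// A closer must match the most recent unclosed opener.
			if len(stack) == 0 || stack[len(stack)-1] != pairs[r] {
				return false
			}
			stack = stack[:len(stack)-1]
		}
	}
	return len(stack) == 0
}

// passes returns true only if the candidate agrees with every case.
// The cases are illustrative assumptions.
func passes(candidate func(string) bool) bool {
	cases := map[string]bool{
		"":       true,
		"([]{})": true,
		"(]":     false,
		"((":     false,
		"a(b)c":  true,
	}
	for in, want := range cases {
		if candidate(in) != want {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(passes(balanced))
}
```

A candidate that only counts brackets, without tracking their order on a stack, would accept "(]" and fail the task, which is the kind of slip a multi-step problem exposes.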

The result (24-task go_dev_bench)

Model               | Type                | pass@1
Qwen2.5-Coder-7B    | code base (untuned) | 19/24 · 79%
go-dev (our SFT)    | code base + Go SFT  | 18/24 · 75%
Qwen2.5-7B-Instruct | general (non-coder) | 15/24 · 63%
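The percentages and margins follow directly from the pass counts. The short Go sketch below recomputes them. The pass counts and the 24-task total come from the table, while the type and function names (result, percent) are assumptions made for this illustration. It is not the project's scoring code.

```go
package main

import (
	"fmt"
	"math"
)

// result holds one row of the table: tasks passed on the first attempt.
type result struct {
	model  string
	passed int
}

// totalTasks is the size of the benchmark reported above.
const totalTasks = 24

// percent rounds passed/total to the nearest whole percent.
func percent(passed, total int) int {
	return int(math.Round(100 * float64(passed) / float64(total)))
}

func main() {
	rows := []result{
		{"Qwen2.5-Coder-7B", 19},
		{"go-dev (our SFT)", 18},
		{"Qwen2.5-7B-Instruct", 15},
	}
	// The general-purpose model is the baseline for the margins.
	baseline := rows[2].passed
	for _, r := range rows {
		fmt.Printf("%-20s %2d/%d  %3d%%  %+d tasks vs general\n",
			r.model, r.passed, totalTasks,
			percent(r.passed, totalTasks), r.passed-baseline)
	}
}
```

With 24 tasks, each task is worth a little over 4 percentage points, so a one-task gap between two models moves the headline figure by about that much.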

Two clean conclusions

1. Code-specialized > general — robustly

Both code models clear the general one by a wide margin on the harder set (+4 and +3 tasks out of 24), where the general model stumbles on the multi-step tasks (counting, digit summing, bracket balancing). The GuildLM thesis, that a small model trained for the domain beats a general model of the same size, holds, and holds more convincingly the harder the task gets.

2. The win is the base, not our fine-tuning

On the hard tasks, the untuned base (19/24) edges our fine-tuned model (18/24) by a single task. Our ~200-example SFT, drawn from simpler functions, slightly degrades the base on the hardest problems (Roman numerals, RLE decoding), where the capability it acquired in pretraining matters most.

This is the central, hard-won finding of the whole arc, and it is not the one we set out to prove. For Go code generation, the lever that matters is choosing a code-specialized base over a general one. A few hundred fine-tuning examples do not improve an already-strong coder; they helped only on the one task the base was actually weak at: writing tests (Report #2).

So where does the real gain come from?

Two places, both bigger than per-role fine-tuning:

We chased per-role fine-tuning hard, measured it honestly, and it pointed us somewhere better. That's the point of keeping a public log: the data redirects the work. Next, we double down on the agent loop and on benchmarks harder still.

All three models measured locally on an M1 Max with Apple MLX — total cloud spend: $0. Code, datasets and benchmarks: github.com/guildlm/guild-code.