Report #3 · 2026-06-28

The guild, measured: three specialists, three lessons

All three Go specialists (dev, test, review) are now trained locally, at $0 cloud spend, and measured on objective, hidden-signal benchmarks. The result isn't a clean sweep, and we are reporting it exactly as measured: one clear win, one marginal win, and one result that taught us more by losing than a win would have.

The scoreboard

| Specialist | Benchmark | Tuned vs base | Verdict |
|---|---|---|---|
| go-test | mutation (bug-catch@1) | 8/13 vs 6/13 | clear win +2 |
| go-dev | generate (pass@1) | 15/19 vs 14/19 | marginal +1 (but +3 vs a general LLM) |
| go-review | identify (identify@1) | 6/8 vs 7/8 | lost −1 (base already strong) |
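As a quick check on the arithmetic, the deltas and pass rates can be recomputed from the raw counts. This is a minimal sketch; the `SCOREBOARD` dictionary and the `summarize` helper are illustrative names, and only the counts come from the scoreboard.

```python
from fractions import Fraction

# Counts copied from the scoreboard: (tuned passes, base passes, total tasks).
SCOREBOARD = {
    "go-test": (8, 6, 13),
    "go-dev": (15, 14, 19),
    "go-review": (6, 7, 8),
}


def summarize(scoreboard):
    """Return {specialist: (delta in tasks, tuned pass rate, base pass rate)}."""
    summary = {}
    for name, (tuned, base, total) in scoreboard.items():
        summary[name] = (tuned - base, Fraction(tuned, total), Fraction(base, total))
    return summary


if __name__ == "__main__":
    for name, (delta, tuned_rate, base_rate) in summarize(SCOREBOARD).items():
        print(f"{name}: {float(tuned_rate):.0%} vs {float(base_rate):.0%} ({delta:+d})")
```

Keeping the totals next to the deltas makes it easy to see that each verdict rests on a one- or two-task difference.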

Lesson 1 — specialization is strongest on disciplined tasks

The go-test model is the cleanest win (+2, see Report #2). Test-writing is a discipline (cover the edges, catch the bug) and the base isn't saturated there, so targeted training moves the needle. go-dev wins only marginally over its base, because the coder base is already a Go powerhouse from pretraining; the bigger gap (+3) is against a general model, which is the real GuildLM claim (see Report #1).
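To make "catch the bug" concrete, here is one plausible reading of a mutation-style score, as a toy sketch. It assumes a generated test counts only if it passes on the reference implementation and fails on a version with a seeded bug; the actual benchmark's rules are not spelled out in this report, and every name below is hypothetical. Plain Python functions stand in for Go code and `go test` runs.

```python
def passes(test, impl):
    """Run one test against one implementation; an AssertionError is a failure."""
    try:
        test(impl)
    except AssertionError:
        return False
    return True


def catches_bug(test, reference, mutant):
    """Assumed rule: the test passes on the reference and fails on the mutant."""
    return passes(test, reference) and not passes(test, mutant)


def bug_catch_count(cases):
    """Count (test, reference, mutant) triples where the seeded bug is caught."""
    return sum(1 for test, ref, mut in cases if catches_bug(test, ref, mut))


# Toy stand-ins for a Go function and a seeded bug.
def reference_abs(x):
    return x if x >= 0 else -x


def mutant_abs(x):
    return x  # seeded bug: the sign is never flipped


def edge_test(impl):
    assert impl(-3) == 3  # covers the negative edge, so it exposes the bug


def happy_path_test(impl):
    assert impl(3) == 3  # never exercises the buggy branch
```

Under this rule, `edge_test` scores and `happy_path_test` does not, which is the "cover the edges" discipline in miniature.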

Lesson 2 — measure every specialist on its own job (and follow the data even when it stings)

go-review scored below its base on the edit benchmark. Our first instinct was that the benchmark was wrong: editing (emit a corrected file) is not reviewing (identify the defect). So we built the right one — go_review_bench, which scores a review only if it names the real bug — and re-ran it. The honest result:

| Model | Review benchmark (identify@1) |
|---|---|
| Qwen2.5-Coder-7B (base) | 7/8 |
| go-review (tuned) | 6/8 |

It wasn't (only) the benchmark — the SFT genuinely didn't help. Even on its own task, the tuned model loses to the base. The coder base is already an excellent Go reviewer, and 73 review examples didn't add capability. We'd rather correct ourselves in public than leave a flattering-but-wrong claim standing.
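The scoring rule described above (a review counts only if it names the real bug) could be sketched as follows. This is an illustration, not the go_review_bench implementation: the keyword-matching rule, the `defect_markers` field, and both function names are assumptions made for the example.

```python
def names_real_bug(review, defect_markers):
    """Assumed rule: the review must mention at least one marker of the real defect."""
    text = review.lower()
    return any(marker.lower() in text for marker in defect_markers)


def identify_at_1(pairs):
    """Count (review, defect_markers) pairs where the single review names the bug."""
    return sum(1 for review, markers in pairs if names_real_bug(review, markers))


if __name__ == "__main__":
    pairs = [
        ("The loop has an off-by-one error on the slice bound.", ["off-by-one"]),
        ("Consider renaming the variable for clarity.", ["nil pointer"]),
    ]
    print(f"identify@1: {identify_at_1(pairs)}/{len(pairs)}")
```

A stylistic comment that misses the defect earns nothing under this rule, which is what separates it from an edit-style benchmark.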

Lesson 3 — SFT helps only where the base is weak

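Line the three up and the pattern tracks how much headroom the base had: go-test gains +2 (8/13 vs 6/13) where the base was weakest, go-dev gains +1 (15/19 vs 14/19) where the base was already strong, and go-review loses 1 (6/8 vs 7/8) where the base was strongest.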

So a few hundred SFT examples do not improve a strong coder base across the board; they help only on the task it is actually weak at. The real levers for "the world's best Go models" are: pick a coder base over a general one (measured here: it beats a general LLM by +3), fine-tune on the specific weak task, and, most of all, build the agent algorithm that wraps these models in a compile-and-test feedback loop. Brute-force per-role SFT is not it.
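The compile-and-test feedback loop mentioned above can be pictured with a small sketch. It is not the GuildLM agent algorithm, which this report does not describe. `generate` and `check` are hypothetical callables: in a real setup `generate` would call a model, and `check` might run `go build ./...` and `go test ./...` and return their output.

```python
def feedback_loop(generate, check, task, max_rounds=3):
    """Generate, check, and retry with the checker's output as feedback.

    Returns (candidate, rounds_used) on success, or (None, max_rounds) on failure.
    """
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        ok, output = check(candidate)
        if ok:
            return candidate, round_no
        feedback = output  # compiler or test errors go into the next attempt
    return None, max_rounds


if __name__ == "__main__":
    # Stub model: produces broken code until it sees feedback.
    def stub_generate(task, feedback):
        return "fixed code" if feedback else "broken code"

    # Stub checker: stands in for compiling and running the tests.
    def stub_check(candidate):
        if candidate == "fixed code":
            return True, ""
        return False, "compile error: undefined: x"

    print(feedback_loop(stub_generate, stub_check, "write a Go function"))
```

The point of the design is that the model is judged on what compiles and passes, and each failure feeds the next attempt, rather than on a single unverified generation.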

What's next

Every run, win or loss, reported here as it happens.

Three specialists trained and measured locally on an M1 Max with Apple MLX — total cloud spend: $0. Code, datasets and benchmarks: github.com/guildlm/guild-code.