The guild, measured: three specialists, three lessons
All three Go specialists (dev, test, review) are now trained (locally, at $0 cost) and measured on objective, hidden-signal benchmarks. The result isn't a clean sweep, and we're publishing it as measured: one clear win, one marginal win, and one loss that taught us more than a win would have.
The scoreboard
| Specialist | Benchmark | vs base | Verdict |
|---|---|---|---|
| go-test | mutation (bug-catch@1) | 8/13 vs 6/13 | clear win +2 |
| go-dev | generate (pass@1) | 15/19 vs 14/19 | marginal +1 (but +3 vs a general LLM) |
| go-review | identify (identify@1) | 6/8 vs 7/8 | lost −1 — base already strong |
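
The "vs base" column is plain arithmetic over pass counts. As a minimal sketch, the Go snippet below recomputes each delta and the matching pass rates from the numbers in the table. The `Result` type and its methods are illustrative names, not part of any GuildLM tooling.

```go
package main

import "fmt"

// Result holds pass counts copied from the scoreboard above.
// The type and field names are hypothetical, chosen for this sketch.
type Result struct {
	Specialist string
	Tuned      int // cases passed by the tuned specialist
	Base       int // cases passed by its base model
	Total      int // benchmark size
}

// Delta is tuned passes minus base passes (the "+2", "+1", "-1" column).
func (r Result) Delta() int { return r.Tuned - r.Base }

// Rate converts a pass count into a percentage of the benchmark size.
func (r Result) Rate(passed int) float64 {
	return 100 * float64(passed) / float64(r.Total)
}

var scoreboard = []Result{
	{"go-test", 8, 6, 13},
	{"go-dev", 15, 14, 19},
	{"go-review", 6, 7, 8},
}

func main() {
	for _, r := range scoreboard {
		fmt.Printf("%-10s tuned %2d/%-2d base %2d/%-2d delta %+d (%.1f%% vs %.1f%%)\n",
			r.Specialist, r.Tuned, r.Total, r.Base, r.Total,
			r.Delta(), r.Rate(r.Tuned), r.Rate(r.Base))
	}
}
```

With benchmarks of 8 to 19 cases, a single case moves the rate by several percentage points, which is why the verdicts are stated as case counts rather than percentages.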
Lesson 1 — specialization is strongest on disciplined tasks
The go-test model is the cleanest win (+2, see Report #2). Test-writing is a discipline (cover the edges, catch the bug), and the base isn't saturated there, so targeted training moves the needle. go-dev wins only marginally over its base, because the coder base is already a Go powerhouse from pretraining; the bigger gap (+3) is against a general model, which is the real GuildLM claim (see Report #1).
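To make "catch the bug" concrete, here is a minimal sketch of how a bug-catch@1 score could be computed. It assumes one common definition: a generated test counts only if it passes on the correct implementation and fails on a mutant with a seeded bug. That definition, and every name in the snippet (`Impl`, `Test`, `Case`, `catches`, `bugCatchAt1`), is an assumption for illustration and not the actual GuildLM harness, which would presumably run real `go test` processes.

```go
package main

import "fmt"

// Impl is the function under test (here: max of two ints).
type Impl func(a, b int) int

// Test reports whether the given implementation passes.
type Test func(impl Impl) bool

// Case pairs a correct implementation with a mutant carrying one seeded bug.
type Case struct {
	Reference Impl
	Mutant    Impl
}

// catches is true only when the test passes on the reference and fails on
// the mutant. A test that fails on both proves nothing, so it does not count.
func catches(t Test, c Case) bool {
	return t(c.Reference) && !t(c.Mutant)
}

// bugCatchAt1 scores one generated test per case and returns caught/total.
func bugCatchAt1(tests []Test, cases []Case) (caught, total int) {
	for i, c := range cases {
		if i < len(tests) && catches(tests[i], c) {
			caught++
		}
	}
	return caught, len(cases)
}

func maxRef(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// maxMutant flips the comparison, so it returns the smaller value.
func maxMutant(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// weakTest only checks equal inputs, which the mutant also gets right.
func weakTest(impl Impl) bool { return impl(2, 2) == 2 }

// edgeTest checks both argument orders and exposes the flipped comparison.
func edgeTest(impl Impl) bool { return impl(1, 2) == 2 && impl(2, 1) == 2 }

func main() {
	cases := []Case{{maxRef, maxMutant}, {maxRef, maxMutant}}
	caught, total := bugCatchAt1([]Test{weakTest, edgeTest}, cases)
	fmt.Printf("bug-catch@1: %d/%d\n", caught, total) // bug-catch@1: 1/2
}
```

The weak test passes on both implementations and earns nothing; only the test that covers the edge scores. That is the sense in which test-writing rewards discipline.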
Lesson 2 — measure every specialist on its own job (and follow the data even when it stings)
go-review scored below its base on the edit benchmark. Our first instinct was that the benchmark was wrong: editing (emit a corrected file) is not reviewing (identify the defect). So we built the right one, go_review_bench, which scores a review only if it names the real bug, and re-ran it. The honest result:
| Model | Review benchmark (identify@1) |
|---|---|
| Qwen2.5-Coder-7B (base) | 7/8 |
| go-review (tuned) | 6/8 ↓ |
It wasn't (only) the benchmark: the supervised fine-tuning (SFT) genuinely didn't help. Even on its own task, the tuned model loses to the base, 6/8 against 7/8. The coder base is already an excellent Go reviewer, and 73 review examples didn't add capability. We'd rather correct ourselves in public than leave a flattering-but-wrong claim standing.
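The rule "score a review only if it names the real bug" can be sketched as a simple matcher. The snippet below is an assumption about how such scoring might work, not the actual go_review_bench code: it treats a review as correct when its text contains any phrase from a per-case keyword list. `ReviewCase`, `namesDefect`, `identifyAt1`, and the sample cases are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ReviewCase describes one buggy snippet. Keywords is a hypothetical list of
// phrases, any one of which counts as naming the seeded defect.
type ReviewCase struct {
	Name     string
	Keywords []string
}

// namesDefect reports whether the review mentions any accepted phrase.
// Matching is case-insensitive substring search, a deliberately crude rule.
func namesDefect(review string, c ReviewCase) bool {
	text := strings.ToLower(review)
	for _, kw := range c.Keywords {
		if strings.Contains(text, strings.ToLower(kw)) {
			return true
		}
	}
	return false
}

// identifyAt1 scores one review per case and returns passed/total.
func identifyAt1(reviews []string, cases []ReviewCase) (passed, total int) {
	for i, c := range cases {
		if i < len(reviews) && namesDefect(reviews[i], c) {
			passed++
		}
	}
	return passed, len(cases)
}

func main() {
	cases := []ReviewCase{
		{Name: "unchecked-error", Keywords: []string{"error is ignored", "unchecked error"}},
		{Name: "slice-bounds", Keywords: []string{"off-by-one", "out of range"}},
	}
	reviews := []string{
		"The returned error is ignored on line 12.",  // names the defect
		"Consider renaming the variable for clarity.", // plausible, but misses it
	}
	passed, total := identifyAt1(reviews, cases)
	fmt.Printf("identify@1: %d/%d\n", passed, total) // identify@1: 1/2
}
```

A review that sounds reasonable but never names the seeded defect scores zero, which is what separates this benchmark from one that rewards an edited file.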
Lesson 3 — SFT helps only where the base is weak
Line the three up and the real pattern is unmistakable:
- go-test: base 6/13, tuned 8/13. The base is weakest at test-writing, so SFT helps most (+2).
- go-dev: base 14/19, tuned 15/19. Strong base, marginal gain (+1).
- go-review: base 7/8, tuned 6/8. The base is already a great reviewer, so SFT is neutral-to-harmful (−1).
So a few hundred SFT examples do not blanket-improve a strong coder base; they help only on the task where it is actually weak. The real levers for "the world's best Go models" are: pick a coder base over a general one (measured: it beats a general LLM by +3), fine-tune on the specific weak task, and, most of all, build the agent algorithm that wraps these models in a compile-and-test feedback loop. Brute-force per-role SFT is not it.
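A compile-and-test feedback loop can be sketched in a few lines. The version below is only an illustration under stated assumptions: the model is abstracted as a `Generator` function, the compiler and test run as a `Checker` function, and failures are fed back into the next attempt. None of these names come from GuildLM, and a real checker would shell out to `go build` and `go test` rather than use the stubs shown in `main`.

```go
package main

import "fmt"

// Generator stands in for a model call: given the task and the feedback from
// the previous failed attempt (empty on the first try), it returns code.
type Generator func(task, feedback string) string

// Checker stands in for compile + test: it reports success, or feedback
// (compiler errors, failing test output) to hand back to the generator.
type Checker func(code string) (ok bool, feedback string)

// solve retries generation until the checker passes or attempts run out.
func solve(task string, gen Generator, check Checker, maxAttempts int) (code string, attempts int, ok bool) {
	feedback := ""
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		code = gen(task, feedback)
		ok, feedback = check(code)
		if ok {
			return code, attempts, true
		}
	}
	return code, maxAttempts, false
}

// stubGen returns a buggy draft first and a fixed one once it sees feedback.
func stubGen(task, feedback string) string {
	if feedback == "" {
		return "draft-with-bug"
	}
	return "fixed"
}

// stubCheck accepts only the fixed version.
func stubCheck(code string) (bool, string) {
	if code == "fixed" {
		return true, ""
	}
	return false, "test failed: expected 2, got 1"
}

func main() {
	code, attempts, ok := solve("implement max", stubGen, stubCheck, 3)
	fmt.Println(code, attempts, ok) // fixed 2 true
}
```

Only the loop matters here: the second attempt succeeds because the first failure is passed back as input.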
What's next
- Build an objective go-review benchmark — buggy code in, and score whether the review names the real defect — so go-review gets a fair test.
- Close the shared gaps every model still misses (Unicode-rune handling, JSON round-trips).
- Wire the measured specialists into the agent loop (the GuildLM "builder"), where dev writes, test tests, and review checks.
Every run, win or loss, reported here as it happens.