Where specialization actually pays: the go-test model
In Report #1, the go-developer specialist barely edged out its already-strong base model. Today's go-test specialist tells a different, more interesting story: on a benchmark that scores tests by whether they actually catch bugs, the specialist beats the base by a clear margin. Specialization pays, but not equally for every task.
The result
| Model | Type | bug-catch@1 |
|---|---|---|
| go-test ★ | 7B coder + test SFT (ours) | 8/13 · 62% |
| Qwen2.5-Coder-7B | 7B code base (untuned) | 6/13 · 46% |
That is +2 tasks out of 13 (8 vs. 6), or about 33% relative to the base's score (2/6). The test-writing specialist writes tests that catch bugs the base's tests miss. This is a clearer win than code generation, where the base is already near its ceiling.
What "bug-catch@1" means — mutation testing
Judging a test by reading it is subjective. So we score tests the way a strict reviewer would — by mutation testing. For each task the model is asked to write a test for a function, and the test earns a point only if it does both:
- passes against the correct implementation (the test is valid), and
- fails against a planted buggy mutant (the test actually catches the bug).
A test that trivially passes everything — common from weaker models — scores zero, because it never catches the mutant. This is an objective, reproducible signal: no human, no LLM judge, just the Go toolchain run twice per task.
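The scoring rule can be sketched in a few lines of Go. This is an in-process model of the rule only, not the benchmark harness: the real benchmark runs the Go toolchain against actual test files, whereas here a "test" is just a function that returns true when its assertions pass. The `abs` task, the mutant, and every name below are hypothetical examples.

```go
package main

import "fmt"

// impl is the function under test. check stands in for a model-written
// test: it returns true when all of its assertions pass against f.
type impl func(int) int
type check func(f impl) bool

// absCorrect is the reference implementation (hypothetical task).
func absCorrect(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// absMutant is the planted bug: negative inputs come back unchanged.
func absMutant(x int) int { return x }

// bugCatch awards the point only if the test passes on the correct
// implementation AND fails on the mutant.
func bugCatch(t check, correct, mutant impl) bool {
	return t(correct) && !t(mutant)
}

// trivialTest only checks a non-negative input, so it passes on both
// implementations and never catches the mutant.
func trivialTest(f impl) bool { return f(3) == 3 }

// boundaryTest also asserts on a negative input, which exposes the mutant.
func boundaryTest(f impl) bool { return f(3) == 3 && f(-3) == 3 }

// invalidTest asserts something false about the correct implementation,
// so it is rejected as an invalid test.
func invalidTest(f impl) bool { return f(-3) == -3 }

func main() {
	fmt.Println("trivial: ", bugCatch(trivialTest, absCorrect, absMutant))  // false
	fmt.Println("boundary:", bugCatch(boundaryTest, absCorrect, absMutant)) // true
	fmt.Println("invalid: ", bugCatch(invalidTest, absCorrect, absMutant))  // false
}
```

Only `boundaryTest` scores: the trivial test fails the second condition and the invalid test fails the first.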
The lesson: specialization is task-dependent
Put Report #1 and #2 side by side and a pattern appears:
- Code generation (go-dev): the coder base is already excellent — Go was a huge part of its pretraining — so a few hundred SFT examples move it only marginally (+1 task).
- Test writing (go-test): a narrower discipline — cover the edge cases, assert the boundaries, actually catch the bug. The base is not saturated here, so targeted SFT helps more (+2 tasks, and a bigger relative jump).
This is the real shape of the GuildLM bet, sharpened by data: a small specialist's edge is biggest on focused, disciplined tasks such as test writing and code review, and smallest on raw generation, where a strong base model already shines. Build the guild where it pays.
The evaluation trifecta
Every GuildLM Go specialist is now measured by an objective, hidden-signal benchmark:
- go_dev_bench (19): generate code, scored by a hidden test.
- go_edit_bench (8): fix flawed code, scored by a hidden test.
- go_test_bench (13): write tests, scored by mutation (pass-correct + fail-mutant).
Next up: the go-review specialist on the edit benchmark, and pushing every model past the shared gaps the base still misses. Reported here, win or loss, as we go.