Report #2 · 2026-06-28

Where specialization actually pays: the go-test model

In Report #1 the go-developer specialist barely edged out its already-strong code base. Today's go-test specialist tells a different, more interesting story: on a benchmark that scores tests by whether they actually catch bugs, the specialist beats the base 8 tasks to 6 out of 13. Specialization pays, but not equally for every task.

The result

Model	Type	bug-catch@1
go-test ★	7B coder + test SFT (ours)	8/13 · 62%
Qwen2.5-Coder-7B	7B code base (untuned)	6/13 · 46%

+2 tasks: about 15 percentage points absolute, or ~33% relative (8 vs. 6). The test-writing specialist writes tests that catch bugs the base's tests miss. That is a clearer win than code generation, where the base is already near its ceiling.
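For readers who want to check the arithmetic behind the table, here is a minimal sketch. The helper names `rate` and `relativeGain` are illustrative only and are not taken from the GuildLM code; the inputs are the 8/13 and 6/13 scores reported above.

```go
package main

import "fmt"

// rate returns solved/total as a fraction between 0 and 1.
func rate(solved, total int) float64 {
	return float64(solved) / float64(total)
}

// relativeGain returns the specialist's improvement over the base,
// expressed as a fraction of the base's own score.
func relativeGain(specialist, base int) float64 {
	return float64(specialist-base) / float64(base)
}

func main() {
	const total = 13
	const specialist, base = 8, 6

	fmt.Printf("specialist: %.1f%%\n", 100*rate(specialist, total)) // 61.5%
	fmt.Printf("base:       %.1f%%\n", 100*rate(base, total))       // 46.2%
	fmt.Printf("absolute:   +%d tasks (%.1f points)\n",
		specialist-base, 100*(rate(specialist, total)-rate(base, total))) // +2 tasks (15.4 points)
	fmt.Printf("relative:   +%.1f%%\n", 100*relativeGain(specialist, base)) // +33.3%
}
```

The relative figure is measured against the base's 6 solved tasks, which is why 2 extra tasks reads as roughly a third.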

What "bug-catch@1" means — mutation testing

Judging a test by reading it is subjective. So we score tests the way a strict reviewer would — by mutation testing. For each task the model is asked to write a test for a function, and the test earns a point only if it does both:

passes against the correct implementation (the test is valid), and
fails against a planted buggy mutant (the test actually catches the bug).

A test that trivially passes everything — common from weaker models — scores zero, because it never catches the mutant. This is an objective, reproducible signal: no human, no LLM judge, just the Go toolchain run twice per task.

The lesson: specialization is task-dependent

Put Report #1 and #2 side by side and a pattern appears:

Code generation (go-dev): the coder base is already excellent — Go was a huge part of its pretraining — so a few hundred SFT examples move it only marginally (+1 task).
Test writing (go-test): a narrower discipline — cover the edge cases, assert the boundaries, actually catch the bug. The base is not saturated here, so targeted SFT helps more (+2 tasks, and a bigger relative jump).

This is the real shape of the GuildLM bet, sharpened by data: a small specialist's edge is biggest on the focused, disciplined tasks — test writing, code review — and smallest on raw generation where a strong code base already shines. Build the guild where it pays.

The evaluation trifecta

Every GuildLM Go specialist is now measured by an objective, hidden-signal benchmark:

go_dev_bench (19) — generate code, scored by a hidden test.
go_edit_bench (8) — fix flawed code, scored by a hidden test.
go_test_bench (13) — write tests, scored by mutation (pass-correct + fail-mutant).

Next up: the go-review specialist on the edit benchmark, and closing the shared gaps that the base still leaves open. Reported here, win or loss, as we go.

Trained and measured locally on an M1 Max with Apple MLX — total cloud spend: $0. Code, datasets and benchmarks: github.com/guildlm/guild-code.