Report #2 · 2026-06-28

Where specialization actually pays: the go-test model

In Report #1 the go-developer specialist barely edged out its already-strong base code model. Today's go-test specialist tells a different, more interesting story: on a benchmark that scores tests by whether they actually catch bugs, the specialist beats the base by a clear margin. Specialization pays, but not equally for every task.

The result

Model             Type                        bug-catch@1
go-test           7B coder + test SFT (ours)  8/13 · 62%
Qwen2.5-Coder-7B  7B code base (untuned)      6/13 · 46%

+2 tasks out of 13 (8 vs. 6), about 33% relative. The test-writing specialist writes tests that catch bugs the base's tests miss. That is a clearer win than in code generation, where the base is already near its ceiling.
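The headline figures can be re-derived from the raw counts in the table. The sketch below is only a sanity check of that arithmetic. The helper names `PassRate` and `RelativeGain` are made up for illustration and are not taken from the project's code.

```go
package main

import "fmt"

// PassRate returns caught/total as a fraction in [0, 1].
// Hypothetical helper, not part of the published benchmark code.
func PassRate(caught, total int) float64 {
	return float64(caught) / float64(total)
}

// RelativeGain returns (specialist - base) / base, computed on raw
// task counts over the same task set.
func RelativeGain(specialist, base int) float64 {
	return float64(specialist-base) / float64(base)
}

func main() {
	const total = 13 // tasks in the benchmark, per the table
	specialist, base := 8, 6

	fmt.Printf("go-test:  %d/%d = %.0f%%\n", specialist, total, 100*PassRate(specialist, total))
	fmt.Printf("base:     %d/%d = %.0f%%\n", base, total, 100*PassRate(base, total))
	fmt.Printf("absolute: %+d tasks\n", specialist-base)
	fmt.Printf("relative: %.0f%%\n", 100*RelativeGain(specialist, base))
}
```

The relative gain is taken over the base's count (2/6), which is where the roughly 33% comes from. The gap in percentage points is smaller, at about 15.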

What "bug-catch@1" means — mutation testing

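Judging a test by reading it is subjective. So we score tests the way a strict reviewer would: by mutation testing. For each task the model is asked to write a test for a function, and the test earns a point only if it does both: it passes against the correct implementation, and it fails against a mutant, a copy of the function with a deliberately injected bug.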

A test that trivially passes everything — common from weaker models — scores zero, because it never catches the mutant. This is an objective, reproducible signal: no human, no LLM judge, just the Go toolchain run twice per task.
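A minimal sketch of how such a score could be computed, under loudly stated assumptions: each task is taken to provide two directories, one holding the correct function and one holding the mutant, each with the generated test copied in. The directory layout and all type and function names here are hypothetical. The actual harness in the repository may work differently.

```go
package main

import (
	"fmt"
	"os/exec"
)

// TaskResult records the two toolchain runs for one task.
// (Hypothetical type, for illustration only.)
type TaskResult struct {
	PassesOnOriginal bool // generated test passes against the correct function
	PassesOnMutant   bool // generated test still passes against the mutant
}

// CatchesBug reports whether the test earns a point:
// green on the original, red on the mutant.
func (r TaskResult) CatchesBug() bool {
	return r.PassesOnOriginal && !r.PassesOnMutant
}

// goTestPasses runs `go test ./...` in dir and reports whether it
// exited with status 0. A test that fails to compile counts as failing.
func goTestPasses(dir string) bool {
	cmd := exec.Command("go", "test", "./...")
	cmd.Dir = dir
	return cmd.Run() == nil
}

// ScoreTask runs the Go toolchain twice: once against the original
// and once against the mutant. Both paths are assumed inputs.
func ScoreTask(originalDir, mutantDir string) TaskResult {
	return TaskResult{
		PassesOnOriginal: goTestPasses(originalDir),
		PassesOnMutant:   goTestPasses(mutantDir),
	}
}

// BugCatchAt1 counts the tasks whose test caught the mutant.
func BugCatchAt1(results []TaskResult) (caught, total int) {
	for _, r := range results {
		if r.CatchesBug() {
			caught++
		}
	}
	return caught, len(results)
}

func main() {
	// Canned results instead of real runs, so the example needs no fixtures.
	results := []TaskResult{
		{PassesOnOriginal: true, PassesOnMutant: false},  // catches the bug
		{PassesOnOriginal: true, PassesOnMutant: true},   // trivially passes everything
		{PassesOnOriginal: false, PassesOnMutant: false}, // broken test
	}
	caught, total := BugCatchAt1(results)
	fmt.Printf("bug-catch@1: %d/%d\n", caught, total) // prints "bug-catch@1: 1/3"
}
```

Requiring a pass on the original is what stops a test that always fails from scoring. Requiring a failure on the mutant is what stops a test that always passes.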

The lesson: specialization is task-dependent

Put Report #1 and #2 side by side and a pattern appears.

This is the real shape of the GuildLM bet, sharpened by data: a small specialist's edge is biggest on the focused, disciplined tasks — test writing, code review — and smallest on raw generation where a strong code base already shines. Build the guild where it pays.

The evaluation trifecta

Every GuildLM Go specialist is now measured by an objective, hidden-signal benchmark:

Next up: the go-review specialist on the edit benchmark, and pushing every model past the shared gaps, the tasks the base still misses. Reported here, win or loss, as we go.

Trained and measured locally on an M1 Max with Apple MLX — total cloud spend: $0. Code, datasets and benchmarks: github.com/guildlm/guild-code.