Research Log
Every experiment, step by step — what we trained, how we measured it, and what the numbers actually said (wins and losses). All runs are local, $0, and reproducible.
2026-06-29 · Report #6
Data was the lever: grounding beats the SFT at the project level
We ran the base model and the specialist through the same agent loop on a real HTTP service, and both
stalled at 2/3. The fix wasn't a bigger model or the fine-tune; it was two verified retrieval examples
(→ 3/3). The corpus, not the algorithm, was the bottleneck.
2026-06-29 · Report #5
The algorithm runs: specialists + agent loop = a green backend
Two trained Go specialists served live ($0, MLX), wired into the Builder with role-routing: one
writes the code, another the tests. It builds a verified backend on the first try (score 3/3). A second
spec exposes the real gap: the base model's knowledge, not the scaffolding. Plus two shipped upgrades.
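Role-routing of this kind can be pictured as a small dispatch table. The sketch below is illustrative only: the `Specialist` type, the `route` function, the role keys "code" and "test", and the stub generators are all hypothetical, and mapping go-dev to code and go-test to tests is an assumption rather than the Builder's documented wiring.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Specialist:
    """A named model endpoint; `generate` stands in for a real inference call."""
    name: str
    generate: Callable[[str], str]


def route(role: str, prompt: str, specialists: Dict[str, Specialist]) -> str:
    """Send the prompt to the specialist registered for this role."""
    if role not in specialists:
        raise KeyError(f"no specialist registered for role {role!r}")
    return specialists[role].generate(prompt)


# Stub generators (hypothetical) so the sketch runs without a model server.
SPECIALISTS = {
    "code": Specialist("go-dev", lambda p: f"[go-dev] {p}"),
    "test": Specialist("go-test", lambda p: f"[go-test] {p}"),
}

if __name__ == "__main__":
    print(route("code", "implement the handler", SPECIALISTS))
    print(route("test", "write tests for the handler", SPECIALISTS))
```

Keeping the mapping in one table means a specialist can be swapped for its base model by changing a single entry, which is the kind of base-vs-specialist comparison the later reports run.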
2026-06-28 · Report #4
On a harder benchmark, the picture sharpens
We made the benchmark harder to cut noise. Clean result: a code-specialized 7B beats a general one
robustly (+3/+4) — but the win is the base, not our fine-tuning. The lever for code-gen is base
choice + the agent loop.
2026-06-28 · Report #3
The guild, measured: three specialists, three lessons
All three Go specialists trained and benchmarked, $0. One clear win (go-test +2), one marginal
(go-dev), and one honest task-mismatch (go-review) that taught us to measure each specialist on its
own job.
2026-06-28 · Report #2
Where specialization actually pays: the go-test model
The go-test specialist beats its base at catching planted bugs (8/13 vs 6/13) on a mutation-testing
benchmark — a clearer win than code generation, and a lesson in where small-model specialization pays.
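To put the two scores on a common scale, here is a minimal sketch that assumes the benchmark score is simply planted bugs caught divided by planted bugs total. The scoring rule is not spelled out in this entry, so `mutation_score` is a hypothetical helper, not the benchmark's own code.

```python
from fractions import Fraction


def mutation_score(killed: int, total: int) -> Fraction:
    """Fraction of planted bugs (mutants) that the generated tests caught."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= killed <= total:
        raise ValueError("killed must be between 0 and total")
    return Fraction(killed, total)


# Counts taken from the entry above; the formula itself is an assumption.
specialist = mutation_score(8, 13)  # go-test specialist
base = mutation_score(6, 13)        # its base model

if __name__ == "__main__":
    print(f"specialist: {float(specialist):.1%}")
    print(f"base:       {float(base):.1%}")
    print(f"difference: {specialist - base} of the planted bugs")
```

Under that assumption the specialist catches about 61.5% of the planted bugs against about 46.2% for the base, a gap of 2 out of 13. With only 13 mutants, each one moves the score by nearly eight points.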
2026-06-28 · Report #1
The first GuildLM Go specialist, measured honestly
Trained a go-developer 7B locally for $0. On a 19-task hidden-test benchmark it beats a general 7B
(15/19 vs 12/19) — the thesis validated. Plus an honest negative: the 14B base did worse, not better.