Research Log

Every experiment, step by step: what we trained, how we measured it, and what the numbers actually said, wins and losses alike. All runs are local, cost $0, and are reproducible.

2026-06-29 · Report #6

Data was the lever: grounding beats the SFT at the project level

We ran the base model and the specialist through the same loop on a real HTTP service, and both stall at 2/3. The fix wasn't a bigger model or the fine-tune; it was two verified retrieval examples, which took the score to 3/3. The corpus, not the algorithm, was the bottleneck.

2026-06-29 · Report #5

The algorithm runs: specialists + agent loop = a green backend

Two trained Go specialists, served live on MLX for $0, are wired into the Builder with role-routing: one writes the code, the other the tests. Together they build a verified backend on the first try (score 3/3). A second spec exposes the real gap: the base model's knowledge, not the scaffolding. Plus two shipped upgrades.
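
As a rough illustration of role-routing, the sketch below dispatches each request to a specialist by role. It is a minimal sketch under stated assumptions, not the Builder's actual code: `make_router`, the role names `"code"` and `"test"`, and the two stand-in functions are all hypothetical, and a real setup would call the served models instead of returning placeholder strings.

```python
from typing import Callable, Dict

# A specialist is modeled as any function from prompt text to generated text.
Specialist = Callable[[str], str]


def make_router(specialists: Dict[str, Specialist]) -> Callable[[str, str], str]:
    """Build a route(role, prompt) function over a role -> specialist table."""

    def route(role: str, prompt: str) -> str:
        try:
            handler = specialists[role]
        except KeyError:
            raise ValueError(f"no specialist registered for role {role!r}") from None
        return handler(prompt)

    return route


# Hypothetical stand-ins for the two served specialists.
def fake_go_dev(prompt: str) -> str:
    return f"// code for: {prompt}"


def fake_go_test(prompt: str) -> str:
    return f"// tests for: {prompt}"


route = make_router({"code": fake_go_dev, "test": fake_go_test})
print(route("code", "GET /health handler"))
print(route("test", "GET /health handler"))
```

Keeping the table explicit means an unknown role fails loudly rather than silently falling back to the wrong model.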

2026-06-28 · Report #4

On a harder benchmark, the picture sharpens

We made the benchmark harder to cut noise. Clean result: a code-specialized 7B robustly beats a general one (+3/+4), but the win comes from the base model, not our fine-tuning. The lever for code-gen is base choice + the agent loop.

2026-06-28 · Report #3

The guild, measured: three specialists, three lessons

All three Go specialists trained and benchmarked, $0. One clear win (go-test +2), one marginal (go-dev), and one honest task-mismatch (go-review) that taught us to measure each specialist on its own job.

2026-06-28 · Report #2

Where specialization actually pays: the go-test model

The go-test specialist beats its base at catching planted bugs (8/13 vs 6/13) on a mutation-testing benchmark — a clearer win than code generation, and a lesson in where small-model specialization pays.
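
To make the scoring concrete, here is a minimal sketch of how a mutation-testing score like the one above could be tallied. Only the 8/13 and 6/13 totals come from the report; the `mutation_score` function and the per-bug boolean lists are illustrative assumptions, not the benchmark's actual harness.

```python
from fractions import Fraction


def mutation_score(caught):
    """Return (killed, total, rate) for a list of booleans, one per planted bug."""
    killed = sum(1 for c in caught if c)
    total = len(caught)
    if total == 0:
        raise ValueError("need at least one planted bug")
    return killed, total, Fraction(killed, total)


# Hypothetical per-bug outcomes, chosen only to reproduce the reported totals.
specialist = [True] * 8 + [False] * 5
base = [True] * 6 + [False] * 7

s_killed, s_total, s_rate = mutation_score(specialist)
b_killed, b_total, b_rate = mutation_score(base)

print(f"go-test: {s_killed}/{s_total} = {float(s_rate):.1%}")
print(f"base:    {b_killed}/{b_total} = {float(b_rate):.1%}")
print(f"gap:     {float(s_rate - b_rate):+.1%}")
```

With 13 planted bugs, each extra catch moves the rate by about 7.7 percentage points, so a two-bug gap is worth reading alongside the raw counts.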

2026-06-28 · Report #1

The first GuildLM Go specialist, measured honestly

Trained a go-developer 7B locally for $0. On a 19-task hidden-test benchmark it beats a general 7B (15/19 vs 12/19) — the thesis validated. Plus an honest negative: the 14B base did worse, not better.