The first GuildLM Go specialist, measured honestly
Today we trained the first dedicated GuildLM go-developer model — entirely locally on an Apple M1 Max, for $0, no cloud GPU — and put it on an objective, hidden-test benchmark. The headline: a Go-specialized 7B beats a general 7B by a clear margin. The fine print, which we report in full because honesty is the point: beating an already code-specialized base is hard, and going bigger (14B) made things worse, not better.
The leaderboard
Benchmark: go_dev_bench — 19 held-out "spec → Go function" tasks, each with a
hidden test. A task counts only if the model's own generated code compiles and
passes its test (pass@1, scored with the real Go toolchain, no judge).
| Model | Type | pass@1 |
|---|---|---|
| go-dev v1 ★ | 7B coder + Go SFT (ours) | 15/19 · 79% |
| Qwen2.5-Coder-7B | 7B code base (untuned) | 14/19 · 74% |
| Qwen2.5-Coder-14B | 14B code base (4-bit) | 13/19 · 68% ↓ |
| Qwen2.5-7B-Instruct | general 7B (non-coder) | 12/19 · 63% |
| go-dev 14B | 14B coder + Go SFT | 11/19 · 58% ↓↓ |
The thesis holds. A Go-specialized 7B (15/19) beats a general 7B (12/19) by 3 tasks, cutting the failure count from 7 to 4. That supports the GuildLM bet that, for a targeted job, a small specialist beats a generalist. Note that this comparison is at equal size (7B vs 7B); a larger general model was not part of this run.
How it was built — $0, on a laptop
- Data: ~210 Go examples authored by a teacher model and compile-verified with the real Go toolchain (every example actually builds). Quality over scraped volume.
- Training: LoRA fine-tune of Qwen2.5-Coder-7B (4-bit) via Apple MLX on an M1 Max: 400 iterations, train loss 0.72 → 0.21. No cloud, no bill.
- Evaluation: a custom harness runs each benchmark prompt through the model and scores the output by actually running go test against a hidden test. Objective, reproducible, no human or LLM judge.
The honest findings
1. Specialist > general LLM (the win)
The Go-tuned model fixes tasks the general model can't — case-aware title-casing, element counting, digit summing with proper error wrapping. This is exactly where Go-specific training pays off.
2. Beating a code base with a little SFT is hard (the nuance)
Our baseline, Qwen2.5-Coder-7B, is itself a strong Go model: it was pretrained
on huge amounts of Go. A ~210-example LoRA nudges it from 14 → 15/19: real, but marginal and near the
noise floor. Most of the specialist's edge over a general model (12 → 15) comes from
choosing a code-specialized base (+2), with the Go SFT adding the final point (+1).
3. Bigger was worse — the 14B negative result
We expected the 14B base to have more headroom. It didn't: at 4-bit it scored below the 7B (13 vs 14/19), and the same fine-tuning that helped the 7B actively hurt the 14B (11/19). Our working explanation, not yet isolated experimentally: 4-bit quantization may cost the 14B more than the 7B, and LoRA hyperparameters tuned on the 7B do not transfer to the 14B for free. The 7B coder is the sweet spot, for now.
What's measured next
We now have an objective evaluation trifecta, all hidden-test and reproducible:
- go_dev_bench: writing code (19 tasks).
- go_edit_bench: editing flawed code (8 tasks).
- go_test_bench: writing tests, scored by mutation testing. A test counts only if it passes the correct code and catches a planted bug.
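The mutation-testing rule for go_test_bench can be illustrated with a toy sketch. Everything here is hypothetical (the `impl` and `check` types, the rune-counting example, the planted bug); the real benchmark presumably runs go test against two versions of the code rather than calling function values.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// impl is the function under test; check is a model-written test that
// returns true when the implementation it is given passes.
type impl func(s string) int
type check func(f impl) bool

// runeCount is the reference implementation.
func runeCount(s string) int { return utf8.RuneCountInString(s) }

// byteCount is the mutant: the planted bug counts bytes instead of runes.
func byteCount(s string) int { return len(s) }

// credited applies the scoring rule: the test must pass on the correct
// code AND fail on the mutant. A test that accepts both earns nothing.
func credited(test check, correct, mutant impl) bool {
	return test(correct) && !test(mutant)
}

func main() {
	// ASCII-only input cannot tell bytes from runes, so the bug slips through.
	asciiOnly := func(f impl) bool { return f("abc") == 3 }
	// A non-ASCII input exposes the bug: 5 runes but 6 bytes.
	withUnicode := func(f impl) bool { return f("abc") == 3 && f("h\u00e9llo") == 5 }

	fmt.Println(credited(asciiOnly, runeCount, byteCount))   // false
	fmt.Println(credited(withUnicode, runeCount, byteCount)) // true
}
```

The design point is that a test which merely passes is not enough: it has to be able to fail for the right reason.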
Next: a go-test and go-review specialist on those benchmarks, closing the shared gaps the base still misses (Unicode-rune handling, JSON round-trips), and trying a higher-precision base to see how much of the ceiling is quantization. Every run, win or loss, gets reported here.