Report #1 · 2026-06-28

The first GuildLM Go specialist, measured honestly

Today we trained the first dedicated GuildLM go-developer model, entirely locally on an Apple M1 Max, for $0, with no cloud GPU, and put it on an objective, hidden-test benchmark. The headline: a Go-specialized 7B beats a general 7B, 15/19 tasks to 12/19. The fine print, which we report in full because honesty is the point: beating an already code-specialized base is hard, and going bigger (14B) made things worse, not better.

The leaderboard

Benchmark: go_dev_bench — 19 held-out "spec → Go function" tasks, each with a hidden test. A task counts only if the model's own generated code compiles and passes its test (pass@1, scored with the real Go toolchain, no judge).

Model	Type	pass@1
go-dev v1 ★	7B coder + Go SFT (ours)	15/19 · 79%
Qwen2.5-Coder-7B	7B code base (untuned)	14/19 · 74%
Qwen2.5-Coder-14B	14B code base (4-bit)	13/19 · 68% ↓
Qwen2.5-7B-Instruct	general 7B (non-coder)	12/19 · 63%
go-dev 14B	14B coder + Go SFT	11/19 · 58% ↓↓

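The thesis holds. A Go-specialized 7B (15/19) beats a general 7B (12/19) by 3 tasks, cutting the failure count from 7 to 4. That supports the GuildLM bet that, for a targeted job, a small specialist beats a generalist. The comparison here is at equal size (7B vs 7B); the only larger models on the board are code models, so the "bigger generalist" half of the bet is not yet measured.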

How it was built — $0, on a laptop

Data: ~210 Go examples authored by a teacher model and compile-verified with the real Go toolchain (every example actually builds). Quality over scraped volume.
Training: LoRA fine-tune of Qwen2.5-Coder-7B (4-bit) via Apple MLX on an M1 Max — 400 iterations, train loss 0.72 → 0.21. No cloud, no bill.
Evaluation: a custom harness runs each benchmark prompt through the model and scores the output by actually running go test against a hidden test — objective, reproducible, no human or LLM judge.
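The scoring step described above can be sketched as follows. This is a minimal illustration and not the actual GuildLM harness: the names scoreTask and passRate, the file names, the module name, and the 60-second timeout are all assumptions made for the example. It writes the model's code and the hidden test into a scratch module, runs go test there, and treats any non-zero exit (compile error or failing test) as a failed task.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// scoreTask reports whether candidate passes hiddenTest under `go test`.
// Both arguments are Go source text for the same package (hypothetical layout).
func scoreTask(candidate, hiddenTest string) (bool, error) {
	dir, err := os.MkdirTemp("", "task-*")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	files := map[string]string{
		"go.mod":           "module task\n\ngo 1.20\n",
		"solution.go":      candidate,
		"solution_test.go": hiddenTest,
	}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			return false, err
		}
	}

	// Bound the run so a generated infinite loop cannot hang the benchmark.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, "go", "test", "./...")
	cmd.Dir = dir

	if err := cmd.Run(); err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// go test ran and exited non-zero: build failure or failing test.
			return false, nil
		}
		// The toolchain itself could not be started.
		return false, err
	}
	return true, nil
}

// passRate converts a pass count into a percentage.
func passRate(passed, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(passed) / float64(total)
}

func main() {
	// A toy task standing in for a real benchmark item.
	candidate := `package task

func Add(a, b int) int { return a + b }
`
	hidden := `package task

import "testing"

func TestAdd(t *testing.T) {
	if Add(2, 3) != 5 {
		t.Fatal("want 5")
	}
}
`
	ok, err := scoreTask(candidate, hidden)
	if err != nil {
		fmt.Println("could not run go test:", err)
		return
	}
	fmt.Println("passed:", ok)
	fmt.Printf("15/19 = %.0f%%\n", passRate(15, 19))
}
```

Because the only signal is the exit status of go test, code that does not compile and code that compiles but fails the test are scored identically, which matches the pass@1 rule stated for go_dev_bench.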

The honest findings

1. Specialist > general LLM (the win)

The Go-tuned model fixes tasks the general model can't — case-aware title-casing, element counting, digit summing with proper error wrapping. This is exactly where Go-specific training pays off.
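To make "digit summing with proper error wrapping" concrete, here is a hypothetical task of that kind. It is not taken from go_dev_bench (whose tests are hidden); the names SumDigits and ErrNotDigit are invented for illustration. The point is the Go idiom a hidden test would typically check: a sentinel error wrapped with %w so callers can match it with errors.Is.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotDigit is a sentinel error callers can match with errors.Is.
var ErrNotDigit = errors.New("not a decimal digit")

// SumDigits adds up the ASCII decimal digits in s.
// Any other character aborts the sum with a wrapped ErrNotDigit.
func SumDigits(s string) (int, error) {
	sum := 0
	for i, r := range s { // ranging over a string yields runes, not bytes
		if r < '0' || r > '9' {
			return 0, fmt.Errorf("SumDigits: %q at byte offset %d: %w", r, i, ErrNotDigit)
		}
		sum += int(r - '0')
	}
	return sum, nil
}

func main() {
	n, err := SumDigits("2026")
	fmt.Println(n, err) // prints: 10 <nil>

	_, err = SumDigits("20x6")
	fmt.Println(errors.Is(err, ErrNotDigit)) // prints: true
}
```

A solution that returned a plain fmt.Errorf without %w, or a bare string error, would compute the same sums but fail a test that checks errors.Is.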

2. Beating a code base with a little SFT is hard (the nuance)

Our baseline, Qwen2.5-Coder-7B, is itself a strong Go model: it was pretrained on huge amounts of Go. A ~210-example LoRA nudges it from 14/19 to 15/19, which is real but marginal and near the noise floor. Most of the specialist's edge over a general model (12 → 15) comes from choosing a code base (+2), with the Go SFT adding the final point (+1).
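A rough way to see why a one-task gain sits near the noise floor is sketched below. It assumes each of the 19 tasks is an independent pass/fail trial (a simplifying assumption that is ours, not a stated part of the benchmark) and computes the binomial standard error of the base model's pass count. The function name stdErrTasks is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// stdErrTasks returns the binomial standard error of a pass count,
// in units of tasks, treating each task as an independent trial.
func stdErrTasks(passed, total int) float64 {
	p := float64(passed) / float64(total)
	return math.Sqrt(float64(total) * p * (1 - p))
}

func main() {
	// Base model: 14 of 19 tasks passed.
	fmt.Printf("%.1f tasks\n", stdErrTasks(14, 19)) // prints: 1.9 tasks
}
```

Under that model the 14 → 15 step is about half of one standard error, while the 12 → 15 gap over the general model is larger but still measured on only 19 tasks. A paired comparison over the same tasks would be a sharper test than this back-of-the-envelope figure.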

3. Bigger was worse — the 14B negative result

We expected the 14B base to have more headroom. It didn't: at 4-bit it scored below the 7B (13 vs 14/19), and the same fine-tuning that helped the 7B actively hurt the 14B (11/19). Our working read, not yet isolated by a separate experiment: 4-bit quantization appears to cost the 14B more, and LoRA hyperparameters do not transfer 7B → 14B for free. The 7B coder is the sweet spot, for now.

What's measured next

We now have an objective evaluation trifecta, all hidden-test and reproducible:

go_dev_bench — writing code (19 tasks).
go_edit_bench — editing flawed code (8 tasks).
go_test_bench — writing tests, scored by mutation testing: a test counts only if it passes the correct code and catches a planted bug.
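The go_test_bench rule reduces to a two-condition check, sketched here. The type and field names are hypothetical; in practice each boolean would come from running the generated test once against the correct code and once against the code with the planted bug.

```go
package main

import "fmt"

// testRun records how one generated test behaved against two builds
// of the function under test. Names are illustrative only.
type testRun struct {
	PassesCorrect bool // test passes against the correct implementation
	PassesMutant  bool // test still passes against the planted-bug build
}

// counts reports whether the generated test earns credit: it must accept
// the correct code and reject (fail on) the buggy code.
func (r testRun) counts() bool {
	return r.PassesCorrect && !r.PassesMutant
}

func main() {
	runs := []testRun{
		{PassesCorrect: true, PassesMutant: false},  // catches the bug: counts
		{PassesCorrect: true, PassesMutant: true},   // too weak: bug slips through
		{PassesCorrect: false, PassesMutant: false}, // wrong test: rejects good code
	}
	for _, r := range runs {
		fmt.Println(r.counts())
	}
	// prints: true, false, false (one per line)
}
```

The second case is the one mutation testing exists to catch: a test that passes on the correct code but asserts so little that the planted bug goes unnoticed.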

Next: a go-test and a go-review specialist on those benchmarks, closing the gaps the tuned model shares with its base (Unicode-rune handling, JSON round-trips), and trying a higher-precision base to see how much of the ceiling is quantization. Every run, win or loss, gets reported here.

All experiments are open and reproducible. Code, datasets and benchmarks: github.com/guildlm/guild-code. Trained and measured locally on an M1 Max with Apple MLX — total cloud spend: $0.