Report #1 · 2026-06-28

The first GuildLM Go specialist, measured honestly

Today we trained the first dedicated GuildLM go-developer model entirely locally on an Apple M1 Max, for $0 and with no cloud GPU, and put it on an objective, hidden-test benchmark. The headline: a Go-specialized 7B beats a general 7B, 15/19 to 12/19. The fine print, which we report in full because honesty is the point: beating an already code-specialized base is hard, and going bigger (14B) made things worse, not better.

The leaderboard

Benchmark: go_dev_bench, 19 held-out "spec → Go function" tasks, each with a hidden test. A task counts only if the model's own generated code compiles and passes its test (pass@1: one sample per task, scored with the real Go toolchain, with no LLM judge in the loop).
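The scoring rule above can be sketched in a few lines of Go. This is a minimal illustration, not the actual go_dev_bench harness: the `Task` type, the `passes` and `passAt1` helpers, the file names, and the two-minute timeout are all assumptions made for the example. The only external dependency is the standard `go test` command.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// Task is a hypothetical shape for one benchmark item.
type Task struct {
	Name       string
	Generated  string // the model's single sample: a complete Go source file
	HiddenTest string // a _test.go file the model never sees
}

// passes writes the sample and its hidden test into a throwaway module and
// runs `go test` there. A compile error, a failing test, and a timeout all
// make the command return a non-nil error, so each one counts as a miss.
func passes(t Task, timeout time.Duration) bool {
	dir, err := os.MkdirTemp("", "gobench-")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)

	files := map[string]string{
		"go.mod":           "module bench\n\ngo 1.20\n",
		"solution.go":      t.Generated,
		"solution_test.go": t.HiddenTest,
	}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			return false
		}
	}

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "go", "test", "./...")
	cmd.Dir = dir
	return cmd.Run() == nil
}

// passAt1 returns how many single-sample results passed, and the total.
func passAt1(results []bool) (passed, total int) {
	for _, ok := range results {
		if ok {
			passed++
		}
	}
	return passed, len(results)
}

func main() {
	// A toy task standing in for a real "spec → Go function" item.
	task := Task{
		Name:      "double",
		Generated: "package bench\n\nfunc Double(n int) int { return 2 * n }\n",
		HiddenTest: "package bench\n\nimport \"testing\"\n\n" +
			"func TestDouble(t *testing.T) {\n" +
			"\tif Double(21) != 42 {\n\t\tt.Fatal(\"want 42\")\n\t}\n}\n",
	}
	results := []bool{passes(task, 2*time.Minute)}
	passed, total := passAt1(results)
	fmt.Printf("pass@1: %d/%d\n", passed, total)
}
```

Because the verdict comes from the exit status of `go test`, there is no partial credit: code that almost compiles scores the same as code that is never produced.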

Model | Type | pass@1
go-dev v1 | 7B coder + Go SFT (ours) | 15/19 · 79%
Qwen2.5-Coder-7B | 7B code base (untuned) | 14/19 · 74%
Qwen2.5-Coder-14B | 14B code base (4-bit) | 13/19 · 68%
Qwen2.5-7B-Instruct | general 7B (non-coder) | 12/19 · 63%
go-dev 14B | 14B coder + Go SFT | 11/19 · 58% ↓↓

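The thesis holds. A Go-specialized 7B (15/19) beats a general 7B (12/19) by +3 tasks, cutting the failure count from 7 down to 4. That supports the GuildLM bet: for a targeted job, a small specialist beats a generalist. This run tests the bet at equal size (7B vs 7B); the stronger form, a small specialist against a bigger generalist, still needs a larger general model on the leaderboard.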
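The percentages and failure counts quoted above follow directly from the pass counts. A small sketch that recomputes them, using the pass counts from the leaderboard (the `percent` and `failures` helper names are ours, for illustration only):

```go
package main

import "fmt"

const total = 19 // tasks in go_dev_bench

// percent rounds passed/total to the nearest whole percent.
func percent(passed int) int {
	return (passed*100 + total/2) / total
}

// failures is the number of tasks a model did not pass.
func failures(passed int) int {
	return total - passed
}

func main() {
	rows := []struct {
		model  string
		passed int
	}{
		{"go-dev v1", 15},
		{"Qwen2.5-Coder-7B", 14},
		{"Qwen2.5-Coder-14B", 13},
		{"Qwen2.5-7B-Instruct", 12},
		{"go-dev 14B", 11},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %2d/%d  %d%%  failures=%d\n",
			r.model, r.passed, total, percent(r.passed), failures(r.passed))
	}
}
```

Reading the result as failures rather than passes is what makes the gap legible: the general 7B misses 7 tasks and the specialist misses 4.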

How it was built — $0, on a laptop

The honest findings

1. Specialist > general LLM (the win)

The Go-tuned model solves tasks the general model fails: case-aware title-casing, element counting, digit summing with proper error wrapping. This is exactly where Go-specific training pays off.

2. Beating a code base with a little SFT is hard (the nuance)

Our baseline, Qwen2.5-Coder-7B, is itself a strong Go model, already pretrained on large amounts of Go code. A ~200-example LoRA nudges it from 14 → 15/19: real, but marginal and near the noise floor, since a single task is worth about 5 percentage points on a 19-task benchmark. Most of the specialist's edge over a general model (12 → 15) comes from choosing a code base (+2), with the Go SFT adding the final point (+1).
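A rough way to see why a one-task gain sits near the noise floor is to compare the size of one task with the binomial standard error of a pass rate measured on 19 tasks. The sketch below is a back-of-the-envelope gauge, not a significance test: it treats tasks as independent trials and ignores that both models were scored on the same tasks. The `stdErr` helper name is ours.

```go
package main

import (
	"fmt"
	"math"
)

// stdErr returns the binomial standard error of a pass rate k/n.
func stdErr(k, n int) float64 {
	p := float64(k) / float64(n)
	return math.Sqrt(p * (1 - p) / float64(n))
}

func main() {
	const n = 19
	general, coderBase, tuned := 12, 14, 15

	fmt.Printf("one task         = %.1f points\n", 100.0/n)
	fmt.Printf("std error @14/19 = %.1f points\n", 100*stdErr(coderBase, n))
	fmt.Printf("code base: +%d, Go SFT: +%d\n", coderBase-general, tuned-coderBase)
}
```

One task is about 5.3 points, while the standard error at 14/19 is about 10.1 points, so the +1 from SFT is well inside one standard error. A larger benchmark or repeated runs would be needed to separate it from chance.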

3. Bigger was worse — the 14B negative result

We expected the 14B base to have more headroom. It didn't: at 4-bit it scored below the 7B (13 vs 14/19), and the same fine-tuning that helped the 7B actively hurt the 14B (11/19, down from 13/19). Our working explanation, not yet isolated by an ablation: 4-bit quantization costs the 14B more, and LoRA hyperparameters do not transfer 7B → 14B for free. The 7B coder is the sweet spot, for now.

What's measured next

We now have an objective evaluation trifecta, all hidden-test and reproducible:

Next: a go-test and go-review specialist on those benchmarks, closing the shared gaps the base still misses (Unicode-rune handling, JSON round-trips), and trying a higher-precision base to see how much of the ceiling is quantization. Every run, win or loss, gets reported here.
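For readers unfamiliar with the two gap categories named above, here is a hypothetical illustration of each. Neither function is taken from the benchmarks; `ReverseRunes`, `Point`, and `RoundTrips` are names invented for this sketch. The first shows why byte-oriented code breaks on multi-byte characters, and the second shows what a JSON round-trip check looks like.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// ReverseRunes reverses s by Unicode code point rather than by byte, so
// multi-byte characters survive. It does not handle combining sequences.
func ReverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// Point is a hypothetical payload type for the round-trip check.
type Point struct {
	X     int    `json:"x"`
	Label string `json:"label"`
}

// RoundTrips reports whether p survives Marshal followed by Unmarshal.
func RoundTrips(p Point) (bool, error) {
	data, err := json.Marshal(p)
	if err != nil {
		return false, fmt.Errorf("marshal: %w", err)
	}
	var back Point
	if err := json.Unmarshal(data, &back); err != nil {
		return false, fmt.Errorf("unmarshal: %w", err)
	}
	return p == back, nil
}

func main() {
	s := "h\u00e9llo" // "héllo": 5 runes, 6 bytes in UTF-8
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 5
	fmt.Println(ReverseRunes(s) == "oll\u00e9h")   // true

	ok, err := RoundTrips(Point{X: 1, Label: "caf\u00e9"})
	fmt.Println(ok, err) // true <nil>
}
```

A solution that indexes the string by byte would pass ASCII-only tests and fail on input like the one above, which is why hidden tests with non-ASCII cases are useful for exposing this gap.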

All experiments are open and reproducible. Code, datasets and benchmarks: github.com/guildlm/guild-code. Trained and measured locally on an M1 Max with Apple MLX — total cloud spend: $0.