Report #5 · 2026-06-29

The algorithm runs: specialists + agent loop = a green backend

The previous four reports measured models on benchmarks. This one runs the actual product: two trained Go specialists, each served as a live endpoint, wired into the Builder's agent loop with role-routing — one model writes the implementation, a different one writes the tests. The whole thing runs on an M1 Max with Apple MLX. Cloud spend: still $0. We asked it for a small Go package and watched what came out.

The setup: a guild, not a model

Two LoRA-tuned specialists over the same Qwen2.5-Coder-7B base, each served behind an OpenAI-compatible API via mlx_lm.server:

go-dev on :8080 — writes implementation files.
go-test on :8081 — writes *_test.go files.
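
The split above amounts to a filename rule. A minimal sketch of that rule, assuming the *_test.go suffix is the only routing signal and that both servers listen on localhost under a /v1 prefix (the Specialist type, RouteFile and the URLs are hypothetical names for illustration, not the Builder's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// Specialist describes one served adapter. The names and ports mirror the
// report; the struct and the localhost URLs are assumptions of this sketch.
type Specialist struct {
	Name    string
	BaseURL string
}

var (
	goDev  = Specialist{Name: "go-dev", BaseURL: "http://localhost:8080/v1"}
	goTest = Specialist{Name: "go-test", BaseURL: "http://localhost:8081/v1"}
)

// RouteFile picks the specialist that owns a file. Test files go to go-test,
// everything else goes to go-dev.
func RouteFile(path string) Specialist {
	if strings.HasSuffix(path, "_test.go") {
		return goTest
	}
	return goDev
}

func main() {
	for _, f := range []string{"numkit.go", "numkit_test.go"} {
		s := RouteFile(f)
		fmt.Printf("%s -> %s (%s)\n", f, s.Name, s.BaseURL)
	}
}
```

Keeping the rule deterministic means a routing mistake can never be blamed on a model: the same file name always reaches the same endpoint.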

The Builder decomposes a spec into files, routes each file to the specialist that owns it, and runs the verification loop: generate → go build → vet → test → fix, with deterministic quality gates (stdlib-only imports, no cross-file redeclaration, gofmt-clean, tests that actually assert). The headline metric is score_backend: does the whole generated project build, vet and test with the real Go toolchain.
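
To make "deterministic quality gate" concrete, here is one way a stdlib-only import check could look. It is a sketch, not the Builder's gate: NonStdlibImports is a hypothetical name, and the rule it applies (a standard-library import path has no dot in its first path element) is a common heuristic rather than an official definition.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// NonStdlibImports parses only the import clause of src and returns every
// import path that does not look like a standard-library package.
// Heuristic (an assumption of this sketch): module paths such as
// github.com/x/y carry a dot in their first element, stdlib paths do not.
func NonStdlibImports(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "gen.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var bad []string
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		first, _, _ := strings.Cut(path, "/")
		if strings.Contains(first, ".") {
			bad = append(bad, path)
		}
	}
	return bad, nil
}

func main() {
	src := "package p\n\nimport (\n\t\"fmt\"\n\t\"github.com/x/y\"\n)\n"
	bad, err := NonStdlibImports(src)
	fmt.Println(bad, err)
}
```

Because the check works on the parsed import clause, it gives the same verdict on every run and costs no model call.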

The result: first-try green

Given a 3-file spec for numkit (a pure-stdlib package with GCD and Clamp plus table-driven tests), the guild produced a working, verified backend on the first pass — no fix rounds, 17 seconds:

Stage	Result
go build	✓
go vet	✓
go test	✓
score_backend	3/3
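
A rough sketch of how such a score could be computed, assuming 3/3 counts the three toolchain stages in the table (the report does not spell out the denominator). ScoreBackend, Runner and execRunner are hypothetical names; the runner is injected so the counting logic can be exercised without a Go project on disk.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Runner executes one `go` subcommand inside dir and reports its error.
type Runner func(dir string, args ...string) error

// execRunner shells out to the real Go toolchain.
func execRunner(dir string, args ...string) error {
	cmd := exec.Command("go", args...)
	cmd.Dir = dir
	return cmd.Run()
}

// ScoreBackend runs build, vet and test in that order and counts how many
// stages exit cleanly. It does not stop at the first failure.
func ScoreBackend(dir string, run Runner) (passed, total int) {
	stages := [][]string{
		{"build", "./..."},
		{"vet", "./..."},
		{"test", "./..."},
	}
	for _, args := range stages {
		if run(dir, args...) == nil {
			passed++
		}
	}
	return passed, len(stages)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: score <project-dir>")
		return
	}
	passed, total := ScoreBackend(os.Args[1], execRunner)
	fmt.Printf("score_backend %d/%d\n", passed, total)
}
```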

The implementation was correct on the tricky edges — GCD(0,0)=0, negative operands via abs, Clamp with inverted bounds returning the input unchanged. The test specialist wrote a real table covering coprimes, a common factor, a zero operand, negatives, and clamp boundaries — with genuine t.Errorf assertions, all passing. Two models that never share a context window, cooperating into one green package. That is the bet — capability = model × algorithm — working end to end.
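
For reference, an implementation with exactly those edge-case behaviors can be very small. The following is a reconstruction from the description above, not the code the guild generated; the int signatures are an assumption.

```go
package main

import "fmt"

// GCD returns the greatest common divisor of a and b using Euclid's
// algorithm. Negative operands are folded to their absolute value, and
// GCD(0, 0) is 0 because the loop never runs.
func GCD(a, b int) int {
	if a < 0 {
		a = -a
	}
	if b < 0 {
		b = -b
	}
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

// Clamp limits v to the range [lo, hi]. If the bounds are inverted
// (lo > hi) the input is returned unchanged.
func Clamp(v, lo, hi int) int {
	if lo > hi {
		return v
	}
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

func main() {
	fmt.Println(GCD(-12, 18), GCD(0, 0), Clamp(5, 10, 0)) // prints: 6 0 5
}
```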

The honest part: where it broke, and why

We first ran the same pipeline on stringkit: Reverse and IsPalindrome with Unicode test cases. The implementation was flawless (go build green immediately). But the suite failed, and it kept failing through every fix round. The reason is worth the whole report:

The test specialist encoded factually wrong expectations and could not be argued out of them. It asserted that Reverse("こんにちは") equals itself — it does not — and that a katakana string was a palindrome when it is not. The implementation was right; the test was wrong.
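
The correct expectation is easy to check by hand. A minimal sketch, assuming Reverse works code point by code point (the report does not show the generated implementation, and the two katakana strings here are our own examples, not the ones from the failing test):

```go
package main

import "fmt"

// Reverse returns s with its Unicode code points in reverse order.
// It works on runes, not bytes, so multi-byte characters stay intact.
// It does not handle grapheme clusters such as combining marks.
func Reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// IsPalindrome reports whether s reads the same in both directions.
func IsPalindrome(s string) bool {
	return s == Reverse(s)
}

func main() {
	fmt.Println(Reverse("こんにちは"))    // prints: はちにんこ
	fmt.Println(IsPalindrome("カタカナ")) // prints: false
	fmt.Println(IsPalindrome("トマト"))  // prints: true
}
```

The five characters こ, ん, に, ち, は come back as は, ち, に, ん, こ, so any test asserting that the string equals its own reverse is wrong regardless of the implementation.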

We threw the algorithm at it. Best-of-N resampling, fifteen candidates across five fix rounds: every single sample re-emitted the same wrong value. We cross-routed the repair to the other specialist (go-dev) and explicitly told it "the implementation is correct, fix the expected value"; it produced the identical wrong expectation. The tell: both adapters share one Qwen2.5-Coder-7B base, and that base simply does not know how a Unicode string reverses. No amount of scaffolding around a model can supply knowledge the model does not have.

This is a clarifying failure, not a discouraging one. The algorithm did its job: it localized the fault to the test file (the build was green; only the assertion failed) and refused to corrupt a correct implementation to satisfy a wrong test. The residual gap is a base-model knowledge limit, not an algorithm limit — the same lesson Report #4 reached from the benchmark side, now reached from the product side.

Two upgrades the demo earned

Running the real loop surfaced two concrete improvements, both now shipped and unit-tested in the Builder:

A goimports gate. The most common small-model failure is a single dead import that blocks the entire compile. We made file formatting prefer goimports over gofmt, so unused imports are pruned deterministically — no model fix-round spent on a one-line cleanup the toolchain can do for free. (On stringkit this alone turned the implementation green.)
Verification-guided fix selection. Best-of-N during repair used to keep the first candidate that parses. Now it keeps the first candidate that actually makes go build/vet/test pass: the ground-truth signal is applied at selection time, not just after. A model that keeps re-emitting a broken fix is passed over in favor of the first sample that goes green.
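
The selection change can be sketched in a few lines. This is an illustration under assumptions, not the shipped code: SelectVerified and Verifier are hypothetical names, and the verifier is a plain function standing in for a real go build/vet/test run against the candidate.

```go
package main

import "fmt"

// Verifier reports whether a candidate fix passes verification. In the real
// loop this role would be played by the Go toolchain; here it is injected.
type Verifier func(candidate string) bool

// SelectVerified returns the first candidate that the verifier accepts.
// The boolean is false when no candidate passes, so the caller can start
// another fix round instead of keeping a broken sample.
func SelectVerified(candidates []string, passes Verifier) (string, bool) {
	for _, c := range candidates {
		if passes(c) {
			return c, true
		}
	}
	return "", false
}

func main() {
	candidates := []string{"broken fix", "broken fix", "green fix"}
	isGreen := func(c string) bool { return c == "green fix" }

	best, ok := SelectVerified(candidates, isGreen)
	fmt.Println(best, ok)
}
```

The old behavior corresponds to a verifier that only checks that the candidate parses. Swapping in the toolchain result changes what "first acceptable" means without changing the sampling itself.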

Where this leaves the thesis

Four reports established that for Go code-gen the base matters more than per-role fine-tuning. This one shows the other half: once you have a decent base, the algorithm (decompose, role-route, verify, gate, loop) turns two small 7B specialists into a system that writes a correct, tested backend on the first try. And when it can't, it fails in the honest direction: it tells you the gap is in the model's knowledge, and points the next investment (a stronger base code model, or oracle-grounded repair) at the right place.

Specialists served and the full Builder loop run locally on an M1 Max with Apple MLX — total cloud spend: $0. The agent loop, gates and specs: github.com/guildlm/builder.