Report #6 · 2026-06-29

Data was the lever: grounding beats the SFT at the project level

Report #5 showed the agent loop writing a clean little package first try. This one asks the harder question the mission actually cares about: on a real multi-file HTTP service, does the trained specialist beat the raw base — and if neither nails it, what's the lever that does? We ran four configurations through the same loop and let the Go toolchain score every one.

The test: a small stdlib HTTP service

kvservice — a 4-file, standard-library-only key/value HTTP service (an in-memory mutex-guarded store, an http.ServeMux, GET/PUT endpoints, httptest tests). It's deliberately the case where a small 7B stumbles: it reflexively reaches for a third-party router, and it gets modern net/http status contracts (201 on create, 404 on missing) wrong. Score is score_backend: build + vet + test, real toolchain, 0–3.

The result: three ways to score 2/3

Configuration	score_backend
Base 7B only (no fine-tune)	2/3
Specialists, role-routed (go-dev + go-test)	2/3
Specialists + retrieval over the 212-example go_dev corpus	2/3
Specialists + 2 focused modern-ServeMux examples	3/3

The first three all land in the same place: build and vet pass — the deterministic import gate did its job, killing the third-party-router reflex so every run stayed pure stdlib — but the tests fail on the HTTP contract, and five fix rounds don't rescue it. The trained specialist does not beat the base here. Per-role fine-tuning is again not the project-level lever — the same conclusion Report #4 reached on benchmarks.

What actually moved the number

The corpus was the bottleneck, not the algorithm. Of the 212 examples in the go_dev retrieval set, exactly one uses a modern ServeMux HTTP contract. Lexical retrieval almost never surfaces it, so the model falls back on its pre-1.22 prior. Hand the loop just two compile- and test-verified HTTP examples and it converges to 3/3 in two fix rounds — correct 201/404 handling, real assertions, still pure stdlib.

The mechanism is worth being precise about. The grounding examples did not force a particular routing syntax — the model still wrote a classic HandleFunc("/kv/", …) with a method switch. What they supplied was a coherent, known-good contract: PUT returns 201, a missing GET returns 404, the test drives the handler with httptest. Once both the implementation model and the test model are anchored to the same contract, they agree — and the toolchain goes green.

It generalizes — and it scales

One spec is an anecdote. So we repeated the experiment on two more contracts the small base also gets wrong — a stdlib JSON endpoint (jsonapi) and a race-free generic ParallelMap worker pool (pool, where a 7B routinely writes a data race or a deadlock). Same shape of result every time:

Spec	Contract	no grounding	+ verified examples
kvservice	HTTP routing & status	2/3	3/3
jsonapi	JSON encode/decode & status	2/3	3/3 (first try)
pool	race-free concurrency	2/3	3/3 (first try, `-race` clean)

Three distinct contracts, the same lift, from the same cheap intervention. And it scales in one corpus: we merged all six verified examples into a single retrieval file, and lexical retrieval still picks the right contract per spec — the ServeMux example for the router, the JSON example for the endpoint, the worker-pool example for the concurrency package. A single growing corpus serves many contracts; retrieval routes the right one. The pattern isn't one library quirk — a small model writes a correct, tested backend the moment it's grounded in a known-good example of the contract it's being asked to honour.

Where this leaves the thesis

Capability = model × algorithm, and this report sharpens what "algorithm" means when the base is fixed and small:

Deterministic gates neutralize the base's reflexes. The import gate alone turned the old 0/3 "it imports gorilla/mux" failure into a reliable build+vet pass. That floor is now free.
Retrieval grounding supplies missing knowledge — if the corpus has it. Two right examples beat 212 mostly-irrelevant ones and beat the fine-tune. The cheapest, highest-leverage investment is a small set of verified examples for the contracts the base gets wrong.
The fine-tune keeps not being the lever. At the project level, as at the benchmark level, base choice + the algorithm carry the result. The SFT earns its place only where the base is genuinely weak (writing tests — Report #2).

So the next investment is obvious and it isn't a bigger GPU bill: grow the retrieval corpus with verified examples of the exact contracts small models get wrong. Data was the lever all along.

Every configuration across both specs served and run locally on an M1 Max with Apple MLX — total cloud spend: $0. The specs, the verified retrieval corpora and the agent loop: github.com/guildlm/builder.