Data was the lever: grounding beats the SFT at the project level
Report #5 showed the agent loop writing a clean little package first try. This one asks the harder question the mission actually cares about: on a real multi-file HTTP service, does the trained specialist beat the raw base — and if neither nails it, what's the lever that does? We ran four configurations through the same loop and let the Go toolchain score every one.
The test: a small stdlib HTTP service
kvservice — a 4-file, standard-library-only key/value HTTP service
(an in-memory mutex-guarded store, an http.ServeMux, GET/PUT
endpoints, httptest tests). It's deliberately the case where a small 7B
stumbles: it reflexively reaches for a third-party router, and it gets modern
net/http status contracts (201 on create, 404 on missing) wrong.
Score is score_backend: build + vet + test, real toolchain, 0–3.
The result: three ways to score 2/3
| Configuration | score_backend |
|---|---|
| Base 7B only (no fine-tune) | 2/3 |
| Specialists, role-routed (go-dev + go-test) | 2/3 |
| Specialists + retrieval over the 212-example go_dev corpus | 2/3 |
| Specialists + 2 focused modern-ServeMux examples | 3/3 |
The first three all land in the same place: build and
vet pass — the deterministic import gate did its job, killing the
third-party-router reflex so every run stayed pure stdlib — but the tests fail
on the HTTP contract, and five fix rounds don't rescue it. The trained
specialist does not beat the base here. Per-role fine-tuning is again
not the project-level lever — the same conclusion
Report #4 reached on benchmarks.
What actually moved the number
The corpus was the bottleneck, not the algorithm. Of the 212 examples in the go_dev retrieval set, exactly one uses a modern ServeMux HTTP contract. Lexical retrieval almost never surfaces it, so the model falls back on its pre-1.22 prior. Hand the loop just two compile- and test-verified HTTP examples and it converges to 3/3 in two fix rounds — correct 201/404 handling, real assertions, still pure stdlib.
The mechanism is worth being precise about. The grounding examples did not force
a particular routing syntax — the model still wrote a classic
HandleFunc("/kv/", …) with a method switch. What they supplied was a
coherent, known-good contract: PUT returns 201, a missing GET returns
404, the test drives the handler with httptest. Once both the
implementation model and the test model are anchored to the same contract, they
agree — and the toolchain goes green.
It generalizes — and it scales
One spec is an anecdote. So we repeated the experiment on two more contracts the
small base also gets wrong — a stdlib JSON endpoint (jsonapi) and a
race-free generic ParallelMap worker pool (pool, where
a 7B routinely writes a data race or a deadlock). Same shape of result every
time:
| Spec | Contract | no grounding | + verified examples |
|---|---|---|---|
| kvservice | HTTP routing & status | 2/3 | 3/3 |
| jsonapi | JSON encode/decode & status | 2/3 | 3/3 (first try) |
| pool | race-free concurrency | 2/3 | 3/3 (first try, -race clean) |
Three distinct contracts, the same lift, from the same cheap intervention. And it scales in one corpus: we merged all six verified examples into a single retrieval file, and lexical retrieval still picks the right contract per spec — the ServeMux example for the router, the JSON example for the endpoint, the worker-pool example for the concurrency package. A single growing corpus serves many contracts; retrieval routes the right one. The pattern isn't one library quirk — a small model writes a correct, tested backend the moment it's grounded in a known-good example of the contract it's being asked to honour.
Where this leaves the thesis
Capability = model × algorithm, and this report sharpens what "algorithm" means when the base is fixed and small:
- Deterministic gates neutralize the base's reflexes. The import gate alone turned the old 0/3 "it imports gorilla/mux" failure into a reliable build+vet pass. That floor is now free.
- Retrieval grounding supplies missing knowledge — if the corpus has it. Two right examples beat 212 mostly-irrelevant ones and beat the fine-tune. The cheapest, highest-leverage investment is a small set of verified examples for the contracts the base gets wrong.
- The fine-tune keeps not being the lever. At the project level, as at the benchmark level, base choice + the algorithm carry the result. The SFT earns its place only where the base is genuinely weak (writing tests — Report #2).
So the next investment is obvious and it isn't a bigger GPU bill: grow the retrieval corpus with verified examples of the exact contracts small models get wrong. Data was the lever all along.