Sonos Data and AI Leader Manushi Sheth on Why Small-Model AI Products Fail Quietly

A demo never tells you the truth about a data pipeline. It tells you the pipeline worked once, on inputs the builder chose, on a Tuesday afternoon. The failure that matters arrives three weeks later, when a column changes type upstream, a vendor renames a field, or a user uploads a file that looks nothing like the sample data — and the model, asked to reason over malformed input it was never shown, produces an answer that is confident, plausible, and wrong. Nobody notices, because there is no error. There is only a slow drift away from correctness that no dashboard is watching.

Manushi Sheth, who leads the Product Data team at Sonos across data engineering, analytics, and machine learning, evaluated a slate of sub-billion-parameter AI projects at the Garage Inference Hackathon — and kept returning to the same question: not whether the model was clever, but whether the data around it would survive contact with production.

Garage Inference, organized by Hackathon Raptors, challenged teams to build genuinely useful tools on “garage-grade” language models — models small enough to run on a laptop, often under a billion parameters, where raw capability is scarce and every other engineering decision has to compensate. Thirty-seven teams shipped projects that, on a small model, had no business working as well as they did.

Sheth approached them the way she approaches systems at Sonos, where the Product Data team owns the data layer beneath product analytics and machine learning for millions of connected devices. Her vantage point is unusual for a hackathon judge: she does not start from the model. She starts from the data contract — the implicit promise about what shape the inputs will take, who guarantees that shape, and what happens when the guarantee breaks. “A small model is a magnifying glass held over your data,” she explains. “Whatever discipline you skipped, it finds and amplifies. A big model can paper over a messy schema with sheer pattern-matching. A 600-million-parameter model cannot. It will do exactly what your data quality lets it do, and not one step more.”

That framing turns out to be the sharpest lens available on a field that mostly markets itself on model size. The interesting story at Garage Inference was not which team used the smallest model. It was which teams understood that the model was the cheap part — and that the expensive, unglamorous, career-defining work lives in the data that flows in and out of it.

The 40 Percent Problem

The clearest illustration came from a project called Nezz, a data-visualization tool built on Qwen 3 0.6B running entirely offline. The team was refreshingly honest in its own writeup: the model, left alone, gets the analysis right only about 40 percent of the time. What lifted it to something usable was a layer of structured XML prompting that constrained the model’s output into a shape the rest of the system could trust.

Sheth treats that 40 percent figure as the most important number in the project, and not as an embarrassment. “They told you the model’s raw accuracy, which almost nobody does, and then they showed you the engineering that closed the gap,” she notes. “That gap — between what the model produces and what the system can rely on — is exactly where data debt accumulates. If you don’t measure it, you don’t know it’s there. You just ship the 40 percent and wonder why users churn.”

Her concern is what happens to a tool like Nezz when the input distribution moves. A CSV uploaded by the developer is clean. A CSV exported from a decade-old enterprise reporting system has merged cells, inconsistent date formats, and a header row that starts on line four. “The structured prompt is doing real work, but it’s reasoning about the data the team imagined,” she observes. “The production question is whether there’s a validation layer in front of the model that catches the input you didn’t imagine — because on a small model, a malformed input doesn’t degrade gracefully. It fails silently and fluently.”
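A validation layer of the kind Sheth describes can be quite small. The sketch below is illustrative only and is not taken from Nezz: the function name `validate_csv`, the expected header, and the single accepted date format are all assumptions made for the example. It rejects a file whose first line is not the expected header, and flags short rows and unparseable dates before any model is asked to reason over them.

```python
import csv
import io
from datetime import datetime

# Assumption for this sketch: dates must be ISO-style, e.g. 2024-01-05.
DATE_FORMAT = "%Y-%m-%d"


def is_valid_date(value):
    """Return True if the value parses under the one accepted date format."""
    try:
        datetime.strptime(value.strip(), DATE_FORMAT)
        return True
    except ValueError:
        return False


def validate_csv(text, expected_header, date_column):
    """Check a CSV against an expected shape before any model sees it.

    Returns a list of human-readable problems. An empty list means the
    file matches the contract this sketch assumes.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    header = [cell.strip() for cell in rows[0]]
    if header != expected_header:
        # Catches exports whose real header starts several lines down.
        return [f"line 1 is not the expected header: {header}"]

    problems = []
    date_index = expected_header.index(date_column)
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_header):
            problems.append(
                f"line {line_number}: expected {len(expected_header)} cells, got {len(row)}"
            )
            continue
        if not is_valid_date(row[date_index]):
            problems.append(
                f"line {line_number}: unrecognized date {row[date_index]!r}"
            )
    return problems
```

The design choice worth noting is that the function reports problems rather than trying to repair them. A repair step can be added later, but the model is only invoked when the list comes back empty.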

That phrase — fails silently and fluently — is, in Sheth’s reading, the defining risk of the entire small-model category. A model with no capacity to flag its own uncertainty will narrate a wrong answer with the same fluency as a right one.

Extract, Then Verify

If Nezz exposed the problem, a project called Distill v2.0 modeled the discipline Sheth wanted to see. Built around the same tiny Qwen model, Distill’s architecture was organized around a single principle the team stated plainly: the model extracts, but Distill verifies. The model proposes structured fields pulled from messy documents; a separate, deterministic layer checks those fields before anything downstream trusts them.
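The extract-then-verify split can be sketched in a few lines. This is not Distill's code. The invoice fields, the allowed currencies, and the function name `verify_extraction` are hypothetical, chosen only to show the shape of the pattern: the model's output is parsed as untrusted text, and a deterministic function decides whether anything downstream may use it.

```python
import json

# Hypothetical contract for an invoice extraction; names are illustrative.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}

FIELD_CHECKS = {
    "invoice_id": lambda v: isinstance(v, str) and v != "",
    "total": lambda v: (
        isinstance(v, (int, float)) and not isinstance(v, bool) and v >= 0
    ),
    "currency": lambda v: isinstance(v, str) and v in ALLOWED_CURRENCIES,
}


def verify_extraction(raw_model_output, source_text):
    """Deterministically check fields that a model proposed.

    Returns (fields, errors). `fields` is None whenever `errors` is
    non-empty, so callers cannot use unverified values by accident.
    """
    try:
        fields = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        return None, [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(fields, dict):
        return None, ["output is not a JSON object"]

    errors = []
    for name, check in FIELD_CHECKS.items():
        if name not in fields:
            errors.append(f"missing field: {name}")
        elif not check(fields[name]):
            errors.append(f"invalid value for {name}: {fields[name]!r}")
    if errors:
        return None, errors

    # Grounding check: an extracted ID must literally appear in the source.
    if fields["invoice_id"] not in source_text:
        return None, ["invoice_id does not appear in the source document"]
    return fields, []
```

The grounding check at the end is the part a schema alone cannot provide: a well-formed but invented invoice number passes every type check and still fails verification.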

“This is the data contract made explicit,” Sheth says. “The model is treated as an untrusted producer, and there’s a verification boundary between it and the rest of the system. That’s not a small-model pattern — that’s how you should treat any probabilistic component in a data pipeline, including a frontier model. The teams that internalized this at the garage scale are going to build more reliable systems at every scale.”

She is careful to separate the architecture from its current completeness. A verification layer is only as good as the rules it encodes, and a hackathon build inevitably encodes the obvious ones. But the shape is right, and the shape is what compounds. “You can always add validation rules. What you can’t easily retrofit is a system that was designed to trust the model implicitly. That’s the data debt that sinks products in month two — not a missing feature, but an architecture that assumed the data was clean when it never was.”

The Scaffolding Thesis, Examined

Two of the strongest projects in Sheth’s batch attacked code review with the same conviction. RAG Tag’s “Atomic PR Surgeon” ran an ensemble of four specialized micro-agents on a 600-million-parameter model to catch SQL injection, N+1 queries, and cross-file vulnerabilities, then generate fix patches. TinyFlow AI made the underlying argument its explicit thesis: engineering scaffolding, not model size, is the true multiplier of capability.

Sheth agrees with the thesis and then pressures it where it is weakest. “Scaffolding multiplies capability, yes. It also multiplies your data surface,” she explains. “Every agent in that ensemble consumes context, produces output, and passes state to the next stage. Each handoff is a place where the data contract can silently break — where agent two assumes a field that agent one stopped producing. The orchestration is the clever part. The schema between the agents is the part that will page you at 2 a.m.”

It is a characteristically data-first reframing. Where the teams saw an architecture of cooperating models, Sheth saw an internal data pipeline with several undocumented interfaces, and she has spent enough years operating those pipelines to know that the interfaces, not the components, are where reliability is won or lost. “The question I’d ask the RAG Tag team isn’t ‘does the ensemble work,’” she says. “It’s ‘what does agent three do when agent two hands it a half-formed object?’ If the answer is ‘it’s never happened in testing,’ that’s not an answer. That’s data debt with a due date.”
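One way to give the half-formed-object question an answer is to put an explicit schema on each handoff. The sketch below is an assumption-laden illustration, not RAG Tag's implementation: the `Finding` message, its three fields, and `accept_handoff` are invented for the example. The receiving stage validates every payload and sets aside the ones that break the contract, together with the reason, instead of passing them along.

```python
from dataclasses import dataclass


class HandoffError(Exception):
    """Raised when an upstream stage sends a message that breaks the contract."""


@dataclass(frozen=True)
class Finding:
    """Hypothetical message passed from one review agent to the next."""
    file_path: str
    line: int
    rule: str

    @classmethod
    def from_dict(cls, payload):
        if not isinstance(payload, dict):
            raise HandoffError(f"expected an object, got {type(payload).__name__}")
        missing = [n for n in ("file_path", "line", "rule") if n not in payload]
        if missing:
            raise HandoffError(f"missing fields: {', '.join(missing)}")
        if not isinstance(payload["file_path"], str) or not payload["file_path"]:
            raise HandoffError("file_path must be a non-empty string")
        line = payload["line"]
        if isinstance(line, bool) or not isinstance(line, int) or line < 1:
            raise HandoffError("line must be a positive integer")
        if not isinstance(payload["rule"], str) or not payload["rule"]:
            raise HandoffError("rule must be a non-empty string")
        return cls(payload["file_path"], line, payload["rule"])


def accept_handoff(payloads):
    """Split upstream output into valid findings and rejected payloads."""
    accepted, rejected = [], []
    for payload in payloads:
        try:
            accepted.append(Finding.from_dict(payload))
        except HandoffError as exc:
            # Keep the payload and the reason so the failure is visible later.
            rejected.append((payload, str(exc)))
    return accepted, rejected
```

Keeping the rejected payloads, rather than dropping them, means a field that an upstream agent stopped producing shows up as a growing rejection count instead of a silent change in behavior.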

When the Domain Raises the Stakes

The project that drew Sheth’s most pointed analysis was TaxPlain, a tool aimed at the 864,000 UK sole traders and landlords newly subject to quarterly digital tax filing. The team built it to explain an unfamiliar regulation to people who cannot afford an accountant. Sheth scored it carefully, noting in her own evaluation that the project shipped without a presentation and that she assessed it by using the tool directly rather than reading a pitch.

Her caution here is instructive, because it is about consequences rather than craft. “The moment your small model is reasoning about someone’s tax obligation, the data debt becomes a liability question,” she says. “A visualization tool that’s wrong 5 percent of the time is annoying. A tax tool that’s wrong 5 percent of the time, on a small model with no calibrated uncertainty, is giving 43,000 people confidently incorrect guidance about money they owe the government.”

Her recommendation is not that hackathon teams avoid high-stakes domains — it is that high-stakes domains demand the verification layer be the first thing built, not the last. “In regulated data, you earn the right to use the model by building the guardrails first. The teams that treated security and validation as a tenth-place criterion had it backwards. In the domains where this technology will actually matter — health, finance, tax — that’s the first criterion, not the last.”

Lineage Is the Feature Nobody Demos

There is a category of data work that never appears in a hackathon presentation because it produces nothing a user can see: lineage — the recorded chain of where each piece of data came from and what transformed it on the way to the model. Sheth spends a meaningful share of her time at Sonos on exactly this invisible plumbing, and she watched for traces of it across the batch.

“Lineage is the thing you don’t appreciate until an answer is wrong and someone asks you why,” she says. “On a big-model product, you can sometimes get away without it because the model is robust enough to absorb upstream noise. On a small model, when the output is wrong, the cause is almost always upstream — a field that changed meaning, a transform that silently dropped rows, a join that fanned out. Without lineage, you’re debugging a probabilistic model by guessing. With it, you can trace the wrong answer back to the exact step where the data went bad.”

She points to the projects that built explicit, inspectable intermediate stages, among them validation logs and the “proof panels” and human decision gates of the survival game last-worlder, as quietly modeling the right instinct. “The teams that surfaced what the system did at each step, rather than just the final answer, were building a primitive form of lineage. They probably thought of it as a debug view. It’s actually the foundation of being able to operate the thing. The debug view and the audit log are the same artifact viewed at different moments.”

Her broader point is that observability of data is not a production luxury bolted on later — it is a design decision that has to be made at the start, because it determines whether the system can ever be debugged at all. “You can’t add lineage retroactively to a pipeline that threw away its intermediate state. That’s another form of the debt — you saved a little engineering today and mortgaged your ability to ever understand a failure tomorrow.”
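A primitive form of lineage needs little more than a wrapper that records what each step received and produced. The following is a minimal sketch under stated assumptions: `LineageLog` and the two example steps are hypothetical, and real lineage tooling tracks far more than row counts. It is enough, though, to show how a transform that silently drops rows becomes something a log can point to.

```python
from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """Records row counts going into and out of each pipeline step."""
    entries: list = field(default_factory=list)

    def run(self, step_name, func, rows):
        """Apply one step and record what it did to the row count."""
        result = func(rows)
        self.entries.append(
            {"step": step_name, "rows_in": len(rows), "rows_out": len(result)}
        )
        return result

    def row_count_changes(self):
        """Return the steps whose output row count differs from their input."""
        return [e for e in self.entries if e["rows_in"] != e["rows_out"]]


# Two illustrative steps; the field names are assumptions for this example.
def drop_missing_amounts(rows):
    return [r for r in rows if r.get("amount") is not None]


def normalize_currency(rows):
    return [{**r, "currency": r["currency"].upper()} for r in rows]
```

Running both steps through `LineageLog.run` and then calling `row_count_changes()` singles out the step that lost rows, which is the trace-it-back capability the quote describes. The same log serves as the debug view during development and the audit record afterward.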

A Framework for Data Debt at the Garage Scale

Across the batch, Sheth’s evaluations cohered into something close to a checklist — a way of asking whether a small-model product has hidden data debt before it ships:

Measure the raw gap. Know your model’s unassisted accuracy on real inputs, the way the Nezz team did. The distance between that number and your system’s reliability is the size of the debt you are taking on. An unmeasured gap is an unbudgeted liability.

Put a contract on every model boundary. Every place the model produces output that something else consumes — a downstream stage, an agent, a user-facing field — needs an explicit shape and a verifier, as Distill v2.0 demonstrated. Implicit contracts are debt that compounds with every new consumer.

Treat malformed input as the common case, not the edge case. Small models fail silently and fluently on inputs outside their imagined distribution. The validation layer in front of the model matters more than the prompt inside it.

Match the guardrails to the blast radius. A satirical companion app and a tax-advice tool can use the same 600-million-parameter model and require entirely different amounts of verification. The domain sets the bar, not the model.
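The first item on the checklist, measuring the raw gap, comes down to scoring the same labeled inputs twice: once with the model's unassisted output and once with the full system's output. The sketch below assumes exact-match scoring and a small hand-labeled evaluation set, and the function names are hypothetical. Real evaluations usually need a more forgiving comparison.

```python
def accuracy(predictions, expected):
    """Fraction of predictions that exactly match the expected answers."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must be the same length")
    if not expected:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for p, e in zip(predictions, expected) if p == e)
    return correct / len(expected)


def raw_gap(raw_predictions, system_predictions, expected):
    """Compare unassisted model accuracy with whole-system accuracy.

    The gap is how much of the system's reliability comes from the
    scaffolding around the model rather than from the model itself.
    """
    raw = accuracy(raw_predictions, expected)
    system = accuracy(system_predictions, expected)
    return {"raw_accuracy": raw, "system_accuracy": system, "gap": system - raw}
```

The evaluation set should be drawn from real inputs rather than the developer's own samples. Otherwise both numbers describe the data the team imagined, not the data it will receive.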

“None of this is exotic,” Sheth says. “It’s the same data discipline that separates an analytics dashboard people trust from one they quietly stop opening. The garage constraint just makes the consequences arrive faster. You find out in a weekend what would otherwise take you a quarter to learn in production.”

She is also clear that the checklist is not a counsel of perfection aimed at hackathon teams who had 72 hours. It is a way of seeing, and the teams that demonstrated even a partial version of it had, in her assessment, already absorbed the lesson that matters most. “I’m not asking a weekend project to have a full data governance stack. I’m asking whether the team understood, at the level of instinct, that the model is the part you can least trust. The ones who built a verifier, who measured their raw accuracy, who kept their intermediate state — they got it. You can teach someone the specific tools. You can’t easily teach the instinct, and these teams either had it or they didn’t, and it showed in the architecture long before it showed in the scores.”

Why the Constraint Is a Gift

For Sheth, the lasting value of an event like Garage Inference is precisely that it strips away the cushion that large models provide. A frontier model’s surplus capability lets teams defer their data problems; a garage model forces them to confront those problems on day one, when they are cheap to fix.

“The teams here got a compressed, accelerated lesson in the thing that actually breaks AI products,” she says. “It’s almost never the model. It’s the assumption that the data feeding the model and the data flowing out of it will behave. Build for the day that assumption fails, and you’ve built something that survives. Skip it, and you’ve built a very good demo.”

That, in the end, is the uncomfortable truth a small model tells you that a large one will let you ignore. The model is the cheap part. The data discipline around it is the product — and the bill for skipping it always comes due, quietly, after the applause.

Garage Inference Hackathon 2026 was organized by Hackathon Raptors, a Community Interest Company supporting innovation in software development. The event challenged 37 teams to build useful tools on sub-billion-parameter language models. Manushi Sheth, who leads the Product Data team at Sonos across data engineering, analytics, and machine learning, served as a judge evaluating projects for technical execution, practical usefulness, and the engineering rigor that lets small models outp