Introduction
At 789 Labs, we have the luxury of not only building AI products for our clients but also staying involved after they're released. That means we don't just build to ship; we build to maintain.
In this post, we'll walk through a real incident in which our client received customer complaints they couldn't reproduce with traditional or vibe-based testing, and how we tracked down the root cause by simulating their problematic customer transactions at scale.
Background
Let me give you context about the client’s product. We’ll call our client “the Company.” The product we built for the Company extracts structured data across thousands of conversations at a time.
Each extracted field can be searched or aggregated by the end user. In the product, customers define extractors. Each extractor is a small spec that we run over many conversations.
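To make that concrete, here's a minimal sketch of what such a spec might look like. The shape, field names, and values are our illustrative assumptions, not the Company's actual schema.

```python
# Hypothetical extractor spec; the schema here is illustrative, not the real one.
call_reason_extractor = {
    "name": "call_reason",
    "instructions": "Identify the primary reason the customer contacted support.",
    "output_type": "enum",
    "allowed_values": ["billing", "cancellation", "technical_issue", "other"],
}
```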
The focus of this post is what happened inside that extraction mechanism.
The Incident
It was Tuesday at 9:30am, and my phone rang while I was settling in with my morning coffee. It was the Company. A new customer they were onboarding was doing early runs in the field, and data was being extracted sometimes, but not every time.
The same input would sometimes produce a valid structured response; other times the model's output was empty, which the system interpreted as “no output.”
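To make that failure mode concrete, the handling looked roughly like the stand-in below. The function name and JSON format are our illustration, not the production code.

```python
import json

def parse_completion(completion: str) -> dict | None:
    """Stand-in for the production parser (name and format are illustrative)."""
    if not completion.strip():
        # The model terminated immediately; the system records this as "no output".
        return None
    return json.loads(completion)  # a non-empty completion parses into structured fields
```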
Not a great first impression, so time mattered.
We asked the Company to gather whatever they were able to share.
Because they operate in a highly regulated field, the best information they could provide was the extractor configurations the customer had created in the product, a single test conversation created by the team, and the expected outputs.
We ran their test conversation through their extractors and couldn’t reproduce the issue.
That’s when it became clear that basic user testing, call it “vibe-testing,” wasn’t going to cut it.
The failures were hiding in the tail. In a probabilistic system, a single run doesn’t prove much. We needed volume to make the rare, but costly, failures show up reliably.
Our Tools
Using the one conversation the Company shared as a prior, we seeded a simulation and generated thousands of variants by applying small perturbations during the generation process.
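In production the variants were model-generated, but the shape of the idea fits in a few lines. Here's a sketch using simple textual perturbations as a stand-in for the real generation step; every name below is ours.

```python
import random

FILLERS = ["um,", "uh,", "you know,", "actually,"]

def perturb(turn: str, rng: random.Random) -> str:
    """Apply one small, meaning-preserving edit to a conversation turn."""
    words = turn.split()
    edit = rng.choice(["filler", "lowercase", "none"])
    if edit == "filler":
        words.insert(rng.randrange(len(words) + 1), rng.choice(FILLERS))
    elif edit == "lowercase":
        words = [w.lower() for w in words]  # simulate inconsistent capitalization
    return " ".join(words)

def generate_variants(seed: list[str], n: int) -> list[list[str]]:
    rng = random.Random(42)  # fixed seed so failing variants are reproducible
    return [[perturb(turn, rng) for turn in seed] for _ in range(n)]
```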
Then, we simulated thousands of customer transactions through the production extraction pipeline, and logged prompts, responses, parse outcomes, and quality checks.
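The harness itself was simple in shape: run each variant through the pipeline and write one row per transaction. A sketch, where run_extraction stands in for whatever callable invokes the production pipeline, and the row fields mirror what we logged:

```python
import json

def run_batch(variants, extractors, run_extraction, log_path="runs.jsonl"):
    """Run every (variant, extractor) pair and log one row per transaction."""
    with open(log_path, "w") as log:
        for conversation in variants:
            for extractor in extractors:
                result = run_extraction(conversation, extractor)  # assumed pipeline entry point
                row = {
                    "extractor": extractor["name"],
                    "prompt": result["prompt"],
                    "response": result["response"],
                    "parsed": result["parsed"],  # None when the completion came back empty
                    "expected": extractor.get("expected"),
                    "quality_ok": result["parsed"] is not None,
                }
                log.write(json.dumps(row) + "\n")
```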
With that dataset in hand, we could slice the data quickly: first at an aerial view to see where failures clustered, then down to individual rows to inspect a single run end-to-end once a pattern emerged.
Each row was base reality: the full context, from input all the way to output, for one transaction.
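With one JSON line per run, the whole dataset drops straight into a DataFrame. A sketch assuming pandas and the hypothetical runs.jsonl schema from above:

```python
import pandas as pd

df = pd.read_json("runs.jsonl", lines=True)

# Aerial view: failure rate per extractor.
print(df.groupby("extractor")["quality_ok"].mean())

# Drill in: inspect one failing run end-to-end.
failing = df[~df["quality_ok"]]
print(failing.iloc[0][["prompt", "response", "parsed", "expected"]])
```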
What We Found
At scale, the failures finally showed up. The first thing we did was group by extractor, expecting one or two extractors to be the culprit. But the error rate looked surprisingly uniform.
So we kept pivoting.
Finally, we pivoted by expected output. Surprisingly, that’s where the story broke open. For one specific expected output value, the model’s completion was frequently empty. The model was terminating immediately, and our parser recorded it as “no output.”
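In the same hypothetical schema, the pivot that cracked it looks something like this:

```python
# Failure rates by extractor were uniform; pivoting on the expected output
# value is what exposed the spike.
df["empty"] = df["response"].str.strip() == ""
by_expected = df.groupby("expected")["empty"].mean().sort_values(ascending=False)
print(by_expected.head())  # one expected value accounts for most empty completions
```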
We adjusted the prompt, re-ran the simulation (the same batch and a fresh one), and eliminated that entire class of failures!
We then ran the updated system through our internal benchmark suite and verified we hadn't introduced regressions elsewhere.
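Mechanically, the regression check was the same slicing applied to the benchmark runs: compare per-extractor pass rates before and after the prompt change. A sketch, with the file names assumed:

```python
import pandas as pd

baseline = pd.read_json("benchmark_baseline.jsonl", lines=True)
updated = pd.read_json("benchmark_updated.jsonl", lines=True)

delta = (
    updated.groupby("extractor")["quality_ok"].mean()
    - baseline.groupby("extractor")["quality_ok"].mean()
)

# Any extractor whose pass rate dropped is a regression worth a look.
regressions = delta[delta < 0]
assert regressions.empty, f"Regressions found:\n{regressions}"
```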
Conclusion
By 1pm, we had isolated the failure mode, shipped a fix, and verified it didn’t introduce regressions. The Company was able to go back to their customer with a clear explanation of what was happening, why it was intermittent, and what changed. Transparency and the speed of the fix helped rebuild trust quickly.
In the end, the reality is unavoidable: when customers adopt AI, they will push it out of distribution, and issues will arise. Success won’t hinge on whether problems occur, but on how quickly you can identify, understand, and resolve them. That speed depends entirely on the groundwork you have laid long before the incident appears.