789 Labs

Debugging the Tail: One Example, a Thousand Tests

Feb 18, 2026 · Julien Hoachuck

Introduction

At 789 Labs, we have the luxury of not only building AI products for our clients but also staying involved after they’re released. We don’t just build to ship; we build to maintain.

In this post, we’ll walk through a real incident where our client was receiving customer complaints, but couldn’t reproduce the issue with traditional or vibe-based testing. We tracked it down by simulating their problematic customer transactions at scale.

Background

Let me give you a bit of background about the client’s product. We’ll call our client “the Company.” The product we built for the Company (simplified here) extracts structured data across thousands of conversations at a time.

Each extracted field can be searched or aggregated by the end user. In the product, customers define extractors. Each extractor is a small spec that we run over many conversations. The focus of this post is what happened inside that extraction mechanism.
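The Company’s real schema is more involved, but as a rough, hypothetical sketch (the names and fields below are ours, not theirs), an extractor boils down to a field name, an instruction, and an expected output type:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Extractor:
    """Hypothetical extractor spec; the real product's schema differs."""
    name: str                                  # field the customer wants, e.g. "cancellation_reason"
    instructions: str                          # natural-language description of what to extract
    output_type: Literal["string", "number", "boolean", "category"] = "string"
    allowed_values: list[str] | None = None    # only used for categorical fields

# Each extractor is applied to every conversation in a batch, producing one
# structured value per (conversation, extractor) pair that can then be
# searched or aggregated.
```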

The Incident

It was Tuesday at 9:30 am PST, and our client called while I was settling in with my morning coffee. A new customer they were onboarding was doing early runs in the field, and extraction was working sometimes, but not always.

The same input would occasionally produce a valid structured response, and other times the model output was empty, which the system interpreted as “no answer.”

Not a great first impression, so time mattered.

We asked the Company to gather whatever they were allowed to share. Because they operate in a highly regulated field, the best they could provide was the extractor configurations the customer had created in the product, a single test conversation written by the team, and the expected outputs.

We ran their test conversation through their extractors and couldn’t reproduce the issue. That’s when it became clear that basic user testing (call it “vibe-testing”) wasn’t going to cut it. The failures were likely hiding in the tail.

In a probabilistic system, a single run doesn’t prove much. We needed volume to make the rare (but costly) failures show up reliably.

Our Tools

Using the one conversation the Company shared as a prior, we seeded a simulation and generated thousands of variants by applying small perturbations to increase variance.
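The generator itself is nothing fancy. Here’s a minimal sketch of the idea in Python; the conversation structure and the specific perturbations are illustrative stand-ins, not the Company’s actual tooling:

```python
import random

def perturb(conversation: list[dict], rng: random.Random) -> list[dict]:
    """Apply one small, label-preserving edit to a seed conversation.

    The conversation format (a list of turns with a "text" field) and the
    edits below are illustrative; in practice the perturbations can be
    anything that adds variance without changing the expected answer.
    """
    conv = [dict(turn) for turn in conversation]   # copy so the seed stays untouched
    turn = rng.choice(conv)
    edit = rng.choice(["filler", "typo", "truncate"])
    if edit == "filler":
        turn["text"] = "um, " + turn["text"]
    elif edit == "typo":
        turn["text"] = turn["text"].replace("the", "teh", 1)
    else:  # truncate the turn, keeping at least one character
        turn["text"] = turn["text"][: max(1, len(turn["text"]) * 3 // 4)]
    return conv

def generate_variants(seed: list[dict], n: int = 5000) -> list[list[dict]]:
    rng = random.Random(0)                         # fixed seed so the batch is reproducible
    return [perturb(seed, rng) for _ in range(n)]
```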

We ran the batch through a simulation of the production extraction pipeline, using the same model configuration, then logged prompts, responses, token usage, parse outcomes, and quality checks for every run.
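A stripped-down sketch of that loop is below. The call_model, build_prompt, parse, and check_quality hooks are placeholders for the production components; the point is that every run produces one self-contained log row:

```python
import json
import time

def run_batch(variants, extractors, call_model, build_prompt, parse, check_quality,
              log_path="runs.jsonl"):
    """Run every (variant, extractor) pair and log one self-contained row per run.

    call_model, build_prompt, parse, and check_quality stand in for the
    production pipeline; only the shape of the logged row matters here.
    """
    with open(log_path, "w") as log:
        for conv_id, conversation in enumerate(variants):
            for extractor in extractors:
                prompt = build_prompt(extractor, conversation)
                started = time.time()
                response = call_model(prompt)              # same model config as production
                parsed = parse(extractor, response.text)   # an empty completion parses to "no_answer"
                row = {
                    "conversation_id": conv_id,
                    "extractor": extractor.name,
                    "prompt": prompt,
                    "response": response.text,
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "parse_outcome": parsed.status,        # "ok", "no_answer", "parse_error", ...
                    "value": parsed.value,
                    "quality": check_quality(extractor, parsed),
                    "latency_s": round(time.time() - started, 3),
                }
                log.write(json.dumps(row) + "\n")
```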

With that dataset in hand, we could slice the results quickly: first at a high level to see where failures clustered, then down to individual rows to inspect a single run end-to-end. Each row was base reality: the full context, from input all the way to output, for one transaction.

What We Found

At scale, the failures finally showed up. The first thing we did was group by extractor, expecting one or two extractors to be the culprits. But the error rate looked surprisingly uniform across all of them.

So we kept pivoting. Finally, we pivoted by expected output. Weirdly, that’s where the story broke open.
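With every run in one JSONL file, those pivots are one-liners. A sketch in pandas, using the hypothetical column names from the logging sketch above (expected.csv standing in for the team’s expected outputs):

```python
import pandas as pd

# runs.jsonl is the per-run log from the harness above; expected.csv (hypothetical)
# maps each (conversation_id, extractor) pair to the output the team expected.
runs = pd.read_json("runs.jsonl", lines=True)
expected = pd.read_csv("expected.csv")
runs = runs.merge(expected, on=["conversation_id", "extractor"])
runs["failed"] = runs["parse_outcome"] != "ok"

# First cut: is one extractor the culprit? (It wasn't; rates were roughly uniform.)
print(runs.groupby("extractor")["failed"].mean().sort_values(ascending=False))

# The cut that broke the story open: failure rate by expected output value.
print(runs.groupby("expected_output")["failed"].mean().sort_values(ascending=False))
```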

For one specific expected output value, the model’s completion was frequently empty. The model was terminating immediately, and our parser recorded it as “no answer.”

We adjusted the prompt, re-ran the simulation (same batch and a fresh one), and eliminated that entire class of failures.

We then ran the updated system against our internal benchmark suite to verify we hadn’t introduced regressions elsewhere.
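That check is the same kind of aggregation, just run over the benchmark suite with the old and updated prompts. A sketch, assuming the same hypothetical log format (and that each row carries its expected output):

```python
import pandas as pd

# before.jsonl / after.jsonl: benchmark runs with the old and updated prompt
# (hypothetical filenames, same row format as above, including expected_output).
before = pd.read_json("before.jsonl", lines=True)
after = pd.read_json("after.jsonl", lines=True)

def accuracy_by_extractor(df: pd.DataFrame) -> pd.Series:
    return (df["value"] == df["expected_output"]).groupby(df["extractor"]).mean()

delta = accuracy_by_extractor(after) - accuracy_by_extractor(before)
regressions = delta[delta < -0.01]   # flag anything that got more than a point worse
print(regressions if not regressions.empty else "no regressions")
```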

Conclusion

By 1 pm, we had isolated the failure mode, shipped a fix, and verified it didn’t introduce regressions. The Company was able to go back to their customer with a clear explanation of what was happening, why it was intermittent, and what changed.

That transparency, plus the speed of the fix, helped rebuild trust quickly.

The takeaway is simple: when your customers are adopting AI, they’re going to take it out of distribution. It’s not whether issues happen. They certainly will. It’s how fast you can diagnose and correct them.

And that speed is mostly a function of how prepared you are before the incident ever shows up.