How to Build a Multi-Agent System (Part 3/3): From Evaluation to Production
- Cristian Dordea
- 3 days ago
- 13 min read

Welcome to Part 3: From Evaluation to Production (Multi-Agent System)
In Part 2, we transformed our design into a working system comprising three agents processing tickets, validation gates catching errors, and a memory architecture that keeps everything organized. However, here's the uncomfortable truth: you don't actually know if your system is working.
Sure, it runs. It generates responses. But how can we verify if those confident-sounding answers are actually correct? Without rigorous evaluation, you're essentially gambling with your company's reputation.
This is where most AI projects fail: not because of technical issues, but because teams can't demonstrate that their systems actually work. They deploy systems that look impressive in demos but fail to meet customer expectations in ways they never discover.
Think of the "95% accurate" system that's actually catching zero critical failures, or the helpful assistant that's confidently wrong half the time.
In this final part, we'll build the evaluation infrastructure that separates production-ready systems from expensive experiments. You'll learn how to:
catch failures before customers do
deploy with confidence
create a system that gets smarter every day
Let's turn the prototype into a product.
Phase 5: Building Your Evaluation System (Multi-Agent System)
The Critical Importance of Evals
Here's a truth that most teams learn too late: evaluations are the hidden lever that determines the success of multi-agent systems. You can have perfect prompts and the latest LLMs, but without rigorous evaluation, you're flying blind.
Step 1: Understanding Your Failure Modes First
Before creating any tests or metrics, you need to understand how your agents actually fail. This is the most overlooked, yet critical step in building reliable multi-agent systems.
Collect Real Failures
Start by gathering about 100 actual customer interactions from your pilot. Don't cherry-pick successes; focus on the disasters: negative feedback, endless back-and-forth exchanges, and escalations to humans. The edge cases that broke your system aren't anomalies; they're previews of problems waiting to happen at scale.
Document How Each Agent Fails
Be specific about failures. Don't just note "agent failed." Document exactly what went wrong. In our system, the Technical Support Agent repeatedly shared incorrect API documentation links (15 times). The Billing Agent couldn't explain pro-rated charges (12 confused customers). The Classifier Agent sent refund requests to technical support (8 times). The Response Generator explained simple tasks as if they were computer science lectures (10 bewildered users).
Find the Patterns
Once cataloged, failures reveal patterns. We found five categories covering 90% of problems:
Misclassification (30%): Wrong agent gets the request
Incomplete solutions (25%): Half-answers presented as complete
Wrong tone (20%): Talking to Fortune 500 customers like startups
State loss (15%): Agents forgetting context mid-conversation
Tool errors (10%): Calling the wrong APIs or databases
These patterns become your fix-it priority list.
Step 2: Design Your Eval Suite
Once you understand your failure modes, build targeted evals for each. The key decision: use code-based evals for objective checks and LLM-as-Judge for subjective qualities.
Binary Pass/Fail Criteria
Here's the thing about evals: ambiguity kills them. Force clarity with binary judgments. Don't use "somewhat good" or "mostly correct" ratings. Either the agent nailed it or it didn't.
For the Classifier Agent, it's simple: Did it route to the right specialist with over 70% confidence? If yes, pass. Wrong category or confidence below 70%? Fail. No negotiation.
The Technical Support Agent is only successful when it delivers the complete package: clear, numbered steps, what the customer should see after each step, and how long it will take. Missing any piece? That's a fail. "Just restart the app" isn't a solution.
Your Billing Agent needs to explain charges with actual numbers. If it can't tell a customer exactly why their bill is $47.83 instead of $39.99, it fails. Vague explanations about "pro-rated periods" without the math are insufficient.
The Response Generator Agent has three primary responsibilities: matching the customer's technical level, including all necessary information, and avoiding jargon.
Explaining a password reset like it's a PhD thesis? Fail.
Missing critical steps? Fail.
Wrong tone for the customer segment? You get the idea.
This binary approach feels harsh at first, but it brings incredible clarity. Your agents either meet the bar or they don't, and now you know exactly what needs to be fixed.
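To make the code-based side of this concrete, here's a minimal sketch of a binary eval for the Classifier Agent. It assumes the classifier returns a dict with "category" and "confidence" fields; those names are illustrative, not a fixed schema.

# Minimal code-based eval for the Classifier Agent: binary pass/fail only.
# Field names ("category", "confidence") are illustrative assumptions, not a fixed schema.

def eval_classifier(output: dict, expected_category: str, min_confidence: float = 0.70) -> bool:
    """Pass only if the classifier picked the expected specialist with enough confidence."""
    right_route = output.get("category") == expected_category
    confident = output.get("confidence", 0.0) >= min_confidence
    return right_route and confident

# Example: a ticket labeled "billing" in your eval dataset
print(eval_classifier({"category": "billing", "confidence": 0.82}, "billing"))    # True  -> PASS
print(eval_classifier({"category": "technical", "confidence": 0.91}, "billing"))  # False -> FAIL (wrong route)
print(eval_classifier({"category": "billing", "confidence": 0.55}, "billing"))    # False -> FAIL (low confidence)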
LLM-as-Judge for Subjective Evaluation
We couldn't manually review every response, so we built AI evaluators. However, here's the key: we established specific evaluation criteria. Vague instructions produce vague evaluations.
Take the empathy evaluation for the Response Generator Agent. We don't just ask "Is this empathetic?" Instead, we instruct: "Check if this response acknowledges the customer's specific frustration. Look for phrases that mirror their concern, not generic empathy. If a customer says 'I'm losing money every minute,' the response should acknowledge the financial impact, not just offer empty apologies."
The difference? Night and day. Generic instructions get you "I understand your frustration" for everything. Specific instructions get you: "I see this downtime is directly impacting your revenue. Let's fix this immediately."
Here's an actual eval prompt that works:
EVAL PROMPT:
You are evaluating a customer service response for appropriate tone and completeness.
CONTEXT:
- Customer Segment: Enterprise
- Customer Message: "This is the THIRD time I'm explaining that your API is returning 500 errors!"
- Agent Response: [RESPONSE_TO_EVALUATE]
EVALUATION CRITERIA:
PASS if ALL of the following are true:
- Acknowledges customer's frustration explicitly
- References the "third time" to show we heard them
- Provides specific technical next steps
- Includes timeline for resolution
- Offers an escalation path
FAIL if ANY of the following occur:
- Generic apology without acknowledging specific frustration
- No mention of the repeated explanations
- Vague solutions like "try again later"
- Defensive or dismissive language
OUTPUT: PASS/FAIL with one specific reason

This specificity transformed evaluation accuracy from 60% to 85% agreement with human reviewers. The key is being explicit about what constitutes success and failure. No room for interpretation.
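If you want to automate this check, here's a minimal sketch of running a prompt like the one above through an LLM judge. It assumes an OpenAI-style Python client; swap in whichever provider and judge model you actually use. Note the strict parsing: anything that isn't an explicit PASS counts as a FAIL.

# A minimal LLM-as-Judge wrapper for eval prompts like the one above.
# Assumes an OpenAI-style chat client; adapt the call to your own provider.
from openai import OpenAI

client = OpenAI()

def judge(eval_prompt: str, response_to_evaluate: str, model: str = "gpt-4o-mini") -> tuple[bool, str]:
    """Run one LLM-as-Judge evaluation and parse a strict PASS/FAIL verdict."""
    prompt = eval_prompt.replace("[RESPONSE_TO_EVALUATE]", response_to_evaluate)
    result = client.chat.completions.create(
        model=model,        # assumption: use whichever judge model you trust
        temperature=0,      # deterministic judging
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = result.choices[0].message.content.strip()
    # Strict parsing: anything that is not an explicit PASS counts as FAIL.
    return verdict.upper().startswith("PASS"), verdict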
Step 3: Measure What Actually Matters
Don't just measure accuracy, it's misleading when most interactions succeed. If 95% of your responses are naturally good, 95% accuracy might mean you catch zero failures. It's like a security guard who's accurate 99% of the time because they wave everyone through.
Instead, focus on these critical metrics:
True Positive Rate (TPR) and True Negative Rate (TNR)
These two numbers tell you what's really happening.
True Positive Rate answers: Of all the successful resolutions, how many did we correctly identify as good? That's your confidence in green-lighting responses.
But here's the metric that actually matters: True Negative Rate. Of all the failures, how many did we catch before customers saw them? This is your safety net.
Many teams discover their initial TNR is around 45%, meaning more than half of failures reach customers. While everyone celebrates 95% accuracy, customers are seeing garbage half the time it matters. It's essentially flipping a coin on whether inadequate responses get blocked.
The target that actually protects your customers: >80% for both TPR and TNR. Until you hit those numbers, you're not ready for production. These metrics reveal the truth that overall accuracy hides.
How to Actually Calculate TPR and TNR
Let's make this concrete.
Say you evaluated 100 customer service responses:
60 were actually good responses
40 were actually inadequate responses
Your eval system judged:
55 good responses as PASS (True Positives)
5 good responses as FAIL (False Negatives)
30 inadequate responses as FAIL (True Negatives)
10 inadequate responses as PASS (False Positives)
Your metrics:
TPR = 55/60 = 92% (You correctly identified 92% of good responses)
TNR = 30/40 = 75% (You caught 75% of inadequate responses before customers saw them)
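If you prefer to sanity-check those numbers in code, here's the same arithmetic as a tiny sketch:

# The worked example above, as code: confusion-matrix counts for 100 evaluated responses.
true_positives = 55    # good responses judged PASS
false_negatives = 5    # good responses judged FAIL
true_negatives = 30    # inadequate responses judged FAIL
false_positives = 10   # inadequate responses judged PASS

tpr = true_positives / (true_positives + false_negatives)    # 55 / 60 = 0.92
tnr = true_negatives / (true_negatives + false_positives)    # 30 / 40 = 0.75
accuracy = (true_positives + true_negatives) / 100           # 85 / 100 = 0.85

print(f"TPR {tpr:.0%}, TNR {tnr:.0%}, overall accuracy {accuracy:.0%}")
# TPR 92%, TNR 75%, overall accuracy 85% -- the 85% hides the 25% of garbage reaching customers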
That 75% TNR means 25% of garbage responses are reaching customers. This is what needs fixing, not the overall 85% accuracy that looks impressive on paper.

Implementation in Practice:
To track these metrics in your system, create a simple evaluation log. For each response your system generates, record three things: what your eval judged (PASS or FAIL), what the actual quality was (good or inadequate, verified by human review or customer feedback), and which agent generated it.
Every week, calculate your TPR and TNR by counting:
How many good responses did you correctly pass (True Positives)
How many inadequate responses did you correctly catch (True Negatives)
The totals for each category
The key is consistency. Use the same evaluation criteria every time. Track these numbers by agent, failure type, and customer segment.
Watch the TNR especially, it's your safety net. When TNR drops below 75%, that's your signal to investigate immediately.
Most teams begin with a simple spreadsheet, then transition to automated tracking once they've proven that the metrics are meaningful. The format doesn't matter as much as the discipline of measuring consistently.
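Here's one way that weekly rollup might look in code, assuming a simple CSV log with agent, eval_verdict, and actual_quality columns (the field names and format are illustrative; a spreadsheet works just as well at first):

# A minimal evaluation log rollup: weekly TPR/TNR per agent.
import csv
from collections import defaultdict

def weekly_metrics(log_path: str) -> dict:
    """Each row: agent, eval_verdict (PASS/FAIL), actual_quality (good/inadequate)."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            c = counts[row["agent"]]
            passed = row["eval_verdict"] == "PASS"
            good = row["actual_quality"] == "good"
            if good and passed:
                c["tp"] += 1          # good response correctly passed
            elif good and not passed:
                c["fn"] += 1          # good response wrongly failed
            elif not good and not passed:
                c["tn"] += 1          # inadequate response correctly caught
            else:
                c["fp"] += 1          # inadequate response slipped through
    report = {}
    for agent, c in counts.items():
        tpr = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else None
        tnr = c["tn"] / (c["tn"] + c["fp"]) if (c["tn"] + c["fp"]) else None
        report[agent] = {"TPR": tpr, "TNR": tnr}
    return report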
Step 4: Test Multi-Agent Interactions
Individual agents working perfectly do not necessarily mean the system works. Multi-agent systems have unique failure modes that single-agent evals miss.
What happens when agents disagree? (Parallel Processing Conflicts)
When agents work in parallel, they can return conflicting information. Your Technical Support Agent says, "Keep the product, we'll fix it," while your Billing Agent insists, "Return required for refund." Now your Solution Validator Agent has to act as the referee. Does it produce a coherent response that addresses both concerns, or nonsense that tries to merge contradictory advice?
Testing This: Feed deliberately ambiguous scenarios like: "I bought your premium plan yesterday, but it's not working. I want my money back, but I really need this for tomorrow's presentation." Run 50 such conflicts; if fewer than 80% produce coherent responses, you have a problem.
One approach to fix it: Add a conflict resolution prompt to your Solution Validator: "When agents provide different recommendations, synthesize them into a single response that addresses all concerns in priority order: urgent needs first, then long-term solutions."
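A sketch of that conflict test might look like this; run_pipeline and judge_coherence are hypothetical stand-ins for your own system entry point and an LLM-based coherence judge:

# A harness skeleton for the parallel-conflict test: feed ambiguous scenarios, count coherent responses.

AMBIGUOUS_SCENARIOS = [
    "I bought your premium plan yesterday, but it's not working. "
    "I want my money back, but I really need this for tomorrow's presentation.",
    # ...aim for ~50 scenarios that force billing and technical advice to collide
]

def run_pipeline(ticket: str) -> str:
    """Hypothetical entry point: replace with a call into your multi-agent system."""
    raise NotImplementedError

def judge_coherence(response: str) -> bool:
    """Hypothetical LLM-as-Judge check: does the response address ALL concerns coherently?"""
    raise NotImplementedError

def conflict_pass_rate(scenarios: list[str]) -> float:
    coherent = sum(1 for s in scenarios if judge_coherence(run_pipeline(s)))
    return coherent / len(scenarios)

# Gate: if conflict_pass_rate(AMBIGUOUS_SCENARIOS) < 0.80, fix the Solution Validator before shipping.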
Can the system recover from state loss? (State Corruption Testing)
Information flow between agents is fragile. Here's a critical test: Your Classifier Agent correctly identifies a complex issue as "technical," but the state update fails. The Billing Agent receives the request with zero context. It doesn't even know what the customer asked about. Does your system detect this loss and recover, or does the Billing Agent start hallucinating answers?
Testing This: After your Classifier categorizes a request, deliberately pass an empty context or wrong category to the next agent. Most systems fail spectacularly, generating confident responses based on nothing.
One approach to fix it: Add pre-checks to every agent: "If context.category is null or context.customer_query is empty, respond: 'I need more information to help you properly' and escalate." Never let agents proceed without a validated context.
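Here's a minimal sketch of that pre-check; the context field names are illustrative, and escalate_to_human is a hypothetical hook into your own escalation path:

# A pre-check every agent runs before acting on shared state.
# The context fields ("category", "customer_query") are illustrative, not a fixed schema.
SAFE_FALLBACK = "I need more information to help you properly."

def validate_context(context: dict) -> tuple[bool, str]:
    """Return (ok, fallback_message). If the context is corrupted, never let the agent proceed."""
    if not context.get("category"):
        return False, SAFE_FALLBACK        # classification lost -> escalate, don't guess
    if not context.get("customer_query"):
        return False, SAFE_FALLBACK        # original request lost -> escalate, don't hallucinate
    return True, ""

# In each agent (escalate_to_human is a hypothetical hook):
# ok, fallback = validate_context(context)
# if not ok:
#     escalate_to_human(context, reason="state loss")
#     return fallback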
Do errors cascade or get contained? (Cascade Failure Testing)
One agent's error shouldn't take down the whole system. Test this: Your Classifier Agent outputs malformed JSON. Does the Solution Validator catch it and handle it gracefully, or does everything crash?
The difference between graceful degradation and cascade failure separates production-ready systems from demos. If one agent having a bad day brings down your entire operation, you're not ready for real customers.
Testing This: Inject malformed outputs at each stage: missing JSON braces, exceeded token limits, and wrong format types. Document whether errors propagate or get contained.
One approach to fix it is for every agent to validate inputs before processing and outputs before passing them on. If validation fails, return a safe default response rather than an error that breaks the chain.
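A minimal sketch of that containment pattern, using the Classifier's JSON output as the example (the expected keys are assumptions about your own schema):

# Containing one agent's bad output instead of letting it cascade.
import json

SAFE_DEFAULT = {"category": "general", "confidence": 0.0, "needs_human_review": True}

def parse_classifier_output(raw: str) -> dict:
    """Never pass malformed output downstream; degrade gracefully instead."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return SAFE_DEFAULT                  # malformed JSON -> contained, not crashed
    if "category" not in data or "confidence" not in data:
        return SAFE_DEFAULT                  # wrong shape -> contained, not crashed
    return data

print(parse_classifier_output('{"category": "technical", "confidence": 0.9}'))   # passes through
print(parse_classifier_output('{"category": "technical", "confidence": 0.9'))    # missing brace -> safe default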
Step 5: Continuous Eval-Driven Improvement
Evals aren't a one-time setup; they drive continuous improvement. Think of them as your system's GPS, constantly showing you where you are and where to go next.
Weekly Eval Review Process
Every week, run through this focused review cycle:
First, identify your top failures. Which agent has the lowest TNR? What failure pattern keeps showing up? Don't try to fix everything; focus on the most significant pain point.
Next, dig into root causes. Is it a prompt that needs clarity? Missing context between agents? An agent calling the wrong tool? Get specific about why it's failing, not just that it's failing.
Then apply a targeted fix. Perhaps you adjust a specific prompt, add a validation gate, or enhance how context is passed between agents. One surgical change, not a system overhaul.
Finally, measure the impact. Did TNR actually improve? Any unexpected side effects? Sometimes fixing one problem creates another, that's fine as long as you're moving forward overall.
A Real Example in Action
Here's how this process typically plays out:
You discover your Billing Agent has a 60% TNR, meaning 40% of its inadequate responses slip through to customers. Root cause analysis reveals it doesn't understand pro-rated billing calculations.
The fix: Add specific pro-ration examples and calculation rules to its prompt.
The result?
TNR jumps to 85%. Yes, response time increased by half a second, but that's a worthy trade-off for actually getting billing explanations right.
This isn't random tinkering; evals tell you exactly where to focus your effort for maximum impact.
Creating Your Eval Dataset
The way you split your data determines whether you're building a robust system or creating misleading metrics. Many teams discover that their "95% accurate" system is actually failing to meet the needs of half their customers; they were testing the wrong things.
Split your data strategically with the 20-40-40 approach:
Use 20% of your data as your training set, where you develop and refine your LLM-as-Judge prompts. This is your experimentation playground.
Allocate 40% to your dev set; your iteration space for testing improvements. This is where you'll spend most of your time, tweaking and measuring.
Reserve 40% of your data as a test set, which should remain completely untouched during development. This pristine data provides you with an unbiased view of your system's performance.
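Here's a minimal sketch of that split, with a fixed seed so the test set stays frozen across iterations:

# A minimal 20-40-40 split of collected interactions into train / dev / test.
import random

def split_20_40_40(examples: list, seed: int = 42) -> tuple[list, list, list]:
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)          # fixed seed keeps the split stable
    n = len(shuffled)
    train = shuffled[: int(0.2 * n)]               # prompt-development playground
    dev = shuffled[int(0.2 * n): int(0.6 * n)]     # iteration space
    test = shuffled[int(0.6 * n):]                 # untouched until final measurement
    return train, dev, test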
Critical Edge Cases to Include
Don't just test the happy path; focus on your edge cases too. They may account for only 5% of the volume, but they cause 50% of the escalations. These can include:
Frustrated customers: "This is the THIRD TIME I'm explaining this issue". Tests whether your system maintains patience and context when users are angry.
Multiple simultaneous issues: "Can't log in, was charged twice, and needs urgent API help". This challenges your routing and prioritization logic.
Misused technical terms: When non-technical users confidently use incorrect terminology, it reveals whether your agents can decipher intent despite the wrong vocabulary.
Language barriers: Broken English or auto-translated requests; tests robustness against imperfect communication.
Time-sensitive requests: After-hours emergencies validate your urgency detection and escalation paths.
Distribute these edge cases proportionally across all three data sets in the 20-40-40 approach. They're your insurance policy against real-world chaos.
Operationalizing Your Evals
Integration with CI/CD
Every deployment automatically runs four essential checks:
unit evals for each agent
integration evals for agent handoffs
end-to-end conversation flows
performance benchmarks
If any check fails, deployment stops. This transforms anxious weekend deployments into confident weekday releases.
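As a sketch, the deployment gate can be as simple as a script that exits non-zero when any suite fails; the lambdas below are placeholders for your real eval suites:

# A CI gate sketch: run the four eval suites and block deployment if any fails.
import sys
from typing import Callable

def run_gates(checks: dict[str, Callable[[], bool]]) -> int:
    """Run every eval gate in order; any failure blocks the deployment."""
    for name, check in checks.items():
        if not check():
            print(f"DEPLOY BLOCKED: {name} failed")
            return 1
    print("All eval gates passed")
    return 0

if __name__ == "__main__":
    # Wire these to your real eval suites; the lambdas are placeholders.
    sys.exit(run_gates({
        "unit evals (per agent)": lambda: True,
        "integration evals (agent handoffs)": lambda: True,
        "end-to-end conversation flows": lambda: True,
        "performance benchmarks": lambda: True,
    }))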
Real-time Monitoring
Track these metrics daily in production:
True Positive/Negative Rates: Rolling 7-day average per agent
Failure Pattern Distribution: Which errors are happening most
Eval Pass Rate Trends: Going up or down?
Escalation Correlation: Do eval failures predict human takeovers?
These aren't just numbers, they're early warning signals.
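A minimal sketch of the rolling 7-day TNR per agent, assuming each log record carries a timestamp, agent, eval_verdict, and actual_quality (field names are illustrative):

# Rolling 7-day TNR per agent from the evaluation log.
from datetime import datetime, timedelta

def rolling_tnr(records: list[dict], agent: str, as_of: datetime, window_days: int = 7) -> float | None:
    """Of the responses that were actually inadequate in the window, how many did the evals catch?"""
    cutoff = as_of - timedelta(days=window_days)
    failures = [r for r in records
                if r["agent"] == agent
                and cutoff <= r["timestamp"] <= as_of
                and r["actual_quality"] == "inadequate"]
    if not failures:
        return None                      # no observed failures in the window
    caught = sum(1 for r in failures if r["eval_verdict"] == "FAIL")
    return caught / len(failures)

# Alarm rule from above: investigate immediately when any agent's rolling TNR drops below 0.75.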
Eval Dashboards That Matter
Forget the 47-metric dashboard that overwhelms everyone. Focus on five actionable insights:
The Executive Dashboard:
Which agent is failing the most right now?
Top 5 failure patterns this week (not last month)
Single trend graph showing TNR improvement or decay
Correlation between eval failures and customer complaints
Human escalation percentage
This single page gets reviewed every Friday. If leadership can't understand your dashboard in 30 seconds, simplify it. Complexity doesn't impress anyone when the system is down.
Phase 6: Production Readiness - Going Live
The temptation to flip the switch immediately is real, especially with the pressure from executives to go live. But rushing to production is like opening a restaurant without training your staff. You need a smarter approach.
Monitor Everything That Matters
Before going live, track these six survival metrics from day one:
Response time per agent: Which specialist is your bottleneck?
First-contact resolution rate: Solving problems or playing ticket ping-pong?
Escalation frequency: How often humans rescue the AI
Cost per transaction: The number that makes CFOs smile
Customer satisfaction scores: The truth about whether this works
Eval pass rates per agent: Your early warning system
Check these every morning, just like a pilot checks instruments before takeoff.
Adding Advanced Capabilities (When You're Ready)
Once basics are solid, consider Multi-Agent RAG for complex questions. When a customer asks, "How do I integrate your API with Salesforce?", three specialists work simultaneously:
Knowledge Retrieval Agent searches the API documentation
Case History Agent finds similar Salesforce integrations
The Technical Support Agent checks recent changes and known issues
These perspectives synthesize into one comprehensive answer, three experts collaborating instead of one generalist guessing.
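A minimal sketch of that parallel fan-out, with hypothetical stand-ins for the three specialists:

# Parallel retrieval fan-out for Multi-Agent RAG.
import asyncio

# Hypothetical specialist agents -- replace the bodies with real retrieval calls.
async def search_api_docs(q: str) -> str:
    return f"[API docs relevant to: {q}]"            # Knowledge Retrieval Agent

async def find_similar_cases(q: str) -> str:
    return f"[Past integrations similar to: {q}]"    # Case History Agent

async def check_known_issues(q: str) -> str:
    return f"[Recent changes / known issues for: {q}]"  # Technical Support Agent

async def answer(question: str) -> str:
    # Fan out to the three specialists concurrently, then synthesize one response.
    docs, cases, issues = await asyncio.gather(
        search_api_docs(question),
        find_similar_cases(question),
        check_known_issues(question),
    )
    return "\n".join([docs, cases, issues])  # in practice, a synthesis agent merges these

print(asyncio.run(answer("How do I integrate your API with Salesforce?")))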
The Phased Rollout Strategy
Use evals as gate checks between phases:
Shadow Mode: Run AI parallel to humans without sending responses to customers. Must achieve an 80% evaluation pass rate to proceed. You'll discover that about 20% of responses have subtle errors humans would catch.
Beta Testing: Route 10% of technical tickets to AI with human backup. Monitor eval metrics against human baseline. Interestingly, AI consistency often beats human variation.
Expansion: Increase to 50% technical, add 10% billing. Only expand if TNR stays above 75%. Real patterns and edge cases emerge. Team confidence builds as metrics hold.
Full Production: All tickets are processed through AI first, with humans ready for escalations. Continuous eval monitoring catches issues before customers notice them.
Each phase de-risks the next. Skip these gates at your own peril.
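One way to encode those gates, using the thresholds above (phase names and numbers mirror the plan; the actual traffic routing isn't shown):

# Gate checks between rollout phases; each gate must pass before advancing.
PHASES = [
    {"name": "Shadow Mode",     "ai_traffic": 0.0,  "gate": lambda m: m["eval_pass_rate"] >= 0.80},
    {"name": "Beta Testing",    "ai_traffic": 0.10, "gate": lambda m: m["tnr"] >= 0.75},
    {"name": "Expansion",       "ai_traffic": 0.50, "gate": lambda m: m["tnr"] >= 0.75},
    {"name": "Full Production", "ai_traffic": 1.0,  "gate": lambda m: True},
]

def next_phase(current_index: int, metrics: dict) -> int:
    """Advance only if the current phase's gate passes; otherwise hold and investigate."""
    if PHASES[current_index]["gate"](metrics):
        return min(current_index + 1, len(PHASES) - 1)
    return current_index

print(PHASES[next_phase(0, {"eval_pass_rate": 0.83, "tnr": 0.70})]["name"])  # Beta Testing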
Phase 7: Continuous Optimization - Life After Launch
Learning From Your Evals
The real magic happens after launch. Your evals become a continuous improvement engine.
Watch for patterns. When 30% of Technical Support Agent failures involve authentication, don't endlessly tweak prompts; create a specialized Authentication Agent. Those tickets drop from 5-minute resolutions to 30 seconds, with 95% pass rates.
Another discovery: The Response Generator continues to fail tone evaluations for startup customers. You're addressing scrappy founders like Fortune 500 executives. Add customer segment awareness to the prompt and watch satisfaction scores jump 0.3 points overnight.
Building Institutional Memory Through Evals
Every failure becomes future prevention:
Failed eval examples → Training data for next iteration
Successful eval criteria → Quality standards for new features
Edge cases that break evals → Regression test cases
Eval improvements → Documented best practices
This isn't just documentation, it's your system getting smarter every week.
The Path Forward
After months of building, breaking, and fixing, here are the lessons that matter:
Your first version will humble you. Ship it anyway and improve.
Specialization is your superpower. Narrow agents outperform generalists every time.
Memory matters more than model. State management makes or breaks the experience.
Gates catch what testing misses. Validation between agents prevents cascade failures.
Most critically, without evaluations, you're flying blind. They're not optional, they're essential. Binary pass/fail brings clarity where "pretty good" creates confusion. TNR beats accuracy, catching failures matters more than overall percentages. And production teaches what testing can't; real customers will use your system in ways you never imagined.
The counterintuitive truth?
Starting with three agents is better than implementing eleven. Your evaluation system is as essential as your agents themselves. Production-ready doesn't mean perfect; it means having the infrastructure to detect problems, understand failures, and improve systematically.
Your agents will make mistakes. Your orchestration will have gaps. Edge cases will surprise you. That's not failure, that's data. The multi-agent system you ship next week won't be the one running next year. It will be better, shaped by thousands of real interactions, guided by evaluation metrics, and refined through continuous learning.
This is the reality of building AI systems that actually work: embracing imperfection while building the tools to improve. Ship early, measure obsessively, iterate relentlessly.
Your customers are waiting. Your evaluation framework is ready. Your agents are trained.
Time to go live.


