Key takeaways (May 17, 2026)
- OpenAI, Anthropic and DeepMind all publish public agent evaluation harnesses as of May 2026 — internal-only eval stacks are now the exception.
- Common production failures cluster in three areas: tool-call hallucination, state corruption across long sessions, and unsafe browser/file actions.
- Reliability work shifted from prompt engineering to harness engineering: sandboxing, replay, deterministic seeds, and trace diffing.
- Most teams now budget at least 30% of agent build time for evaluation infrastructure.
An AI agent evaluation framework is the difference between a useful worker and an expensive intern with shell access. This is the system I would use before letting any agent read a repo, call tools, write files, send emails, query a database, or touch customer data.
The goal is simple: prove the agent can finish real work without guessing, leaking data, burning money, or taking actions nobody approved. Model benchmarks help, but they do not answer the production question. Your agent is not just a model. It is a model plus tools, memory, permissions, prompts, retries, logs, and human review.
That whole stack needs to be tested.
The short version
If you only remember one thing, remember this: test the job, not the chatbot.
The best AI agent evaluation framework starts with real tasks from your product, repo, or operations queue. Then it scores the agent on the parts that actually break in production.
| Evaluation area | What to measure | Pass signal |
|---|---|---|
| Task success | Did the job finish correctly? | Correct output, no hidden cleanup needed |
| Tool use | Did it call the right tool at the right time? | Few unnecessary calls, no unsafe calls |
| Grounding | Did it use evidence from the right source? | Links, logs, files, or database rows match the answer |
| Memory | Did it remember only what it should? | Keeps useful context, drops stale or private data |
| Security | Can it be tricked into bad actions? | Blocks prompt injection, bad URLs, broad access |
| Cost and speed | Is the task affordable and fast enough? | Cost and p95 latency stay inside limits |
| Recovery | What happens after failure? | Agent stops, asks, retries safely, or rolls back |
| Regression | Does quality drift after updates? | Canary tasks catch drops before users do |
I would not put an agent into production until it passes all eight. A narrow agent with boring permissions can pass with a small test set. A coding agent, finance agent, support refund agent, or infrastructure agent needs a much harsher gate.
What an AI agent evaluation framework actually is
An AI agent evaluation framework is a repeatable way to test whether an agent can do the job you are giving it. It is not a single benchmark score. It is closer to a test suite, an audit log, and a release checklist living in the same place.
For a normal model, you might ask, “Did it answer correctly?” For an agent, the better questions are:
- Did it choose the right plan?
- Did it use the right files, tools, and data?
- Did it stop when it hit uncertainty?
- Did it ask for approval before a risky action?
- Did it create a record someone can audit later?
- Did it stay inside budget?
- Did the same task still work after the model or prompt changed?
That last question matters more than people think. In April 2026, Anthropic published a Claude Code quality postmortem after users reported worse behavior. The causes were not a single weak model. They were product-layer changes: default reasoning effort, a caching bug, and a prompt change that reduced verbosity but hurt coding quality.
That is the agent lesson. Quality can break above the model layer.
If your eval only tests one prompt against one model, it will miss the failures that come from memory, tool routing, permissions, session state, and wrapper prompts. Those are exactly the failures that hurt users.
Why this topic matters now
Agents are moving from demos into real workflows. Coding agents open pull requests. Sales agents enrich leads. Support agents issue refunds. Security agents triage alerts. Finance agents reconcile invoices. The mistake is treating these as “smart chat” systems.
They are action systems.
That changes the risk. A bad answer is annoying. A bad tool call can delete data, expose a secret, send a wrong email, merge broken code, or approve a fake invoice. The OWASP Top 10 for Agentic Applications 2026 exists because agents have a different attack surface: goals, tools, memory, identities, delegation, and multi-step action.
NIST makes the same point from the risk side. The AI Risk Management Framework is not agent-specific, but its map, measure, manage, and govern loop fits agent work well. You cannot manage what you have not mapped. You cannot govern what you do not log. You cannot trust what you do not measure.
My blunt take: if a team says “we tested the model” but cannot show task logs, failure labels, tool-call traces, and rollback rules, they have not tested the agent.
Start with real tasks, not synthetic puzzles
The first mistake is building evals from clever prompts instead of real work. Clever prompts find interesting model behavior. Real tasks find product risk.
When I test an agent in a codebase, I start with tasks that look almost boring:
- Fix a small failing test.
- Update a config value in the right file.
- Add one field to an existing form.
- Explain why a build failed using the log.
- Open a draft pull request with a clear summary.
These tasks are not glamorous. That is why they work. They expose whether the agent can read the project, avoid unrelated edits, follow the existing style, run the right command, and stop when the result is unclear.
For a customer support agent, the task set might be:
- Find a customer’s latest order.
- Explain a delayed shipment without inventing a date.
- Apply a refund only when policy allows it.
- Escalate a legal threat to a human.
- Refuse a request to reveal another customer’s data.
For a research agent, the task set might be:
- Answer with citations from allowed sources.
- Separate confirmed facts from uncertain claims.
- Detect source conflicts.
- Refuse stale data when freshness is required.
- Produce a short evidence table.
The task set should come from tickets, logs, search queries, support macros, GitHub issues, sales workflows, or compliance checklists. If the agent will see it in production, it belongs in the eval.
Build a scorecard before you run the agent
Do not grade agent output by vibe. Write the scorecard first.
Here is the format I use.
| Score | Meaning | Example |
|---|---|---|
| 5 | Complete success | Correct task, right tool use, no cleanup |
| 4 | Usable with small edits | Minor wording or formatting issue |
| 3 | Partly useful | Found the right direction but missed a requirement |
| 2 | Unsafe or misleading | Wrong action, invented fact, bad file edit |
| 1 | Failed to act | Stopped, looped, or produced unusable output |
| 0 | Harmful | Data leak, destructive action, policy bypass |
Then add failure labels. The labels matter because they tell you what to fix.
Good failure labels:
- Wrong tool
- Missing approval
- Hallucinated source
- Over-broad permission
- Forgot context
- Bad retry loop
- Cost spike
- Slow response
- Unsafe final action
- Poor handoff to human
If 12 of 50 failures are “wrong tool,” changing the model may not help. You may need better tool names, tighter tool descriptions, or fewer tools. If failures are “forgot context,” you may have a memory bug. If failures are “missing approval,” your policy layer is the problem.
That is why labels beat one big score.
Test the whole workflow with traces
OpenAI’s agent evals documentation recommends trace grading for workflow-level errors. That is the right mental model. You grade more than the final answer. You grade the path.
A useful trace shows:
- User request
- System instructions
- Retrieved files or records
- Tool calls and arguments
- Tool outputs
- Memory reads and writes
- Intermediate decisions
- Approval checkpoints
- Final response
- Cost and latency
When a task fails, the trace tells you where. Did the agent misunderstand the request? Did retrieval fetch the wrong document? Did the tool return bad data? Did the agent ignore the tool output? Did the final response hide uncertainty?
Without traces, debugging agents becomes theater. Everyone has an opinion. Nobody has proof.
For coding agents, this is why I like running evals against small local repos with known expected diffs. You can compare the final patch, test output, and command history. For more on that stack, see our breakdown of AI coding agents in 2026 and the OpenAI Agents SDK sandbox update.
Measure tool use like a product feature
Tool use is where agents become useful. It is also where they become dangerous.
Every tool should have its own eval cases:
| Tool type | What to test | Failure to catch |
|---|---|---|
| Search or retrieval | Finds the right source | Cites stale or irrelevant content |
| Database lookup | Queries only allowed fields | Reads private or unrelated data |
| File edit | Changes only intended files | Broad rewrite or style damage |
| Email or message send | Drafts before sending | Sends without approval |
| Payment or refund | Follows policy limits | Issues money incorrectly |
| Shell command | Runs allowed commands only | Deletes, exfiltrates, or installs blindly |
My rule: if a tool can change state, it needs an approval gate in early production. Draft is fine. Execute is different.
This is where MCP-based agent systems need extra care. MCP makes tool access easier, but easy access is not the same as safe access. A good eval should include hostile tool outputs, poisoned documents, misleading filenames, and requests that try to push the agent outside its job.
Add security tests from the start
Security evals should not wait until the end. If the agent has tools, memory, or permissions, security is part of basic quality.
Use the OWASP agentic categories as a starting checklist:
- Can a user hijack the agent’s goal?
- Can a webpage or document inject new instructions?
- Can the agent misuse tools outside its job?
- Can it access data from the wrong user?
- Can memory be poisoned with false instructions?
- Can one agent trick another agent?
- Can repeated failures trigger a runaway loop?
- Can a user over-trust a risky answer?
For a deeper security layer, our article on the Microsoft Agent Governance Toolkit walks through runtime policy, identity, sandboxing, and OWASP-style controls. You do not need Microsoft’s stack to borrow the testing idea: treat every agent action as something that needs a policy decision.
The best security evals are small and nasty. “Ignore previous instructions” is too easy. Real attacks look more like a support ticket, a PDF, a GitHub issue, or an email that includes instructions the agent should not obey.
Test memory and context drift
Memory failures are sneaky because the agent can sound confident while carrying the wrong history.
You need tests for three things:
- What the agent remembers.
- What it forgets.
- What it does when memory conflicts with current evidence.
Here is a simple memory eval:
- Tell the agent a project rule: “Use the existing API client, never raw fetch.”
- Let the session go idle or start a new task.
- Ask it to add a feature that needs an API call.
- Check whether it uses the API client.
- Then update the rule and see if it follows the newer instruction.
That catches stale memory and instruction priority problems. It also catches the type of context bug Anthropic described in the Claude Code postmortem, where older reasoning was cleared incorrectly after idle sessions.
For multi-agent systems, memory gets harder. One agent may summarize work for another. A planner may hand off to a coding agent. A verifier may only see the final diff. Our CrewAI vs LangGraph vs MCP comparison covers those patterns, but the eval rule is simple: grade the handoff, not just the final answer.
Track cost, latency, and retry behavior
An agent that works but costs too much is still broken.
Track these numbers per task:
- Total tokens
- Tool calls
- Wall-clock time
- Number of retries
- Number of failed tool calls
- Human interventions
- Final cost
Then set release limits. For example:
| Metric | Starter limit |
|---|---|
| Task success rate | 85 percent or higher on low-risk tasks |
| Unsafe action rate | 0 percent |
| Human approval bypass | 0 cases |
| p95 latency | Under your user tolerance |
| Cost per completed task | Lower than human-assisted cost |
| Retry loop length | Hard stop after 2 or 3 retries |
The exact numbers depend on the job. I would accept lower task success for a research assistant that drafts notes. I would demand near-perfect approval behavior for a finance or infrastructure agent.
The mistake is pretending cost is separate from quality. Long loops, repeated tool calls, and unclear stopping rules are quality problems wearing a billing disguise.
Use canary tasks for every model or prompt change
Every production agent needs canary tasks. These are small, fixed tasks you run before changing the model, system prompt, tools, memory settings, or permissions.
A good canary set has:
- 10 easy tasks that should never fail
- 10 normal tasks from real usage
- 10 hard tasks that expose known weak spots
- 5 hostile tasks for security checks
- 5 refusal or escalation tasks
Run the same canaries every time. Keep the traces. Compare against the last known good run.
This is the boring habit that would have caught many agent regressions sooner. Anthropic’s April postmortem says they are adding broader eval coverage and tighter prompt-change controls after the Claude Code issue. Smaller teams should steal that lesson. You do not need a lab-sized eval team. You need a short list of tasks that must not get worse.
The OpenAI Evals repository is useful if you want a structured starting point for model and workflow evals. But do not wait for a perfect eval platform. A spreadsheet plus saved traces is better than arguing in Slack about whether the agent “feels worse.”
Run shadow mode before real action
Shadow mode means the agent does the work, but a human or existing system still makes the real decision.
For example:
- The refund agent recommends a refund but does not issue it.
- The coding agent opens a draft patch but does not request review.
- The sales agent drafts an enrichment note but does not update the CRM.
- The infrastructure agent suggests a fix but does not run the command.
Shadow mode gives you production-like inputs without production-level blast radius. It also shows whether humans trust the agent too much or too little.
Track override rate during shadow mode. If humans rewrite every output, the agent is not ready. If humans approve everything without reading, your process is not ready.
The release checklist I would use
Before production, I want the answer to every question below.
| Question | Why it matters |
|---|---|
| What exact job is the agent allowed to do? | Broad agents fail in broad ways |
| What tools can it call? | Tool access defines real-world impact |
| What data can it read? | Privacy failures often start with over-broad reads |
| What data can it write? | Writes need stricter approval than reads |
| What must a human approve? | Risky actions need a gate |
| What does the agent do when unsure? | Safe agents stop cleanly |
| How are traces stored? | Debugging and audits need evidence |
| What is the rollback path? | Bad actions need recovery |
| What are the canary tasks? | Updates need regression checks |
| Who owns the agent after launch? | Unowned agents drift |
That last line matters. Agents need owners. Someone has to review failures, refresh evals, rotate secrets, update policies, and decide when the agent should be retired.
This is where AI governance frameworks stop being abstract. A production agent needs the same boring ownership model as any other system that can affect customers or money.
A practical 30-day plan
If you are starting from zero, do this over one month.
Week 1: define the job and collect tasks
Pick one agent and one narrow job. Pull 30 to 50 real examples from your backlog, tickets, calls, logs, or docs. Write expected outcomes for each. Do not automate anything yet.
Week 2: run baseline evals
Run the agent on every task. Save traces. Grade outputs using the 0 to 5 scorecard. Label each failure. Fix only the obvious workflow issues: missing tools, bad instructions, weak retrieval, and unclear approvals.
Week 3: add security and cost gates
Add hostile inputs, prompt-injected documents, permission tests, retry limits, and cost tracking. Decide what the agent can draft, what it can execute, and what needs human approval.
Week 4: shadow real work
Run the agent beside the current workflow. Humans still make final decisions. Track success rate, override rate, unsafe attempts, cost, latency, and user feedback.
At the end of 30 days, you should know one of three things:
- The agent is ready for a narrow production launch.
- The agent is useful but needs a human-in-the-loop workflow.
- The agent is not worth shipping yet.
All three are good outcomes. The bad outcome is shipping because the demo looked great.
Where this fits with compliance
If your agent touches hiring, credit, healthcare, education, critical infrastructure, insurance, or workplace monitoring, evaluation is not just an engineering practice. It becomes compliance evidence.
The EU AI Act August 2026 checklist is a good example. High-risk systems need documentation, risk management, human oversight, accuracy, logging, and cybersecurity controls. Agent evals can feed that evidence, especially if you store task traces, failure rates, approval logs, and policy decisions.
Do not oversell this. An eval suite does not make you compliant by itself. But without evals, compliance becomes paperwork detached from reality.
My recommendation
Start with a narrow agent and a harsh eval. That beats a flashy agent that can do everything badly.
For most teams, the first production-worthy agent should have:
- One job
- Five or fewer tools
- Read-only access by default
- Human approval for writes
- A 50-task eval set
- Canary tests before every update
- Full traces for every run
- A named owner
That may sound conservative. Good. Production agents should earn trust slowly.
The long-term winners will not be the teams with the most agents. They will be the teams that know which agents are reliable, which ones are risky, and which ones should never have been shipped.