The SRE world has had blameless post-mortems figured out for over a decade. The process is well-understood: describe the incident, build a timeline, identify the root cause, assign action items. For traditional infrastructure failures, this works extremely well.
AI agent incidents are different. The failure modes are different. The data you need is different. The action items are different. And critically, the observability requirements are different. Here's how to adapt the post-mortem process for the agent era.
What's different about AI agent incidents
Traditional infrastructure failures are usually deterministic: given the same inputs and the same state, the system fails the same way every time. You can reproduce the bug, write a test, fix it, verify the fix.
AI agent failures are probabilistic and stateful. The same input might succeed 99% of the time and fail the other 1%, and the runs in that 1% may each fail for a different reason. The agent's decision-making is influenced by LLM outputs that are not deterministic. The state accumulated across a long run affects every subsequent decision.
This makes post-mortems harder in one way (reproduction is difficult) and potentially more important in another: every real incident is a window into the actual failure distribution that you can't easily synthesize in testing.
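Because the failure is probabilistic, a useful first diagnostic step is to quantify it. Here is a minimal sketch, assuming a hypothetical run_agent callable that executes one full agent run and reports success or a failure reason:

```python
import collections

def estimate_failure_rate(run_agent, incident_input, trials=50):
    """Replay the same input repeatedly to estimate how often it fails.

    `run_agent` is a hypothetical callable that executes one full agent run
    and returns an object with `.succeeded` and `.failure_reason` attributes.
    """
    failures = collections.Counter()
    for _ in range(trials):
        result = run_agent(incident_input)
        if not result.succeeded:
            # Count each distinct failure reason separately; the same input
            # can fail in different ways on different runs.
            failures[result.failure_reason] += 1

    rate = sum(failures.values()) / trials
    return rate, failures
```

Even at 50 trials, a failure that occurs 1% of the time may not reproduce at all, which is part of why captured production runs are so valuable.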
The AI agent incident timeline
A good post-mortem starts with a detailed timeline. For AI agent incidents, that timeline needs to include the following (a minimal schema sketch comes after the list):
- The trigger: what input or event started the failing run?
- The decision tree: which branches did the agent take, and why?
- The divergence point: at which step did behavior deviate from expected?
- The cascading effects: what did the agent do after the initial failure point?
- External state changes: what changed in the external systems the agent interacted with?
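One way to make that timeline concrete is to record each step of the run as a structured event. The sketch below is a hypothetical schema, not any specific product's format; the field names (kind, expected, details) are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TimelineEvent:
    """One step in an agent incident timeline (hypothetical schema)."""
    timestamp: datetime
    step: int                     # position in the run
    kind: str                     # "trigger", "llm_decision", "tool_call", "external_change"
    summary: str                  # what happened, in one line
    expected: bool = True         # False marks the divergence point and everything after it
    details: dict = field(default_factory=dict)  # raw prompt, tool args, error payloads, etc.

@dataclass
class IncidentTimeline:
    incident_id: str
    events: list[TimelineEvent] = field(default_factory=list)

    def divergence_point(self) -> TimelineEvent | None:
        """Return the first event that deviated from expected behavior."""
        return next((e for e in self.events if not e.expected), None)
```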
Without run replay capability, building this timeline requires piecing together fragments from multiple sources — API logs, database queries, email sends. It's slow, incomplete, and often impossible. With replay, you get the complete timeline directly.
Root cause categories for AI agent failures
AI agent failures tend to cluster into a few categories. Having a shared vocabulary for these makes post-mortems faster and more actionable; a sketch of how that vocabulary might be encoded follows the list:
- Tool failure: a downstream tool call failed (API error, auth error, timeout). The agent may have handled this well or poorly.
- Prompt drift: the LLM's response to a prompt changed in a way that caused the agent to take an unexpected path. Often triggered by model updates.
- State corruption: the agent accumulated bad information early in a run that caused all subsequent decisions to be wrong.
- Environmental change: something changed in the external environment (credential rotation, API schema change, rate limit change) that the agent wasn't designed to handle.
- Edge case input: a combination of input values that was never tested and exposes an assumption in the agent's logic.
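A shared vocabulary sticks best when it lives in code rather than in a wiki page. The enum below is a hypothetical encoding of the five categories above; the tag_incident helper and its field names are illustrative assumptions:

```python
from enum import Enum

class AgentFailureType(Enum):
    """Shared vocabulary for classifying AI agent incidents."""
    TOOL_FAILURE = "tool_failure"                   # downstream API/auth/timeout error
    PROMPT_DRIFT = "prompt_drift"                   # LLM response changed, often after a model update
    STATE_CORRUPTION = "state_corruption"           # bad info accumulated early poisoned later decisions
    ENVIRONMENTAL_CHANGE = "environmental_change"   # credentials, schemas, or rate limits changed underneath
    EDGE_CASE_INPUT = "edge_case_input"             # untested input combination exposed an assumption

def tag_incident(postmortem: dict, failure_type: AgentFailureType) -> dict:
    """Attach the classification so past incidents stay searchable by category."""
    postmortem["failure_type"] = failure_type.value
    return postmortem
```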
A 5-step post-mortem process for AI agents
1. Pull the failing run. Before any discussion, get the complete replay of the failing run in front of the team. Walk through it step by step. Agree on the facts before you discuss causes.
2. Find the divergence point. Compare the failing run to a successful run with similar inputs. Where did they first diverge? This is usually where you'll find the root cause (see the comparison sketch after these steps).
3. Classify the failure type. Use the categories above. This determines what kind of fix is appropriate.
4. Write the blameless narrative. Document what happened, what the agent was trying to do, what went wrong, and what the impact was. Keep it factual and systems-focused.
5. Define action items. For each root cause, define a specific, measurable action. "Improve the agent" is not an action item. "Add error handling for SMTP_AUTH_INVALID in the send_email tool with a credential refresh fallback" is an action item.
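Step 2 lends itself to light automation when runs are stored as ordered step records. The sketch below assumes each step is a plain dict with action and args keys, a simplified stand-in for whatever your run replay actually captures:

```python
def find_divergence(failing_steps, successful_steps):
    """Return the index of the first step where a failing run diverges
    from a successful run with similar inputs.

    Each step is assumed to be a dict with "action" and "args" keys.
    """
    for i, (bad, good) in enumerate(zip(failing_steps, successful_steps)):
        if bad["action"] != good["action"] or bad["args"] != good["args"]:
            return i
    # The runs matched as far as they overlapped; the shorter one ended early.
    return min(len(failing_steps), len(successful_steps))
```

In practice, arguments rarely match exactly across runs (timestamps, generated IDs), so you would typically normalize them or compare only the action sequence.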
The most valuable post-mortem artifact for an AI agent incident is the run replay. Everything else is interpretation.
Making post-mortems faster over time
The best post-mortem process gets faster over time as you build a library of resolved incidents to compare against. When a new failure looks like a past failure, you can often identify the root cause in minutes rather than hours. This only works if you have searchable, structured run history — which is exactly what Agent Basin provides.
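Whatever tooling you use, the underlying operation is a query over structured, tagged incident records. The sketch below uses hypothetical field names (failure_type, tools_involved, resolved_at) and is not any specific product's API:

```python
def find_similar_incidents(history, failure_type, tool_name=None, limit=5):
    """Search past resolved incidents for ones that resemble a new failure.

    `history` is assumed to be a list of post-mortem records (dicts) with
    "failure_type", "tools_involved", and "resolved_at" fields; these names
    are illustrative assumptions.
    """
    matches = [
        record for record in history
        if record["failure_type"] == failure_type
        and (tool_name is None or tool_name in record.get("tools_involved", []))
    ]
    # Most recently resolved first (ISO timestamps assumed), so the freshest
    # fix is the first suggestion.
    matches.sort(key=lambda r: r.get("resolved_at", ""), reverse=True)
    return matches[:limit]
```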
Get the run replay capability your post-mortem process needs.
Connect your first agent — free