Engineering

Building Reliable Agents: Lessons from Production

Deploy enough agent systems to production and the same failure modes appear again and again. Here's what actually breaks — and how to fix it before it happens.

Marcus Rivera

Marcus is a senior AI engineer who has shipped agent systems at three YC-backed companies.

Apr 7, 2026 · 11 min read

I've shipped agent systems to production three times at two different companies. Each time, the same categories of problems surfaced — just wearing different clothes. Here's what I wish someone had told me before my first deployment.

The Four Horsemen of Agent Failures

1. Context Window Overflow

The most common production failure for agents handling real-world tasks. You test with a simple input and it works perfectly. You deploy. A user submits a 50-page PDF. The agent tries to stuff it all into context, hits the limit mid-reasoning, and returns garbage.

Fix: Never let raw input touch the model unprocessed. Always have a preprocessing layer that chunks, summarizes, or extracts relevant sections. Treat context as a scarce resource, not a buffer.
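A minimal sketch of that preprocessing layer in Python — a character-based chunker with overlap so no single prompt exceeds the context budget. The function name and limits are illustrative, not from the post; real systems often chunk on token counts or semantic boundaries instead:

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split raw input into overlapping chunks so no single
    prompt exceeds the context budget."""
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Overlap keeps sentences that straddle a boundary visible
        # in both neighboring chunks.
        start += max_chars - overlap
    return chunks
```

Each chunk can then be summarized or filtered for relevance before anything reaches the model.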

2. Tool Call Loops

An agent calls a tool. The tool returns an error. The agent tries to fix it by calling the tool again. The same error. Repeat until you've burned 50K tokens and $4 in API calls.

Fix: Implement a max-retry counter per tool call (I use 3). On the third failure, return a structured error to the agent and let it decide whether to try a different approach or escalate to a human.
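The retry counter might look like this in Python — a wrapper that returns a structured error dict after the final failure instead of raising, so the agent can read it and change course (the names are illustrative):

```python
MAX_RETRIES = 3  # matches the per-tool limit discussed above

def call_tool_with_retries(tool_fn, args: dict, max_retries: int = MAX_RETRIES) -> dict:
    """Invoke a tool; after max_retries failures, hand the agent a
    structured error instead of looping forever."""
    last_error = None
    for _ in range(max_retries):
        try:
            return {"status": "ok", "result": tool_fn(**args)}
        except Exception as exc:
            last_error = str(exc)
    return {
        "status": "error",
        "error": last_error,
        "attempts": max_retries,
        "hint": "Tool failed repeatedly; try a different approach or escalate.",
    }
```

The `hint` field is what lets the agent break the loop: it sees an explicit signal to stop hammering the same tool.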

3. Hallucinated Tool Parameters

The agent is given a tool signature. It calls the tool with a plausible-looking but incorrect parameter — a field that doesn't exist, a value in the wrong format, a required field missing. Your tool throws an exception. The agent gets confused.

Fix: Validate all tool inputs before executing. Return schema validation errors as structured tool results (not exceptions). The agent can then correct and retry with good information.

4. Silent Degradation

The agent completes, returns a result, no errors logged. But the result is wrong — subtly, in a way that's hard to catch automatically. The user gets bad output. You find out via a support ticket three days later.

Fix: This is the hardest one. The only real solution is automated output evaluation. Run a lightweight evaluator model (or rule-based checks) against every output before it reaches the user. Define what "good" looks like for your task, and check for it.
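A rule-based check of the kind described might start as simply as this (the phrase list and length budget are hypothetical examples; an evaluator-model call would slot in alongside these rules):

```python
def evaluate_output(output: str, required_phrases: list[str], max_len: int = 5000) -> dict:
    """Lightweight pre-delivery check: block empty, oversized, or
    content-missing outputs before the user sees them."""
    issues = []
    if not output.strip():
        issues.append("empty output")
    if len(output) > max_len:
        issues.append("output exceeds length budget")
    for phrase in required_phrases:
        if phrase.lower() not in output.lower():
            issues.append(f"missing expected content: '{phrase}'")
    return {"passed": not issues, "issues": issues}
```

Outputs that fail the check can be regenerated or routed to a human instead of shipped.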

Observability First, Always

Every team that runs agents in production successfully has the same superpower: they know exactly what happened in every run.

The minimum viable observability stack:

Per-run: run_id, start_time, end_time, total_tokens, total_cost, status

Per-step: step_number, agent_name, tool_calls (with inputs/outputs), latency

Errors: exception type, stack trace, the exact input that caused it

If you can't answer "why did this run fail?" by looking at logs within 30 seconds, you don't have enough observability.

Tools worth using: LangSmith (best DX), Langfuse (best self-hosted), Helicone (cheapest for token tracking).

The Timeout Problem

Production agents need timeouts at every level:

  • Per tool call: 30 seconds max (web scraping, API calls)
  • Per agent step: 120 seconds max
  • Per full run: 10 minutes max for most tasks

I've seen production systems brought down because one web search tool hung indefinitely waiting for a server that never responded. Every external call must have a timeout. Non-negotiable.
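A thread-based timeout wrapper is one way to enforce this in Python. Note the caveat in the comments: a hung worker thread cannot be force-killed, so production systems often prefer process isolation or async cancellation (the constants mirror the limits listed above):

```python
import concurrent.futures
import time

TOOL_TIMEOUT_S = 30   # per tool call
STEP_TIMEOUT_S = 120  # per agent step

def call_with_timeout(fn, timeout_s: float, *args, **kwargs) -> dict:
    """Run fn in a worker thread; return a structured timeout error
    instead of hanging the whole run."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return {"status": "ok", "result": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        # Caveat: the worker thread keeps running in the background;
        # threads can't be killed. For truly hostile hangs, run the
        # tool in a subprocess you can terminate.
        return {"status": "error", "error": f"timed out after {timeout_s}s"}
    finally:
        pool.shutdown(wait=False)
```

Usage: `call_with_timeout(scrape_page, TOOL_TIMEOUT_S, url)` — the run continues even if the scraper never returns.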

State Management

The question I get asked most: "How do you persist agent state across failures?"

Short answer: checkpoint aggressively. After every meaningful step, write the current state to durable storage (Redis, Postgres, Cosmos DB — doesn't matter). If the agent process dies, it can resume from the last checkpoint instead of starting over.

The pattern that works best for most teams:

{
  run_id: string,
  status: "running" | "completed" | "failed" | "paused",
  checkpoint: {
    step: number,
    accumulated_results: any[],
    context_so_far: string
  },
  created_at: timestamp,
  updated_at: timestamp
}

Write this to your DB after every step. If the agent crashes, a recovery process reads the stored step number and continues from there.

Human-in-the-Loop Patterns

Not every agent action should be fully autonomous. The pattern I use:

Confidence scoring: Before executing any irreversible action (sending an email, making a payment, deleting data), have the agent score its confidence 1-10. If below 7, pause and ask for human approval.
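That gate reduces to a few lines of Python. The action names and the threshold of 7 follow the text; how the agent produces the 1-10 score (a dedicated prompt, a logprob heuristic) is left open here:

```python
APPROVAL_THRESHOLD = 7
IRREVERSIBLE_ACTIONS = {"send_email", "make_payment", "delete_data"}

def gate_action(action: str, confidence: int) -> str:
    """Route an action to autonomous execution or human approval.
    confidence is the agent's self-reported 1-10 score."""
    if action in IRREVERSIBLE_ACTIONS and confidence < APPROVAL_THRESHOLD:
        return "needs_approval"
    return "execute"
```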

Dry run mode: For new deployments, run in dry-run mode for the first week. The agent plans actions but doesn't execute them. A human reviews the plan. This surfaces edge cases before they cause damage.

Audit trails: Every action an agent takes should be logged with: who triggered it, what the agent decided, why (the reasoning), and what the outcome was. This is essential for debugging and for trust-building with stakeholders.
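One lightweight way to capture those four fields is an append-only JSON line per action — easy to grep and to ship to any log store (the field names are my mapping of the list above):

```python
import json
import time

def audit_entry(triggered_by: str, decision: str,
                reasoning: str, outcome: str) -> str:
    """One audit line per agent action: who triggered it, what the
    agent decided, why, and what the outcome was."""
    return json.dumps({
        "ts": time.time(),
        "triggered_by": triggered_by,
        "decision": decision,
        "reasoning": reasoning,
        "outcome": outcome,
    })
```

Appending each line to a file or log stream gives stakeholders a complete, replayable record of what the agent did and why.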

The Reliability Checklist

Before any agent goes to production:

  • [ ] Input validation with clear error messages
  • [ ] Max retry limits on all tool calls
  • [ ] Timeouts at every level (tool, step, run)
  • [ ] Checkpoint storage after each meaningful step
  • [ ] Output evaluation before delivery
  • [ ] Observability stack with run/step/error logging
  • [ ] Rate limiting to prevent runaway costs
  • [ ] Human escalation path for low-confidence decisions
  • [ ] Load testing with adversarial inputs (long docs, edge cases, malformed data)
  • [ ] Runbook for common failure modes

None of this is glamorous. But it's the difference between an agent demo and an agent product.

production · reliability · engineering · best-practices

Copyright © 2026 Agent Mag — All rights reserved