The Prompt Engineering Playbook for Agent Developers
Prompting a single LLM and prompting an agent system require completely different mental models. Here's the playbook we've developed from dozens of production deployments.
Sarah Chen
Sarah is an AI researcher who consults on agent deployments for enterprise teams.
The first thing most developers learn when building agent systems: prompting techniques from single-turn LLM work mostly don't transfer. The failure modes are different. The stakes are higher. And a prompt that works in the playground will fail in ways you didn't anticipate once it's inside a running agent loop.
Here's the playbook we've developed from shipping and debugging agent systems across dozens of production deployments.
Mental Model Shift: You're Writing a Job Description
When you write a system prompt for an agent, you're not writing a prompt — you're writing a job description for a new employee who has never worked with you before, is slightly literal, and will work exactly as instructed even if the instructions are subtly wrong.
This framing changes how you think about prompts:
- Be explicit about what the agent CAN'T do, not just what it can
- Define failure modes: what should the agent do when it's stuck, when a tool fails, when the input is ambiguous?
- Specify output format precisely: don't say "format the response clearly" — show an example
- Define escalation: when should the agent stop and ask for human input vs. make its best guess?
The Four-Part Agent System Prompt
Every agent system prompt I write has four sections:
1. Identity & Role
You are a research agent for [Company]. Your job is to [specific task].
You have access to [tool list]. You DO NOT have access to [out-of-scope tools].
You CANNOT take the following actions: [list irreversible/dangerous actions].
Be explicit about negative constraints. An agent with undefined boundaries will discover them through failure.
2. Process
For every task:
1. Identify what information you need
2. Gather that information using available tools
3. If you encounter an error, [specific recovery behavior]
4. When you have enough information, synthesize and format your output
5. Before returning, verify your output matches the required format
Agents perform significantly better when given an explicit process to follow. The model doesn't default to a good process — you have to define it.
3. Output Format
Always include a concrete example of the expected output. Not a description of it. An example.
Your output must follow this exact format:
SUMMARY: [2-3 sentence summary]
KEY_FINDINGS:
- [finding 1]
- [finding 2]
CONFIDENCE: [HIGH/MEDIUM/LOW]
SOURCES: [list of sources]
If the format isn't exact, parsing breaks downstream. Downstream breaks are the hardest agent bugs to trace.
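To make that concrete, here is a minimal sketch of a downstream parser for the format above. The field names come from the example; the parsing approach (strict regexes that raise on any deviation) is illustrative, not from the original:

```python
import re

def parse_agent_output(text: str) -> dict:
    """Parse the SUMMARY / KEY_FINDINGS / CONFIDENCE / SOURCES format.

    Raises ValueError on any deviation, so format drift is caught at the
    boundary instead of corrupting downstream steps.
    """
    summary = re.search(r"^SUMMARY:\s*(.+?)\s*^KEY_FINDINGS:", text, re.M | re.S)
    findings = re.search(r"^KEY_FINDINGS:\s*(.*?)\s*^CONFIDENCE:", text, re.M | re.S)
    confidence = re.search(r"^CONFIDENCE:\s*(HIGH|MEDIUM|LOW)\s*$", text, re.M)
    sources = re.search(r"^SOURCES:\s*(.+)$", text, re.M | re.S)
    if not (summary and findings and confidence and sources):
        raise ValueError("output does not match the required format")
    return {
        "summary": summary.group(1).strip(),
        "key_findings": [
            line.strip()[2:].strip()
            for line in findings.group(1).splitlines()
            if line.strip().startswith("- ")
        ],
        "confidence": confidence.group(1),
        "sources": sources.group(1).strip(),
    }
```

Failing loudly at the parse step is the point: a ValueError at the boundary is easy to trace; a silently mis-parsed field is not.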
4. Edge Cases & Escalation
If you cannot complete the task because:
- The required information is not available → Return RESULT: INSUFFICIENT_DATA with explanation
- A tool returns an error twice → Return RESULT: TOOL_FAILURE with the error message
- The task is outside your scope → Return RESULT: OUT_OF_SCOPE with explanation
Do not guess when you don't have information. Acknowledge uncertainty explicitly.
This section is the difference between agents that fail gracefully and agents that fail silently.
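One way to route these sentinel results in the calling code. The `RESULT:` codes come from the prompt section above; the handler action names are hypothetical placeholders for whatever your orchestrator does:

```python
def handle_agent_result(output: str) -> str:
    """Route the sentinel results defined in the edge-case section.

    Returns an action name for the orchestrator: 'deliver' for normal
    output, or an explicit failure action so nothing fails silently.
    """
    stripped = output.strip()
    if not stripped:
        return "empty_output_error"
    first_line = stripped.splitlines()[0]
    # Hypothetical escalation paths -- substitute your own actions.
    if first_line.startswith("RESULT: INSUFFICIENT_DATA"):
        return "ask_user_for_more_context"
    if first_line.startswith("RESULT: TOOL_FAILURE"):
        return "alert_oncall_and_retry_later"
    if first_line.startswith("RESULT: OUT_OF_SCOPE"):
        return "route_to_human"
    return "deliver"
```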
Tool Descriptions Are Prompts
Every tool your agent can call has a description. That description is a prompt. Most developers write it carelessly.
Bad tool description:
search_web: Search the internet
Good tool description:
search_web: Search the web for current information. Best for: recent news,
current events, real-time data. NOT for: information from before 2020
(use knowledge_base instead), proprietary company data (use internal_docs instead).
Returns: top 5 results with titles, URLs, and 2-sentence summaries.
Input: query (string, max 150 characters, be specific)
The extra 4 lines reduce incorrect tool selection by ~40% in our testing.
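In function-calling APIs, that description usually lives in a JSON schema. A minimal sketch of what the good version might look like, assuming an OpenAI-style function schema (the shape is a common convention; the wording is taken from the example above):

```python
# Tool definition as a function-calling schema. The description does the
# prompting work: when to use the tool, when NOT to, and what it returns.
search_web_tool = {
    "name": "search_web",
    "description": (
        "Search the web for current information. Best for: recent news, "
        "current events, real-time data. NOT for: information from before "
        "2020 (use knowledge_base instead), proprietary company data "
        "(use internal_docs instead). Returns: top 5 results with titles, "
        "URLs, and 2-sentence summaries."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "maxLength": 150,
                "description": "Search query. Be specific.",
            }
        },
        "required": ["query"],
    },
}
```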
The Three Prompting Mistakes That Kill Production Systems
Mistake 1: Vague Success Criteria
"Do a good job" is not a success criterion. Agents need to know what done looks like.
Bad: "Research the company and provide useful information."
Good: "Research the company and return: (1) founding year and founders, (2) current funding stage and total raised, (3) main product and target customer, (4) top 3 competitors. If any information is not publicly available, say so explicitly."
Mistake 2: No Recovery Path for Tool Failures
If you don't tell the agent what to do when a tool fails, it will either retry infinitely (burning tokens) or return a confusing error. Always define recovery behavior.
For every tool the agent uses, answer: "What should the agent do if this tool fails?"
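Better still, enforce the policy in the harness rather than trusting the model to count retries. A minimal sketch, where the tool and its error types are hypothetical:

```python
import time

def call_tool_with_recovery(tool, args, max_attempts=2, backoff_s=1.0):
    """Call a tool with a bounded retry, matching the prompt's
    'error twice -> TOOL_FAILURE' rule. Never retries infinitely."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return {"ok": True, "value": tool(**args)}
        except Exception as exc:  # in production, catch the tool's error types
            last_error = exc
            if attempt + 1 < max_attempts:
                time.sleep(backoff_s * (attempt + 1))  # linear backoff
    # Surface the failure in the same sentinel format the prompt defines,
    # so downstream handling stays uniform.
    return {"ok": False, "value": f"RESULT: TOOL_FAILURE {last_error}"}
```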
Mistake 3: Conflicting Instructions
In longer system prompts, instructions often contradict each other. "Be concise" in one section, "be thorough" in another. "Always use tools to verify information" but also "trust your knowledge for basic facts."
Before deploying, read your system prompt and explicitly check for conflicts. Agents handle conflicts poorly — they'll pick one instruction and ignore the other, and you won't know which until it matters.
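A crude lint can flag the obvious cases before the human read-through. This is a keyword heuristic, not semantic analysis, and the conflicting-pair list below is a hypothetical starting point, not exhaustive:

```python
# Instruction pairs that commonly contradict each other in long prompts.
CONFLICT_PAIRS = [
    ("be concise", "be thorough"),
    ("always use tools", "trust your knowledge"),
    ("never ask the user", "ask for clarification"),
]

def find_prompt_conflicts(system_prompt: str) -> list[tuple[str, str]]:
    """Return every known-conflicting pair where both phrases appear."""
    text = system_prompt.lower()
    return [(a, b) for a, b in CONFLICT_PAIRS if a in text and b in text]
```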
Testing Your Agent Prompts
The testing framework that's saved me the most time:
Unit tests for prompts: Define 10-20 test cases that cover the main task, edge cases, and known failure modes. Run them before every prompt change.
Adversarial inputs: What happens when the user provides garbage input? An empty string? A 100-page document? Inputs in the wrong language? Test these explicitly.
Regression tests: Once an agent is in production and you've fixed a bug, add the failing case to your test suite. Agent bugs tend to recur in different forms.
Output validation: Every agent output should pass a validation function before it's returned to the user. Schema validation, basic sanity checks (is the output non-empty? does it contain the required fields?), content checks (is the confidence field a valid value?).
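For the output format defined earlier, the validation gate might look like this. The field names come from that example; the specific checks are illustrative:

```python
def validate_agent_output(parsed: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the
    output may be returned to the user."""
    errors = []
    # Schema check: required fields present
    for field in ("summary", "key_findings", "confidence", "sources"):
        if field not in parsed:
            errors.append(f"missing field: {field}")
    # Sanity checks: non-empty summary, at least one finding
    if not parsed.get("summary", "").strip():
        errors.append("summary is empty")
    if not parsed.get("key_findings"):
        errors.append("no key findings")
    # Content check: confidence must be a valid value
    if parsed.get("confidence") not in {"HIGH", "MEDIUM", "LOW"}:
        errors.append("invalid confidence value")
    return errors
```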
The Prompt Iteration Loop
Don't try to write a perfect prompt in one pass. The cadence that works best treats prompt engineering like software development: iterate, test, fix, repeat.
The teams that ship reliable agent systems aren't the ones who write better prompts. They're the ones who test more systematically.