How I Build Production AI Agents (Not Demos)
Most AI agent demos work great in a notebook. They fail in production because the same shortcuts that make demos fast — skipping validation, ignoring cost, assuming tools always succeed — are the exact things production punishes.
Here is how I approach every agent I ship.
The failure modes that kill agents in production
Before designing anything, I map the ways the system can go wrong:
- Tool failure — an external API is down, rate-limited, or returns unexpected data
- Cost runaway — a loop adds tokens on every step; a $0.10 request becomes $80
- Hallucinated tool calls — the model invents arguments or calls tools that don't exist
- Context explosion — conversation history grows until you hit the context window
- Silent wrong answers — the agent confidently returns plausible but incorrect output
Every design decision I make targets one of these.
Tool design: minimal, typed, and idempotent
Each tool should do exactly one thing. A tool called search_knowledge_base should search. Not search, then summarize, then format. Compound tools are harder to validate and easier to hallucinate.
Every tool gets:
- A typed input schema — validated with Pydantic or Zod before the model sees it
- Idempotency — calling it twice with the same input is safe
- A defined error contract — tools return structured errors, not exceptions that bubble into the agent loop
```python
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(..., min_length=3, max_length=500)
    top_k: int = Field(default=5, ge=1, le=20)

@tool
def search_knowledge_base(input: SearchInput) -> SearchResult:
    """Search the knowledge base. Returns up to top_k relevant chunks."""
    try:
        results = vector_db.query(input.query, k=input.top_k)
        return SearchResult(chunks=results, query=input.query)
    except VectorDBError as e:
        # Structured error, not an exception bubbling into the agent loop
        return SearchResult(chunks=[], error=str(e))
```
The model sees the schema, not the implementation. Good schema descriptions cut hallucinated arguments by a large margin.
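To make that concrete: with Pydantic, field descriptions flow directly into the JSON schema the model is shown, so they are the cheapest place to steer argument quality. A minimal sketch (the description strings here are illustrative, not from the original tool):

```python
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    """Search the knowledge base. Returns up to top_k relevant chunks."""
    query: str = Field(..., min_length=3, max_length=500,
                       description="Natural-language search query, not a keyword list.")
    top_k: int = Field(default=5, ge=1, le=20,
                       description="Number of chunks to return.")

# The description strings appear verbatim in the schema the model receives.
schema = SearchInput.model_json_schema()
print(schema["properties"]["query"]["description"])
# → Natural-language search query, not a keyword list.
```

The constraints (min_length, ge, le) also land in the schema, so the model sees the valid ranges before it ever calls the tool.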
Cost control: caps, not hope
Every agent I build has explicit cost caps at three levels:
Per-step cap: max tokens per LLM call. Set via max_tokens on the model call — not as a prompt instruction the model can ignore.
Per-run cap: max number of iterations. In LangGraph, this is recursion_limit. In LangChain, it is max_iterations. Set it to something that makes sense for the task, not a large default.
Per-user/per-day cap: tracked in Redis. Each agent run records its token usage. If a user hits their budget, the run is declined before it starts — not halfway through.
```python
def check_budget(user_id: str, estimated_tokens: int) -> bool:
    key = f"budget:{user_id}:{today()}"
    current = redis.get(key) or 0
    if int(current) + estimated_tokens > DAILY_TOKEN_LIMIT:
        return False
    # Note: get-then-incrby is not atomic; under heavy concurrency,
    # move the check into a Lua script or use INCRBY's return value.
    redis.incrby(key, estimated_tokens)
    redis.expire(key, 86400)  # the key is date-scoped; TTL is just cleanup
    return True
```
Cost surprises kill trust. Hard caps prevent them.
Guardrails: validate the output, not just the input
Input validation catches bad tool calls. Output validation catches bad answers.
For every agent I build, I define what a valid output looks like — as a schema, not prose. Then I validate it.
For structured outputs, this is straightforward: use with_structured_output and a Pydantic model. For text outputs, I validate against a set of rules: minimum length, absence of certain patterns (model apologies, hedging phrases that signal the model is guessing), presence of required fields.
If output validation fails, I retry once with a corrective prompt. If it fails again, I return a structured error rather than a bad answer.
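A minimal sketch of that retry-once flow. The Answer contract and the injected call_model function are illustrative assumptions, not a fixed API:

```python
from pydantic import BaseModel, Field, ValidationError

class Answer(BaseModel):
    # Hypothetical output contract: what "valid" means for this agent
    summary: str = Field(..., min_length=50)
    sources: list[str] = Field(..., min_length=1)

CORRECTIVE_PROMPT = (
    "Your previous answer failed validation: {error}. "
    "Respond again as JSON matching the Answer schema."
)

def validated_answer(call_model, prompt: str):
    """Validate model output; retry once with a corrective prompt, then fail loudly."""
    raw = call_model(prompt)
    for attempt in range(2):
        try:
            return Answer.model_validate_json(raw)
        except ValidationError as e:
            if attempt == 0:
                raw = call_model(CORRECTIVE_PROMPT.format(error=e.errors()[0]["msg"]))
    # Structured error instead of a plausible-but-bad answer
    return {"error": "output_validation_failed"}
```

The corrective prompt quotes the concrete validation error, which gives the model something specific to fix rather than a generic "try again".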
Observability: log everything
Every tool call gets a log entry: timestamp, tool name, input, output, latency, token count, cost estimate. I store these in a runs table with a thread_id.
```python
import time
from contextlib import contextmanager

@contextmanager
def trace_tool_call(tool_name: str, run_id: str):
    start = time.monotonic()
    try:
        yield
    finally:
        latency_ms = (time.monotonic() - start) * 1000
        db.insert("tool_calls", {
            "run_id": run_id,
            "tool": tool_name,
            "latency_ms": latency_ms,
            "timestamp": utcnow(),
        })
```
When something breaks in production, this is the difference between a 20-minute debug session and a 3-day investigation.
Failure handling: graceful, not silent
Agents fail. The question is whether they fail gracefully.
My rule: never let a tool exception propagate into the agent loop unhandled. Exceptions become structured error objects that the model can reason about — "the search returned an error: rate limited. Try again in 30 seconds." — rather than stack traces that crash the run.
For retriable failures (rate limits, transient network errors), I wrap tools with exponential backoff. For non-retriable failures (bad credentials, invalid input), I return immediately with a clear error.
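The backoff wrapper can be sketched as follows. RetriableError is a hypothetical marker class for transient failures; non-retriable exceptions pass straight through:

```python
import random
import time

class RetriableError(Exception):
    """Transient failure: rate limit, network blip (hypothetical marker class)."""

def with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry retriable failures with exponential backoff and jitter."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except RetriableError:
                if attempt == max_attempts - 1:
                    raise
                # 0.5s, 1s, 2s, ... plus jitter to avoid thundering herds
                time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    return wrapped
```

Because only RetriableError is caught, a bad-credentials or invalid-input exception fails immediately instead of burning four attempts.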
For the overall run, I set a timeout. If an agent run takes longer than its SLA, it is cancelled and the user gets a partial result with a clear status — not a hanging request.
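A minimal asyncio sketch of the SLA timeout, assuming the agent run is a coroutine; the status shape returned here is an illustration, not a fixed contract:

```python
import asyncio

async def run_with_sla(agent_coro, sla_seconds: float):
    """Run an agent coroutine; cancel it at the SLA and report a clear status."""
    try:
        result = await asyncio.wait_for(agent_coro, timeout=sla_seconds)
        return {"status": "complete", "result": result}
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task; the caller gets a status,
        # not a hanging request
        return {"status": "timeout", "result": None}
```

Partial results would come from state the agent checkpointed before the deadline (e.g. the last completed step), which is another reason to persist per-step state rather than hold it in memory.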
This is a set of patterns, not a checklist. The right approach depends on the use case. But the decision to take cost control, observability, and failure handling seriously — rather than treating them as polish to add later — is what separates agents you can ship from agents you demo once.
If you are building a production AI agent and want someone who has shipped these patterns in real systems, reach out on Upwork.
Waqas Raza
AI-Native Full-Stack Engineer. Top Rated on Upwork · $180K+ earned · 93% job success. I build production AI agents, LLM systems, Web3 platforms, and full-stack applications.
Hire me on Upwork