# Production agent architecture

This is the reference architecture for a durable AI agent on Conductor. Not a toy, not a feature list: the exact pattern for an agent that plans, acts, waits, recovers, and runs in production.

## Architecture diagram

## The canonical agent pattern

A production agent has a fixed set of concerns. Each one maps to a specific Conductor primitive:
| Agent concern | Conductor primitive | How it works |
|---|---|---|
| Plan next action | `LLM_CHAT_COMPLETE` | LLM receives goal + context + tool list, returns structured plan |
| Select tool at runtime | `DYNAMIC` task | LLM output determines which task type executes next |
| Execute tool | `CALL_MCP_TOOL`, `HTTP`, or `SIMPLE` worker | Tool runs with retry policy, timeout, and full I/O recording |
| Retry with backoff | Task definition `retryLogic` | `FIXED`, `EXPONENTIAL_BACKOFF`, or `LINEAR_BACKOFF` — no code needed |
| Parallel tool calls | `FORK`/`JOIN` or `DYNAMIC_FORK` | Fan out to N tools in parallel, join when all complete |
| Memory / context handoff | `SET_VARIABLE` + workflow variables | Accumulate results across loop iterations; pass to next LLM call |
| Human approval gate | `HUMAN` task | Durable pause. Survives restarts and deploys. Resumes on API signal. |
| Long wait (hours/days) | `WAIT` task | Timer-based durable pause. Survives server restarts. |
| Resume from external event | `HUMAN` task + webhook/API | External system calls Task Update API. Workflow resumes with payload. |
| Reflection / evaluation loop | `DO_WHILE` with LLM-as-judge | Second LLM evaluates output quality; loop continues if below threshold |
| Budget / iteration cap | `DO_WHILE` `loopCondition` | `iteration < maxIterations` or token/cost check in loop condition |
| Termination criteria | `DO_WHILE` exit + `SWITCH` | LLM sets `done: true`, or evaluator decides goal is met |
| Delegate to specialist | `SUB_WORKFLOW` or `START_WORKFLOW` | Spawn child agent. Parent waits. Failure propagates. Full observability across the tree. |
| Compensation on failure | `failureWorkflow` | Undo side effects: revoke API calls, send notifications, release resources |
| Audit trail | Automatic | Every task's input, output, timing, retry count, and worker ID is persisted |
## End-to-end workflow

Here is the complete agent as a single Conductor workflow. Every step is a native system task or operator — no custom code, no external framework.
```json
{
  "name": "production_agent",
  "description": "Reference architecture: durable production agent",
  "version": 1,
  "schemaVersion": 2,
  "inputParameters": ["goal", "mcpServerUrl", "maxIterations"],
  "tasks": [
    {
      "name": "discover_tools",
      "taskReferenceName": "discover",
      "type": "LIST_MCP_TOOLS",
      "inputParameters": {
        "mcpServer": "${workflow.input.mcpServerUrl}"
      }
    },
    {
      "name": "initialize_memory",
      "taskReferenceName": "init_memory",
      "type": "SET_VARIABLE",
      "inputParameters": {
        "context": [],
        "actions_taken": []
      }
    },
    {
      "name": "agent_loop",
      "taskReferenceName": "loop",
      "type": "DO_WHILE",
      "loopCondition": "if ($.plan['result']['done'] == true) { false; } else if ($.loop['iteration'] >= $.maxIterations) { false; } else { true; }",
      "inputParameters": {
        "maxIterations": "${workflow.input.maxIterations}"
      },
      "loopOver": [
        {
          "name": "plan_next_action",
          "taskReferenceName": "plan",
          "type": "LLM_CHAT_COMPLETE",
          "inputParameters": {
            "llmProvider": "anthropic",
            "model": "claude-sonnet-4-20250514",
            "messages": [
              {
                "role": "system",
                "message": "You are a production AI agent. Goal: ${workflow.input.goal}\n\nAvailable tools: ${discover.output.tools}\n\nPrevious actions and results: ${workflow.variables.context}\n\nDecide the next action. Respond with JSON:\n- To use a tool: {\"action\": \"tool_name\", \"arguments\": {}, \"reasoning\": \"why\", \"needs_approval\": true/false, \"done\": false}\n- To finish: {\"answer\": \"final answer\", \"done\": true}"
              }
            ],
            "temperature": 0.1,
            "maxTokens": 1000
          }
        },
        {
          "name": "check_if_done",
          "taskReferenceName": "done_check",
          "type": "SWITCH",
          "evaluatorType": "javascript",
          "expression": "$.planResult.done ? 'done' : ($.planResult.needs_approval ? 'needs_approval' : 'execute')",
          "inputParameters": {
            "planResult": "${plan.output.result}"
          },
          "decisionCases": {
            "needs_approval": [
              {
                "name": "human_approval",
                "taskReferenceName": "approval",
                "type": "HUMAN",
                "inputParameters": {
                  "plannedAction": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}",
                  "reasoning": "${plan.output.result.reasoning}",
                  "goal": "${workflow.input.goal}"
                }
              },
              {
                "name": "execute_approved_tool",
                "taskReferenceName": "approved_tool_call",
                "type": "CALL_MCP_TOOL",
                "inputParameters": {
                  "mcpServer": "${workflow.input.mcpServerUrl}",
                  "method": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}"
                }
              },
              {
                "name": "update_memory_approved",
                "taskReferenceName": "mem_update_approved",
                "type": "SET_VARIABLE",
                "inputParameters": {
                  "context": "${workflow.variables.context.concat([{action: plan.output.result.action, result: approved_tool_call.output.content, approved: true}])}"
                }
              }
            ],
            "execute": [
              {
                "name": "execute_tool",
                "taskReferenceName": "tool_call",
                "type": "CALL_MCP_TOOL",
                "inputParameters": {
                  "mcpServer": "${workflow.input.mcpServerUrl}",
                  "method": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}"
                }
              },
              {
                "name": "update_memory",
                "taskReferenceName": "mem_update",
                "type": "SET_VARIABLE",
                "inputParameters": {
                  "context": "${workflow.variables.context.concat([{action: plan.output.result.action, result: tool_call.output.content}])}"
                }
              }
            ]
          },
          "defaultCase": []
        }
      ]
    }
  ],
  "outputParameters": {
    "answer": "${loop.output.plan.output.result.answer}",
    "iterations": "${loop.output.iteration}",
    "actions_taken": "${workflow.variables.context}"
  },
  "failureWorkflow": "agent_compensation_workflow"
}
```
## What makes this production-ready

### Every step is a durable checkpoint

Each iteration of the `DO_WHILE` loop is persisted before the next begins. If the agent crashes at iteration 15 of 20, it resumes from iteration 15 — not from scratch. Every LLM prompt, response, tool call, and human decision is recorded.

### Human approval is a durable gate

The `HUMAN` task pauses the workflow indefinitely. The pause survives server restarts, deploys, and infrastructure changes. When a reviewer approves via the API or UI, the workflow resumes with the approval payload as task output. No polling, no timeouts (unless you configure one), no lost approvals.
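Resuming is a single call to the Task Update API. A sketch of the completion payload an approval service would `POST`, using Conductor's standard `TaskResult` fields — the IDs here are placeholders, and the `outputData` shape is whatever your downstream tasks expect:

```json
{
  "workflowInstanceId": "wf-1234",
  "taskId": "task-5678",
  "status": "COMPLETED",
  "outputData": {
    "approved": true,
    "approvedBy": "reviewer@example.com",
    "comment": "Safe to execute"
  }
}
```

Once the task completes, `outputData` is available to downstream tasks as the `HUMAN` task's output.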
### Retry is automatic and configurable

Every tool call (`CALL_MCP_TOOL`, `HTTP`, `SIMPLE`) inherits retry behavior from its task definition:

```json
{
  "name": "execute_tool",
  "retryCount": 3,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 2,
  "responseTimeoutSeconds": 30
}
```
If the MCP server is down, Conductor retries with exponential backoff. The LLM is not re-called — only the failed tool call retries.
### Memory persists across iterations

`SET_VARIABLE` stores accumulated context in workflow variables. These variables are persisted to durable storage and available to every subsequent task. The LLM receives the full history of actions and results on each iteration.

### Budget cap prevents runaway agents

The `loopCondition` checks both the agent's `done` flag and an iteration cap. You can also check token usage or cost in the condition. The agent terminates cleanly when the budget is exhausted.
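A sketch of a stricter condition that also enforces a token budget. This assumes a task inside the loop with reference name `track_usage` that accumulates usage and exposes a `tokens_total` output — the task name and the 100k limit are illustrative, not part of the reference workflow:

```json
{
  "loopCondition": "if ($.plan['result']['done'] == true) { false; } else if ($.loop['iteration'] >= $.maxIterations) { false; } else if ($.track_usage['tokens_total'] > 100000) { false; } else { true; }"
}
```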
### Compensation handles side effects

If the agent fails after taking real-world actions (sent an email, created a record, charged a payment), the `failureWorkflow` runs compensating tasks automatically. The compensation workflow receives the full execution context: which actions succeeded, which failed, and why.
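A minimal compensation workflow sketch. Conductor starts the `failureWorkflow` with details of the failed execution; the exact input field names vary by Conductor version, and the webhook URL below is illustrative:

```json
{
  "name": "agent_compensation_workflow",
  "version": 1,
  "schemaVersion": 2,
  "tasks": [
    {
      "name": "notify_oncall",
      "taskReferenceName": "notify",
      "type": "HTTP",
      "inputParameters": {
        "http_request": {
          "uri": "https://hooks.example.com/agent-failures",
          "method": "POST",
          "body": {
            "failedWorkflowId": "${workflow.input.workflowId}",
            "reason": "${workflow.input.reason}"
          }
        }
      }
    }
  ]
}
```

Real compensation tasks — revoking API calls, releasing resources — would follow the notification, driven by the failed run's recorded action history.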
### Observability is automatic

Open the Conductor UI to see:

- The exact task graph for this execution
- Every LLM prompt and response (click any `LLM_CHAT_COMPLETE` task)
- Every tool call with input, output, and timing
- Every human approval with who approved and when
- The iteration count and loop state
- Retry history for any failed task
- The full workflow input, output, and variables
## Extending the pattern

### Add parallel research

Replace a single tool call with `DYNAMIC_FORK` to fan out to multiple tools in parallel:

```json
{
  "name": "parallel_research",
  "taskReferenceName": "research",
  "type": "DYNAMIC_FORK",
  "inputParameters": {
    "dynamicTasks": "${plan.output.result.parallel_tasks}",
    "dynamicTasksInput": "${plan.output.result.task_inputs}"
  },
  "dynamicForkTasksParam": "dynamicTasks",
  "dynamicForkTasksInputParamName": "dynamicTasksInput"
}
```
The LLM decides how many tools to call in parallel and with what inputs. Conductor creates the branches at runtime.
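For this to work, the planner's prompt must ask for output in the shape the fork expects: a list of task definitions plus per-task inputs keyed by `taskReferenceName`. A hypothetical plan output (the tool names and queries are illustrative):

```json
{
  "parallel_tasks": [
    { "name": "call_tool", "taskReferenceName": "search_docs", "type": "CALL_MCP_TOOL" },
    { "name": "call_tool", "taskReferenceName": "search_web", "type": "CALL_MCP_TOOL" }
  ],
  "task_inputs": {
    "search_docs": { "method": "search_docs", "arguments": { "query": "retry semantics" } },
    "search_web": { "method": "web_search", "arguments": { "query": "conductor retry policy" } }
  }
}
```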
### Add a reflection / evaluation step

Insert an LLM-as-judge after tool execution to evaluate output quality:

```json
{
  "name": "evaluate_result",
  "taskReferenceName": "evaluator",
  "type": "LLM_CHAT_COMPLETE",
  "inputParameters": {
    "llmProvider": "anthropic",
    "model": "claude-sonnet-4-20250514",
    "messages": [
      {
        "role": "system",
        "message": "Evaluate this result against the goal. Is it sufficient? Respond with JSON: {\"quality\": \"good\" or \"insufficient\", \"feedback\": \"...\"}"
      },
      {
        "role": "user",
        "message": "Goal: ${workflow.input.goal}\nResult: ${tool_call.output.content}"
      }
    ]
  }
}
```

If the evaluator returns `insufficient`, the loop continues with the feedback as context for the next planning step.
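One way to wire the verdict in is a `SWITCH` right after the evaluator. A sketch — the branch contents depend on your loop design, and the empty cases are placeholders:

```json
{
  "name": "route_on_quality",
  "taskReferenceName": "quality_gate",
  "type": "SWITCH",
  "evaluatorType": "javascript",
  "expression": "$.verdict.quality == 'good' ? 'accept' : 'refine'",
  "inputParameters": {
    "verdict": "${evaluator.output.result}"
  },
  "decisionCases": {
    "accept": [],
    "refine": []
  },
  "defaultCase": []
}
```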
### Add long waits

Insert a `WAIT` task for time-based pauses (rate limiting, cooldown periods, scheduled actions):

```json
{
  "name": "wait_before_retry",
  "taskReferenceName": "cooldown",
  "type": "WAIT",
  "inputParameters": {
    "duration": "1 hour"
  }
}
```
The wait is durable. The workflow does not consume resources while waiting. After 1 hour — even if the server restarted during that time — the workflow resumes.
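Besides `duration`, the `WAIT` task also accepts an absolute timestamp via `until`, which suits scheduled actions better than a relative delay. A sketch — the accepted timestamp format varies by Conductor version, so check your server's documentation:

```json
{
  "name": "wait_until_window",
  "taskReferenceName": "scheduled_wait",
  "type": "WAIT",
  "inputParameters": {
    "until": "2025-07-01 09:30 UTC"
  }
}
```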
### Delegate to specialist agents

Use `SUB_WORKFLOW` to spawn a child agent for a specialized task:

```json
{
  "name": "delegate_to_researcher",
  "taskReferenceName": "research_agent",
  "type": "SUB_WORKFLOW",
  "inputParameters": {
    "name": "research_agent_workflow",
    "version": 1,
    "input": {
      "topic": "${plan.output.result.research_topic}",
      "mcpServerUrl": "${workflow.input.mcpServerUrl}"
    }
  }
}
```
The parent agent waits for the child to complete. If the child fails, the parent's failure handling kicks in. The entire agent tree is observable in the UI — drill from parent to child to sub-child.
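If the parent should not block, `START_WORKFLOW` launches the child and continues immediately (fire-and-forget). A sketch using the `startWorkflow` input shape — verify the exact parameter names against your Conductor version:

```json
{
  "name": "spawn_researcher",
  "taskReferenceName": "spawn_research",
  "type": "START_WORKFLOW",
  "inputParameters": {
    "startWorkflow": {
      "name": "research_agent_workflow",
      "version": 1,
      "input": {
        "topic": "${plan.output.result.research_topic}"
      }
    }
  }
}
```

The trade-off: the parent gets the child's workflow ID, not its result, and child failures do not propagate to the parent.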
## The primitives, mapped

| "I need my agent to..." | Use this | Why |
|---|---|---|
| Wait for a tool callback | `HUMAN` task or async completion | Durable pause. Resumes on API signal with payload. |
| Sleep until a retry window | `WAIT` task | Timer-based durable pause. Zero resource consumption. |
| Pick the next tool at runtime | `DYNAMIC` task | LLM output determines task type. Resolved at execution time. |
| Call multiple tools in parallel | `FORK`/`JOIN` or `DYNAMIC_FORK` | Static or runtime-determined parallelism. Join waits for all. |
| Loop until goal is met | `DO_WHILE` | Checkpointed loop. Each iteration persisted. |
| Delegate to a specialist agent | `SUB_WORKFLOW` or `START_WORKFLOW` | Child workflow with full lifecycle management. |
| Accumulate context across steps | `SET_VARIABLE` | Workflow variables persisted to durable storage. |
| Evaluate output quality | `LLM_CHAT_COMPLETE` as evaluator | LLM-as-judge pattern inside the loop. |
| Cap iterations or cost | `DO_WHILE` `loopCondition` | Check iteration count, token usage, or cost. |
| Undo side effects on failure | `failureWorkflow` | Compensation tasks run automatically on workflow failure. |
| Pause for human review | `HUMAN` task | Indefinite durable pause. Survives restarts and deploys. |
| Resume on external event | `HUMAN` task + API/webhook | External system calls Task Update API with payload. |
| Post-process structured output | `INLINE` (JavaScript) or `JSON_JQ_TRANSFORM` | Server-side transforms without a worker. |
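For example, an `INLINE` task can reshape an LLM's output server-side with no worker involved. A sketch — the available `evaluatorType` values depend on your Conductor build (`graaljs` is common), and the task names are illustrative:

```json
{
  "name": "extract_answer",
  "taskReferenceName": "postprocess",
  "type": "INLINE",
  "inputParameters": {
    "evaluatorType": "graaljs",
    "expression": "(function () { return { answer: $.planResult.answer }; })();",
    "planResult": "${plan.output.result}"
  }
}
```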
## Next steps
- Failure Semantics for AI Agents — The exact failure contract: what happens under crashes, retries, duplicates, and long waits.
- Why Conductor for Agents — What Conductor gives you out of the box for agentic workflows.
- Build Your First AI Agent — Start simple and build up to this architecture in 5 minutes.
- MCP Integration — Connect to any MCP server, expose workflows as MCP tools.
- Token Efficiency — How durable execution saves tokens and reduces LLM costs.