# Production agent architecture

This is the reference architecture for a durable AI agent on Conductor. Not a toy, not a feature list: the exact pattern for an agent that plans, acts, waits, recovers, and runs in production.

## Architecture diagram

## The canonical agent pattern

A production agent has a fixed set of concerns. Each one maps to a specific Conductor primitive:
| Agent concern | Conductor primitive | How it works |
|---|---|---|
| Plan next action | `LLM_CHAT_COMPLETE` | LLM receives goal + context + tool list, returns structured plan |
| Select tool at runtime | `DYNAMIC` task | LLM output determines which task type executes next |
| Execute tool | `CALL_MCP_TOOL`, `HTTP`, or `SIMPLE` worker | Tool runs with retry policy, timeout, and full I/O recording |
| Retry with backoff | Task definition `retryLogic` | `FIXED`, `EXPONENTIAL_BACKOFF`, or `LINEAR_BACKOFF` — no code needed |
| Parallel tool calls | `FORK`/`JOIN` or `DYNAMIC_FORK` | Fan out to N tools in parallel, join when all complete |
| Memory / context handoff | `SET_VARIABLE` + workflow variables | Accumulate results across loop iterations; pass to next LLM call |
| Human approval gate | `HUMAN` task | Durable pause. Survives restarts and deploys. Resumes on API signal. |
| Long wait (hours/days) | `WAIT` task | Timer-based durable pause. Survives server restarts. |
| Resume from external event | `HUMAN` task + webhook/API | External system calls Task Update API. Workflow resumes with payload. |
| Reflection / evaluation loop | `DO_WHILE` with LLM-as-judge | Second LLM evaluates output quality; loop continues if below threshold |
| Budget / iteration cap | `DO_WHILE` `loopCondition` | `iteration < maxIterations` or token/cost check in loop condition |
| Termination criteria | `DO_WHILE` exit + `SWITCH` | LLM sets `done: true`, or evaluator decides goal is met |
| Delegate to specialist | `SUB_WORKFLOW` or `START_WORKFLOW` | Spawn child agent. Parent waits. Failure propagates. Full observability across the tree. |
| Compensation on failure | `failureWorkflow` | Undo side effects: revoke API calls, send notifications, release resources |
| Audit trail | Automatic | Every task's input, output, timing, retry count, and worker ID is persisted |
## End-to-end workflow

Here is the complete agent as a single Conductor workflow. Every step is a native system task or operator — no custom code, no external framework.
```json
{
  "name": "production_agent",
  "description": "Reference architecture: durable production agent",
  "version": 1,
  "schemaVersion": 2,
  "inputParameters": ["goal", "mcpServerUrl", "maxIterations"],
  "tasks": [
    {
      "name": "discover_tools",
      "taskReferenceName": "discover",
      "type": "LIST_MCP_TOOLS",
      "inputParameters": {
        "mcpServer": "${workflow.input.mcpServerUrl}"
      }
    },
    {
      "name": "initialize_memory",
      "taskReferenceName": "init_memory",
      "type": "SET_VARIABLE",
      "inputParameters": {
        "context": [],
        "actions_taken": []
      }
    },
    {
      "name": "agent_loop",
      "taskReferenceName": "loop",
      "type": "DO_WHILE",
      "loopCondition": "if ($.plan['result']['done'] == true) { false; } else if ($.loop['iteration'] >= $.maxIterations) { false; } else { true; }",
      "inputParameters": {
        "maxIterations": "${workflow.input.maxIterations}"
      },
      "loopOver": [
        {
          "name": "plan_next_action",
          "taskReferenceName": "plan",
          "type": "LLM_CHAT_COMPLETE",
          "inputParameters": {
            "llmProvider": "anthropic",
            "model": "claude-sonnet-4-20250514",
            "messages": [
              {
                "role": "system",
                "message": "You are a production AI agent. Goal: ${workflow.input.goal}\n\nAvailable tools: ${discover.output.tools}\n\nPrevious actions and results: ${workflow.variables.context}\n\nDecide the next action. Respond with JSON:\n- To use a tool: {\"action\": \"tool_name\", \"arguments\": {}, \"reasoning\": \"why\", \"needs_approval\": true/false, \"done\": false}\n- To finish: {\"answer\": \"final answer\", \"done\": true}"
              }
            ],
            "temperature": 0.1,
            "maxTokens": 1000
          }
        },
        {
          "name": "check_if_done",
          "taskReferenceName": "done_check",
          "type": "SWITCH",
          "evaluatorType": "javascript",
          "expression": "$.planResult.done ? 'done' : ($.planResult.needs_approval ? 'needs_approval' : 'execute')",
          "inputParameters": {
            "planResult": "${plan.output.result}"
          },
          "decisionCases": {
            "needs_approval": [
              {
                "name": "human_approval",
                "taskReferenceName": "approval",
                "type": "HUMAN",
                "inputParameters": {
                  "plannedAction": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}",
                  "reasoning": "${plan.output.result.reasoning}",
                  "goal": "${workflow.input.goal}"
                }
              },
              {
                "name": "execute_approved_tool",
                "taskReferenceName": "approved_tool_call",
                "type": "CALL_MCP_TOOL",
                "inputParameters": {
                  "mcpServer": "${workflow.input.mcpServerUrl}",
                  "method": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}"
                }
              },
              {
                "name": "update_memory_approved",
                "taskReferenceName": "mem_update_approved",
                "type": "SET_VARIABLE",
                "inputParameters": {
                  "context": "${workflow.variables.context.concat([{action: plan.output.result.action, result: approved_tool_call.output.content, approved: true}])}"
                }
              }
            ],
            "execute": [
              {
                "name": "execute_tool",
                "taskReferenceName": "tool_call",
                "type": "CALL_MCP_TOOL",
                "inputParameters": {
                  "mcpServer": "${workflow.input.mcpServerUrl}",
                  "method": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}"
                }
              },
              {
                "name": "update_memory",
                "taskReferenceName": "mem_update",
                "type": "SET_VARIABLE",
                "inputParameters": {
                  "context": "${workflow.variables.context.concat([{action: plan.output.result.action, result: tool_call.output.content}])}"
                }
              }
            ]
          },
          "defaultCase": []
        }
      ]
    }
  ],
  "outputParameters": {
    "answer": "${loop.output.plan.output.result.answer}",
    "iterations": "${loop.output.iteration}",
    "actions_taken": "${workflow.variables.context}"
  },
  "failureWorkflow": "agent_compensation_workflow"
}
```
## What makes this production-ready

### Every step is a durable checkpoint

Each iteration of the `DO_WHILE` loop is persisted before the next begins. If the agent crashes at iteration 15 of 20, it resumes from iteration 15 — not from scratch. Every LLM prompt, response, tool call, and human decision is recorded.

### Human approval is a durable gate

The `HUMAN` task pauses the workflow indefinitely. The pause survives server restarts, deploys, and infrastructure changes. When a reviewer approves via the API or UI, the workflow resumes with the approval payload as task output. No polling, no timeouts (unless you configure one), no lost approvals.
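Resuming is a single call to the Task Update API. A sketch of the completion payload an approval service would `POST`, using Conductor's standard `TaskResult` fields — the IDs here are placeholders, and the `outputData` shape is whatever your downstream tasks expect:

```json
{
  "workflowInstanceId": "wf-1234",
  "taskId": "task-5678",
  "status": "COMPLETED",
  "outputData": {
    "approved": true,
    "approvedBy": "reviewer@example.com",
    "comment": "Safe to execute"
  }
}
```

Once the task completes, `outputData` is available to downstream tasks as the `HUMAN` task's output.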
### Retry is automatic and configurable

Every tool call (`CALL_MCP_TOOL`, `HTTP`, `SIMPLE`) inherits retry behavior from its task definition:

```json
{
  "name": "execute_tool",
  "retryCount": 3,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 2,
  "responseTimeoutSeconds": 30
}
```
If the MCP server is down, Conductor retries with exponential backoff. The LLM is not re-called — only the failed tool call retries.
### Memory persists across iterations

`SET_VARIABLE` stores accumulated context in workflow variables. These variables are persisted to durable storage and available to every subsequent task. The LLM receives the full history of actions and results on each iteration.

### Budget cap prevents runaway agents

The `loopCondition` checks both the agent's `done` flag and an iteration cap. You can also check token usage or cost in the condition. The agent terminates cleanly when the budget is exhausted.
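A sketch of a stricter condition that also enforces a token budget. This assumes a task inside the loop with reference name `track_usage` that accumulates usage and exposes a `tokens_total` output — the task name and the 100k limit are illustrative, not part of the reference workflow:

```json
{
  "loopCondition": "if ($.plan['result']['done'] == true) { false; } else if ($.loop['iteration'] >= $.maxIterations) { false; } else if ($.track_usage['tokens_total'] > 100000) { false; } else { true; }"
}
```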
### Compensation handles side effects

If the agent fails after taking real-world actions (sent an email, created a record, charged a payment), the `failureWorkflow` runs compensating tasks automatically. The compensation workflow receives the full execution context: which actions succeeded, which failed, and why.
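A minimal compensation workflow sketch. Conductor starts the `failureWorkflow` with details of the failed execution; the exact input field names vary by Conductor version, and the webhook URL below is illustrative:

```json
{
  "name": "agent_compensation_workflow",
  "version": 1,
  "schemaVersion": 2,
  "tasks": [
    {
      "name": "notify_oncall",
      "taskReferenceName": "notify",
      "type": "HTTP",
      "inputParameters": {
        "http_request": {
          "uri": "https://hooks.example.com/agent-failures",
          "method": "POST",
          "body": {
            "failedWorkflowId": "${workflow.input.workflowId}",
            "reason": "${workflow.input.reason}"
          }
        }
      }
    }
  ]
}
```

Real compensation tasks — revoking API calls, releasing resources — would follow the notification, driven by the failed run's recorded action history.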
### Observability is automatic

Open the Conductor UI to see:

- The exact task graph for this execution
- Every LLM prompt and response (click any `LLM_CHAT_COMPLETE` task)
- Every tool call with input, output, and timing
- Every human approval with who approved and when
- The iteration count and loop state
- Retry history for any failed task
- The full workflow input, output, and variables
## Extending the pattern

### Add parallel research

Replace a single tool call with `DYNAMIC_FORK` to fan out to multiple tools in parallel:

```json
{
  "name": "parallel_research",
  "taskReferenceName": "research",
  "type": "DYNAMIC_FORK",
  "inputParameters": {
    "dynamicTasks": "${plan.output.result.parallel_tasks}",
    "dynamicTasksInput": "${plan.output.result.task_inputs}"
  },
  "dynamicForkTasksParam": "dynamicTasks",
  "dynamicForkTasksInputParamName": "dynamicTasksInput"
}
```
The LLM decides how many tools to call in parallel and with what inputs. Conductor creates the branches at runtime.
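For this to work, the planner's prompt must ask for output in the shape the fork expects: a list of task definitions plus per-task inputs keyed by `taskReferenceName`. A hypothetical plan output (the tool names and queries are illustrative):

```json
{
  "parallel_tasks": [
    { "name": "call_tool", "taskReferenceName": "search_docs", "type": "CALL_MCP_TOOL" },
    { "name": "call_tool", "taskReferenceName": "search_web", "type": "CALL_MCP_TOOL" }
  ],
  "task_inputs": {
    "search_docs": { "method": "search_docs", "arguments": { "query": "retry semantics" } },
    "search_web": { "method": "web_search", "arguments": { "query": "conductor retry policy" } }
  }
}
```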
### Add a reflection / evaluation step

Insert an LLM-as-judge after tool execution to evaluate output quality:

```json
{
  "name": "evaluate_result",
  "taskReferenceName": "evaluator",
  "type": "LLM_CHAT_COMPLETE",
  "inputParameters": {
    "llmProvider": "anthropic",
    "model": "claude-sonnet-4-20250514",
    "messages": [
      {
        "role": "system",
        "message": "Evaluate this result against the goal. Is it sufficient? Respond with JSON: {\"quality\": \"good\" or \"insufficient\", \"feedback\": \"...\"}"
      },
      {
        "role": "user",
        "message": "Goal: ${workflow.input.goal}\nResult: ${tool_call.output.content}"
      }
    ]
  }
}
```

If the evaluator returns `insufficient`, the loop continues with the feedback as context for the next planning step.
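One way to wire the verdict in is a `SWITCH` right after the evaluator. A sketch — the branch contents depend on your loop design, and the empty cases are placeholders:

```json
{
  "name": "route_on_quality",
  "taskReferenceName": "quality_gate",
  "type": "SWITCH",
  "evaluatorType": "javascript",
  "expression": "$.verdict.quality == 'good' ? 'accept' : 'refine'",
  "inputParameters": {
    "verdict": "${evaluator.output.result}"
  },
  "decisionCases": {
    "accept": [],
    "refine": []
  },
  "defaultCase": []
}
```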
### Add long waits

Insert a `WAIT` task for time-based pauses (rate limiting, cooldown periods, scheduled actions):

```json
{
  "name": "wait_before_retry",
  "taskReferenceName": "cooldown",
  "type": "WAIT",
  "inputParameters": {
    "duration": "1 hour"
  }
}
```
The wait is durable. The workflow does not consume resources while waiting. After 1 hour — even if the server restarted during that time — the workflow resumes.
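Besides `duration`, the `WAIT` task also accepts an absolute timestamp via `until`, which suits scheduled actions better than a relative delay. A sketch — the accepted timestamp format varies by Conductor version, so check your server's documentation:

```json
{
  "name": "wait_until_window",
  "taskReferenceName": "scheduled_wait",
  "type": "WAIT",
  "inputParameters": {
    "until": "2025-07-01 09:30 UTC"
  }
}
```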
### Delegate to specialist agents

Use `SUB_WORKFLOW` to spawn a child agent for a specialized task:

```json
{
  "name": "delegate_to_researcher",
  "taskReferenceName": "research_agent",
  "type": "SUB_WORKFLOW",
  "inputParameters": {
    "name": "research_agent_workflow",
    "version": 1,
    "input": {
      "topic": "${plan.output.result.research_topic}",
      "mcpServerUrl": "${workflow.input.mcpServerUrl}"
    }
  }
}
```
The parent agent waits for the child to complete. If the child fails, the parent's failure handling kicks in. The entire agent tree is observable in the UI — drill from parent to child to sub-child.
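If the parent should not block, `START_WORKFLOW` launches the child and continues immediately (fire-and-forget). A sketch using the `startWorkflow` input shape — verify the exact parameter names against your Conductor version:

```json
{
  "name": "spawn_researcher",
  "taskReferenceName": "spawn_research",
  "type": "START_WORKFLOW",
  "inputParameters": {
    "startWorkflow": {
      "name": "research_agent_workflow",
      "version": 1,
      "input": {
        "topic": "${plan.output.result.research_topic}"
      }
    }
  }
}
```

The trade-off: the parent gets the child's workflow ID, not its result, and child failures do not propagate to the parent.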
## The primitives, mapped

| "I need my agent to..." | Use this | Why |
|---|---|---|
| Wait for a tool callback | `HUMAN` task or async completion | Durable pause. Resumes on API signal with payload. |
| Sleep until a retry window | `WAIT` task | Timer-based durable pause. Zero resource consumption. |
| Pick the next tool at runtime | `DYNAMIC` task | LLM output determines task type. Resolved at execution time. |
| Call multiple tools in parallel | `FORK`/`JOIN` or `DYNAMIC_FORK` | Static or runtime-determined parallelism. Join waits for all. |
| Loop until goal is met | `DO_WHILE` | Checkpointed loop. Each iteration persisted. |
| Delegate to a specialist agent | `SUB_WORKFLOW` or `START_WORKFLOW` | Child workflow with full lifecycle management. |
| Accumulate context across steps | `SET_VARIABLE` | Workflow variables persisted to durable storage. |
| Evaluate output quality | `LLM_CHAT_COMPLETE` as evaluator | LLM-as-judge pattern inside the loop. |
| Cap iterations or cost | `DO_WHILE` `loopCondition` | Check iteration count, token usage, or cost. |
| Undo side effects on failure | `failureWorkflow` | Compensation tasks run automatically on workflow failure. |
| Pause for human review | `HUMAN` task | Indefinite durable pause. Survives restarts and deploys. |
| Resume on external event | `HUMAN` task + API/webhook | External system calls Task Update API with payload. |
| Post-process structured output | `INLINE` (JavaScript) or `JSON_JQ_TRANSFORM` | Server-side transforms without a worker. |
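For example, an `INLINE` task can reshape an LLM's output server-side with no worker involved. A sketch — the available `evaluatorType` values depend on your Conductor build (`graaljs` is common), and the task names are illustrative:

```json
{
  "name": "extract_answer",
  "taskReferenceName": "postprocess",
  "type": "INLINE",
  "inputParameters": {
    "evaluatorType": "graaljs",
    "expression": "(function () { return { answer: $.planResult.answer }; })();",
    "planResult": "${plan.output.result}"
  }
}
```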
## Next steps
- Failure Semantics for AI Agents — The exact failure contract: what happens under crashes, retries, duplicates, and long waits.
- Why Conductor for Agents — What Conductor gives you out of the box for agentic workflows.
- Build Your First AI Agent — Start simple and build up to this architecture in 5 minutes.
- MCP Integration — Connect to any MCP server, expose workflows as MCP tools.
- Token Efficiency — How durable execution saves tokens and reduces LLM costs.