
Production agent architecture

This is the reference architecture for a durable AI agent on Conductor. Not a toy. Not a feature list. This is the exact pattern for an agent that plans, acts, waits, recovers, and runs in production.

Architecture diagram

The diagram shows a DO_WHILE agent loop, checkpointed per iteration: Start → Discover Tools (LIST_MCP_TOOLS) → Initialize Memory (SET_VARIABLE) → Plan Next Action (LLM_CHAT_COMPLETE) → SWITCH on the plan. If done = true, the loop exits to End. If needs_approval, a Human Approval step (HUMAN, a durable pause) gates execution. Otherwise Execute Tool (CALL_MCP_TOOL, with auto-retry) runs, then Update Memory (SET_VARIABLE) and a budget check decide whether to start the next iteration or stop because the budget is exceeded. On failure, the failureWorkflow runs compensation. Every step is persisted: prompt, response, tokens, timing.

The canonical agent pattern

A production agent has these concerns. Each one maps to a specific Conductor primitive:

| Agent concern | Conductor primitive | How it works |
|---|---|---|
| Plan next action | LLM_CHAT_COMPLETE | LLM receives goal + context + tool list, returns structured plan |
| Select tool at runtime | DYNAMIC task | LLM output determines which task type executes next |
| Execute tool | CALL_MCP_TOOL, HTTP, or SIMPLE worker | Tool runs with retry policy, timeout, and full I/O recording |
| Retry with backoff | Task definition retryLogic | FIXED, EXPONENTIAL_BACKOFF, or LINEAR_BACKOFF — no code needed |
| Parallel tool calls | FORK/JOIN or DYNAMIC_FORK | Fan out to N tools in parallel, join when all complete |
| Memory / context handoff | SET_VARIABLE + workflow variables | Accumulate results across loop iterations; pass to next LLM call |
| Human approval gate | HUMAN task | Durable pause. Survives restarts and deploys. Resumes on API signal. |
| Long wait (hours/days) | WAIT task | Timer-based durable pause. Survives server restarts. |
| Resume from external event | HUMAN task + webhook/API | External system calls Task Update API. Workflow resumes with payload. |
| Reflection / evaluation loop | DO_WHILE with LLM-as-judge | Second LLM evaluates output quality; loop continues if below threshold |
| Budget / iteration cap | DO_WHILE loopCondition | iteration < maxIterations or token/cost check in loop condition |
| Termination criteria | DO_WHILE exit + SWITCH | LLM sets done: true, or evaluator decides goal is met |
| Delegate to specialist | SUB_WORKFLOW or START_WORKFLOW | Spawn child agent. Parent waits. Failure propagates. Full observability across the tree. |
| Compensation on failure | failureWorkflow | Undo side effects: revoke API calls, send notifications, release resources |
| Audit trail | Automatic | Every task's input, output, timing, retry count, and worker ID is persisted |

End-to-end workflow

Here is the complete agent as a single Conductor workflow. Every step is a native system task or operator — no custom code, no external framework.

{
  "name": "production_agent",
  "description": "Reference architecture: durable production agent",
  "version": 1,
  "schemaVersion": 2,
  "inputParameters": ["goal", "mcpServerUrl", "maxIterations"],
  "tasks": [
    {
      "name": "discover_tools",
      "taskReferenceName": "discover",
      "type": "LIST_MCP_TOOLS",
      "inputParameters": {
        "mcpServer": "${workflow.input.mcpServerUrl}"
      }
    },
    {
      "name": "initialize_memory",
      "taskReferenceName": "init_memory",
      "type": "SET_VARIABLE",
      "inputParameters": {
        "context": [],
        "actions_taken": []
      }
    },
    {
      "name": "agent_loop",
      "taskReferenceName": "loop",
      "type": "DO_WHILE",
      "loopCondition": "if ($.loop['plan'].output.result.done == true) { false; } else if ($.loop['plan'].output.iteration >= $.maxIterations) { false; } else { true; }",
      "inputParameters": {
        "maxIterations": "${workflow.input.maxIterations}"
      },
      "loopOver": [
        {
          "name": "plan_next_action",
          "taskReferenceName": "plan",
          "type": "LLM_CHAT_COMPLETE",
          "inputParameters": {
            "llmProvider": "anthropic",
            "model": "claude-sonnet-4-20250514",
            "messages": [
              {
                "role": "system",
                "message": "You are a production AI agent. Goal: ${workflow.input.goal}\n\nAvailable tools: ${discover.output.tools}\n\nPrevious actions and results: ${workflow.variables.context}\n\nDecide the next action. Respond with JSON:\n- To use a tool: {\"action\": \"tool_name\", \"arguments\": {}, \"reasoning\": \"why\", \"needs_approval\": true/false, \"done\": false}\n- To finish: {\"answer\": \"final answer\", \"done\": true}"
              }
            ],
            "temperature": 0.1,
            "maxTokens": 1000
          }
        },
        {
          "name": "check_if_done",
          "taskReferenceName": "done_check",
          "type": "SWITCH",
          "evaluatorType": "javascript",
          "expression": "$.plan.output.result.done ? 'done' : ($.plan.output.result.needs_approval ? 'needs_approval' : 'execute')",
          "decisionCases": {
            "needs_approval": [
              {
                "name": "human_approval",
                "taskReferenceName": "approval",
                "type": "HUMAN",
                "inputParameters": {
                  "plannedAction": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}",
                  "reasoning": "${plan.output.result.reasoning}",
                  "goal": "${workflow.input.goal}"
                }
              },
              {
                "name": "execute_approved_tool",
                "taskReferenceName": "approved_tool_call",
                "type": "CALL_MCP_TOOL",
                "inputParameters": {
                  "mcpServer": "${workflow.input.mcpServerUrl}",
                  "method": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}"
                }
              },
              {
                "name": "append_context_approved",
                "taskReferenceName": "append_ctx_approved",
                "type": "INLINE",
                "inputParameters": {
                  "evaluatorType": "graaljs",
                  "existing": "${workflow.variables.context}",
                  "action": "${plan.output.result.action}",
                  "result": "${approved_tool_call.output.content}",
                  "expression": "(function(){ var ctx = $.existing || []; ctx.push({action: $.action, result: $.result, approved: true}); return ctx; })();"
                }
              },
              {
                "name": "update_memory_approved",
                "taskReferenceName": "mem_update_approved",
                "type": "SET_VARIABLE",
                "inputParameters": {
                  "context": "${append_ctx_approved.output.result}"
                }
              }
            ],
            "execute": [
              {
                "name": "execute_tool",
                "taskReferenceName": "tool_call",
                "type": "CALL_MCP_TOOL",
                "inputParameters": {
                  "mcpServer": "${workflow.input.mcpServerUrl}",
                  "method": "${plan.output.result.action}",
                  "arguments": "${plan.output.result.arguments}"
                }
              },
              {
                "name": "append_context",
                "taskReferenceName": "append_ctx",
                "type": "INLINE",
                "inputParameters": {
                  "evaluatorType": "graaljs",
                  "existing": "${workflow.variables.context}",
                  "action": "${plan.output.result.action}",
                  "result": "${tool_call.output.content}",
                  "expression": "(function(){ var ctx = $.existing || []; ctx.push({action: $.action, result: $.result}); return ctx; })();"
                }
              },
              {
                "name": "update_memory",
                "taskReferenceName": "mem_update",
                "type": "SET_VARIABLE",
                "inputParameters": {
                  "context": "${append_ctx.output.result}"
                }
              }
            ]
          },
          "defaultCase": []
        }
      ]
    }
  ],
  "outputParameters": {
    "answer": "${loop.output.plan.output.result.answer}",
    "iterations": "${loop.output.iteration}",
    "actions_taken": "${workflow.variables.context}"
  },
  "failureWorkflow": "agent_compensation_workflow"
}

What makes this production-ready

Every step is a durable checkpoint

Each iteration of DO_WHILE is persisted before the next begins. If the agent crashes at iteration 15 of 20, it resumes from iteration 15 — not from scratch. Every LLM prompt, response, tool call, and human decision is recorded.

Human approval is a durable gate

The HUMAN task pauses the workflow indefinitely. The pause survives server restarts, deploys, and infrastructure changes. When a reviewer approves via the API or UI, the workflow resumes with the approval payload as task output. No polling, no timeouts (unless you configure one), no lost approvals.
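A reviewer's decision reaches the workflow through Conductor's Task Update API (POST /api/tasks), which takes a standard TaskResult body. A sketch of the approval payload — the IDs and outputData fields are illustrative placeholders, not values from this workflow:

```json
{
  "workflowInstanceId": "<workflow execution id>",
  "taskId": "<id of the waiting HUMAN task>",
  "status": "COMPLETED",
  "outputData": {
    "approved": true,
    "approvedBy": "reviewer@example.com"
  }
}
```

Whatever you put in outputData becomes the HUMAN task's output, available to downstream tasks via `${approval.output.approved}`.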

Retry is automatic and configurable

Every tool call (CALL_MCP_TOOL, HTTP, SIMPLE) inherits retry behavior from its task definition:

{
  "name": "execute_tool",
  "retryCount": 3,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 2,
  "responseTimeoutSeconds": 30
}

If the MCP server is down, Conductor retries with exponential backoff. The LLM is not re-called — only the failed tool call retries.

Memory persists across iterations

SET_VARIABLE stores accumulated context in workflow variables. These variables are persisted to durable storage and available to every subsequent task. The LLM receives the full history of actions and results on each iteration.
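For illustration, here is roughly what the accumulated context variable could look like after two iterations — the tool names and results are invented, and the `approved` flag appears only on actions that went through the approval gate:

```json
{
  "context": [
    { "action": "search_docs", "result": "Found 3 matching articles on refund policy." },
    { "action": "send_email", "result": "Email queued for delivery.", "approved": true }
  ]
}
```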

Budget cap prevents runaway agents

The loopCondition checks both the agent's done flag and an iteration cap. You can also check token usage or cost in the condition. The agent terminates cleanly when the budget is exhausted.
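A sketch of a loopCondition that adds a token budget on top of the iteration cap — this assumes the planning task's output exposes a usage.total_tokens field, which depends on your LLM provider integration, so verify the actual output shape before relying on it:

```json
"loopCondition": "if ($.plan['result']['done'] == true) { false; } else if ($.loop['iteration'] >= $.maxIterations) { false; } else if ($.plan['usage']['total_tokens'] > 50000) { false; } else { true; }"
```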

Compensation handles side effects

If the agent fails after taking real-world actions (sent an email, created a record, charged a payment), the failureWorkflow runs compensating tasks automatically. The compensation workflow receives the full execution context: which actions succeeded, which failed, and why.
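As a minimal sketch, a compensation workflow might call an internal rollback endpoint with the failure context. The URI and body fields here are hypothetical; Conductor supplies the failed workflow's ID and failure reason as inputs to the failure workflow, but confirm the exact input field names against your Conductor version:

```json
{
  "name": "agent_compensation_workflow",
  "version": 1,
  "schemaVersion": 2,
  "tasks": [
    {
      "name": "rollback_actions",
      "taskReferenceName": "rollback",
      "type": "HTTP",
      "inputParameters": {
        "http_request": {
          "uri": "https://internal.example.com/agent/rollback",
          "method": "POST",
          "body": {
            "failedWorkflowId": "${workflow.input.workflowId}",
            "reason": "${workflow.input.reason}"
          }
        }
      }
    }
  ]
}
```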

Observability is automatic

Open the Conductor UI to see:

  • The exact task graph for this execution
  • Every LLM prompt and response (click any LLM_CHAT_COMPLETE task)
  • Every tool call with input, output, and timing
  • Every human approval with who approved and when
  • The iteration count and loop state
  • Retry history for any failed task
  • The full workflow input, output, and variables

Extending the pattern

Add parallel research

Replace a single tool call with DYNAMIC_FORK to fan out to multiple tools in parallel:

{
  "name": "parallel_research",
  "taskReferenceName": "research",
  "type": "DYNAMIC_FORK",
  "inputParameters": {
    "dynamicTasks": "${plan.output.result.parallel_tasks}",
    "dynamicTasksInput": "${plan.output.result.task_inputs}"
  },
  "dynamicForkTasksParam": "dynamicTasks",
  "dynamicForkTasksInputParamName": "dynamicTasksInput"
}

The LLM decides how many tools to call in parallel and with what inputs. Conductor creates the branches at runtime; a JOIN task placed after the fork waits for all branches and collects their outputs.
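For this to work, the planning LLM must emit both the task list and a per-branch input map keyed by taskReferenceName. A sketch of the expected output shape, with illustrative tool names:

```json
{
  "parallel_tasks": [
    { "name": "call_search_tool", "taskReferenceName": "search_1", "type": "CALL_MCP_TOOL" },
    { "name": "call_fetch_tool", "taskReferenceName": "fetch_1", "type": "CALL_MCP_TOOL" }
  ],
  "task_inputs": {
    "search_1": { "method": "search", "arguments": { "query": "pricing" } },
    "fetch_1": { "method": "fetch_page", "arguments": { "url": "https://example.com" } }
  }
}
```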

Add a reflection / evaluation step

Insert an LLM-as-judge after tool execution to evaluate output quality:

{
  "name": "evaluate_result",
  "taskReferenceName": "evaluator",
  "type": "LLM_CHAT_COMPLETE",
  "inputParameters": {
    "llmProvider": "anthropic",
    "model": "claude-sonnet-4-20250514",
    "messages": [
      {
        "role": "system",
        "message": "Evaluate this result against the goal. Is it sufficient? Respond with JSON: {\"quality\": \"good\" or \"insufficient\", \"feedback\": \"...\"}"
      },
      {
        "role": "user",
        "message": "Goal: ${workflow.input.goal}\nResult: ${tool_call.output.content}"
      }
    ]
  }
}

If the evaluator returns insufficient, the loop continues with the feedback as context for the next planning step.

Add long waits

Insert a WAIT task for time-based pauses (rate limiting, cooldown periods, scheduled actions):

{
  "name": "wait_before_retry",
  "taskReferenceName": "cooldown",
  "type": "WAIT",
  "inputParameters": {
    "duration": "1 hour"
  }
}

The wait is durable. The workflow does not consume resources while waiting. After 1 hour — even if the server restarted during that time — the workflow resumes.
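WAIT can also pause until an absolute time rather than for a duration, using the until parameter. The datetime format below is illustrative — check the WAIT task reference for the formats your Conductor version accepts:

```json
{
  "name": "wait_until_window",
  "taskReferenceName": "scheduled_wait",
  "type": "WAIT",
  "inputParameters": {
    "until": "2025-12-01 09:00 GMT"
  }
}
```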

Delegate to specialist agents

Use SUB_WORKFLOW to spawn a child agent for a specialized task:

{
  "name": "delegate_to_researcher",
  "taskReferenceName": "research_agent",
  "type": "SUB_WORKFLOW",
  "inputParameters": {
    "name": "research_agent_workflow",
    "version": 1,
    "input": {
      "topic": "${plan.output.result.research_topic}",
      "mcpServerUrl": "${workflow.input.mcpServerUrl}"
    }
  }
}

The parent agent waits for the child to complete. If the child fails, the parent's failure handling kicks in. The entire agent tree is observable in the UI — drill from parent to child to sub-child.

The primitives, mapped

"I need my agent to..." Use this Why
Wait for a tool callback HUMAN task or async completion Durable pause. Resumes on API signal with payload.
Sleep until a retry window WAIT task Timer-based durable pause. Zero resource consumption.
Pick the next tool at runtime DYNAMIC task LLM output determines task type. Resolved at execution time.
Call multiple tools in parallel FORK/JOIN or DYNAMIC_FORK Static or runtime-determined parallelism. Join waits for all.
Loop until goal is met DO_WHILE Checkpointed loop. Each iteration persisted.
Delegate to a specialist agent SUB_WORKFLOW or START_WORKFLOW Child workflow with full lifecycle management.
Accumulate context across steps SET_VARIABLE Workflow variables persisted to durable storage.
Evaluate output quality LLM_CHAT_COMPLETE as evaluator LLM-as-judge pattern inside the loop.
Cap iterations or cost DO_WHILE loopCondition Check iteration count, token usage, or cost.
Undo side effects on failure failureWorkflow Compensation tasks run automatically on workflow failure.
Pause for human review HUMAN task Indefinite durable pause. Survives restarts and deploys.
Resume on external event HUMAN task + API/webhook External system calls Task Update API with payload.
Post-process structured output INLINE (JavaScript) or JSON_JQ_TRANSFORM Server-side transforms without a worker.
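As an example of that last row, a JSON_JQ_TRANSFORM task can reshape an LLM's structured output server-side, with no worker involved. The jq query below is illustrative:

```json
{
  "name": "extract_answer",
  "taskReferenceName": "extract",
  "type": "JSON_JQ_TRANSFORM",
  "inputParameters": {
    "plan": "${plan.output.result}",
    "queryExpression": ".plan | {answer: .answer, done: .done}"
  }
}
```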

Next steps