Best Practices
This guide covers production best practices for running Conductor as a durable execution engine at scale. Every recommendation here comes from real-world operational experience.
Idempotent workers
Conductor guarantees at-least-once task delivery. Network partitions, worker restarts, and response timeouts can all cause a task to be delivered more than once. Your workers must be idempotent — executing the same task twice should produce the same result without side effects.
Patterns for idempotency:
| Pattern | When to use |
|---|---|
| Idempotency key | Pass a unique key (e.g., workflowId + taskId) to downstream services. The service deduplicates on this key. |
| Upsert instead of insert | Use INSERT ... ON CONFLICT UPDATE or equivalent so repeated writes converge to the same state. |
| Check-then-act | Query current state before performing the action. Skip if already completed. |
| Idempotent HTTP methods | Prefer PUT over POST when the downstream API supports it. |
from conductor.client.worker.worker_task import worker_task
@worker_task(task_definition_name="charge_payment")
def charge_payment(workflow_id: str, task_id: str, amount: float, currency: str) -> dict:
idempotency_key = f"{workflow_id}-{task_id}"
# Check if this charge was already processed
existing = payment_gateway.get_charge(idempotency_key)
if existing:
return {"chargeId": existing.id, "status": "already_processed"}
charge = payment_gateway.create_charge(
amount=amount, currency=currency, idempotency_key=idempotency_key
)
return {"chargeId": charge.id, "status": "charged"}
The workflowId and taskId combination is unique per task execution attempt, making it an ideal idempotency key.
Timeout configuration
Every task definition should have explicit timeouts. A task without timeouts can block a workflow indefinitely.
The rule: responseTimeoutSeconds < timeoutSeconds. The response timeout detects unresponsive workers; the overall timeout enforces the SLA.
Recommended configurations
| Task pattern | responseTimeoutSeconds |
timeoutSeconds |
timeoutPolicy |
retryCount |
|---|---|---|---|---|
| API call (< 5s expected) | 10 | 30 | RETRY |
3 |
| ML inference | 120 | 300 | RETRY |
1 |
| Human approval | 0 (disabled) | 86400 | ALERT_ONLY |
0 |
| Batch processing | 600 | 3600 | TIME_OUT_WF |
0 |
| Quick data transform | 5 | 15 | RETRY |
3 |
Timeout policies
| Policy | Behavior | Use when |
|---|---|---|
RETRY |
Retries the task up to retryCount times. |
Transient failures are expected (network calls, external APIs). |
TIME_OUT_WF |
Fails the entire workflow immediately. | The task is critical and retrying won't help (e.g., expired batch window). |
ALERT_ONLY |
Marks the task as timed out but keeps the workflow running. | Human-in-the-loop tasks or tasks with external completion signals. |
Warning
Setting responseTimeoutSeconds to 0 disables the response timeout. Only do this for tasks that are completed externally (e.g., WAIT or Human tasks).
See Task Definitions for the full parameter reference.
Payload management
Conductor stores task inputs and outputs in its database. Large payloads degrade performance and increase storage costs.
Size guidelines
| Payload | Recommended limit | Hard limit (configurable) |
|---|---|---|
| Task input | < 64 KB | 1 MB |
| Task output | < 64 KB | 1 MB |
| Workflow input | < 64 KB | 1 MB |
External payload storage
For payloads exceeding 64 KB, use external payload storage. Conductor supports S3 out of the box:
{
"conductor.external-payload-storage.type": "s3",
"conductor.external-payload-storage.s3.bucket-name": "my-conductor-payloads",
"conductor.external-payload-storage.s3.region": "us-east-1",
"conductor.external-payload-storage.s3.signed-url-expiration-seconds": 300
}
Do's and don'ts
| Do | Don't |
|---|---|
| Return only data that downstream tasks need. | Dump entire API responses into task output. |
| Store large files in S3/GCS and pass the URI. | Pass file contents as base64 in payloads. |
Use inputTemplate to set default values on the task definition. |
Duplicate static config in every workflow definition. |
| Keep payload keys flat and descriptive. | Nest payloads 5 levels deep with ambiguous keys. |
Workflow design
Small, focused tasks over monolithic workers
Break work into small tasks that each do one thing. This gives you:
- Granular retries — only the failed step retries, not the entire pipeline.
- Reusability — small tasks compose into different workflows.
- Visibility — each step is independently observable in the Conductor UI.
Sub-workflows vs inline tasks
| Approach | When to use |
|---|---|
| Sub-workflow | Reusable logic shared across multiple parent workflows. Independently versioned and testable. |
| Inline tasks in a single workflow | Logic specific to one workflow. Fewer indirections to debug. |
Use sub-workflows when a group of tasks represents a bounded business capability (e.g., "process payment", "send notification bundle"). Don't create sub-workflows for a single task — the overhead isn't worth it.
DYNAMIC_FORK vs sequential loops
| Pattern | When to use |
|---|---|
| DYNAMIC_FORK | Process N items in parallel. Use when items are independent and parallelism improves throughput. |
| DO_WHILE | Process items sequentially when ordering matters or a shared resource requires serialization. |
Tip
Keep DYNAMIC_FORK fan-out under 500 concurrent tasks per workflow. Beyond that, consider batching items into chunks and forking over the chunks.
Worker scaling
Workers are stateless and scale horizontally. Tune these parameters to match your workload.
Polling interval
The polling interval controls how frequently workers check for new tasks. Shorter intervals reduce latency; longer intervals reduce server load.
| Workload | Recommended polling interval |
|---|---|
| Low-latency (< 1s SLA) | 100-250 ms |
| Standard processing | 500 ms - 1s |
| Background / batch | 5-10s |
Thread pool sizing
Each worker instance runs a configurable number of polling threads. Start with:
For example, 100 tasks/sec with 2s average execution across 5 instances: (100 * 2) / 5 = 40 threads per instance.
Rate limiting and concurrency
Use task definition settings to protect downstream services:
{
"name": "call_external_api",
"rateLimitPerFrequency": 50,
"rateLimitFrequencyInSeconds": 1,
"concurrentExecLimit": 20
}
This limits the task to 50 executions per second globally, with at most 20 running concurrently.
Domain isolation
Use task domains to route tasks to specific worker pools. Common use cases:
- Environment isolation — dev workers only pick up dev tasks.
- Priority lanes — premium customers routed to dedicated capacity.
- Regional affinity — route tasks to workers closest to the data.
See Scaling Workers for more detail.
Error handling patterns
Retries vs terminal failure
By default, a failed task is retried according to retryCount and retryLogic (FIXED, EXPONENTIAL_BACKOFF, or LINEAR_BACKOFF). For errors that should not be retried, set the task status to FAILED_WITH_TERMINAL_ERROR:
from conductor.client.http.models import TaskResult, TaskResultStatus
@worker_task(task_definition_name="validate_order")
def validate_order(order_id: str, items: list) -> TaskResult:
if not items:
result = TaskResult()
result.status = TaskResultStatus.FAILED_WITH_TERMINAL_ERROR
result.reason_for_incompletion = "Order has no items — not retryable"
return result
# ... validation logic
return {"valid": True}
| Error type | Strategy |
|---|---|
| Transient (network timeout, 503) | Let Conductor retry with backoff. |
| Client error (400, validation failure) | Return FAILED_WITH_TERMINAL_ERROR. |
| Partial failure in batch | Return partial results as output; use workflow logic to handle remainder. |
Compensation and saga patterns
For workflows that span multiple services, design compensation tasks to undo completed steps when a later step fails.
Forward compensation — Fix the problem and continue. Use a SWITCH after the failed task to route to a recovery path.
Backward compensation — Undo completed work in reverse order. Model this as a separate workflow triggered by the failure workflow mechanism:
- The main workflow fails at step 3.
- Conductor invokes the configured
failureWorkflow. - The failure workflow runs compensating tasks: undo step 2, then undo step 1.
Tip
Store compensation metadata (transaction IDs, resource handles) in each task's output so the failure workflow has everything it needs to roll back.
Versioning and deployments
Conductor supports workflow versioning natively. Use this for safe deployments.
Blue-green with versions
- Deploy workflow version N+1 with your changes.
- Start new executions on version N+1.
- Let existing version N executions drain to completion.
- Once all version N executions are complete, deprecate or remove it.
Migrating running executions
Running workflows continue on the version they were started with. You cannot migrate a running execution to a new version. Plan for this:
- Short-lived workflows — Wait for drain. Most complete within minutes.
- Long-running workflows — If a critical fix is needed, terminate and restart on the new version. Use the Terminate API with a reason, then re-trigger.
Safe rollback
If version N+1 has issues:
- Stop starting new executions on N+1 (route traffic back to N).
- Let N+1 executions fail or terminate them.
- Resume on version N, which was never modified.
Because workers are decoupled from workflow definitions, you can roll back the workflow version independently of worker deployments.
Monitoring
Track these metrics to maintain healthy Conductor operations:
| Metric | What it tells you | Alert threshold |
|---|---|---|
| Task queue depth | Backlog of unprocessed tasks. | Growing consistently over 5 minutes. |
| Task poll count (per task type) | Whether workers are actively polling. | Drops to zero. |
| Workflow failure rate | Percentage of workflows ending in FAILED state. | > 5% over a 15-minute window. |
| Task response time (p99) | How close workers are to the response timeout. | > 80% of responseTimeoutSeconds. |
| Worker thread utilization | Whether workers are saturated. | > 90% sustained for 10 minutes. |
| External payload storage errors | S3/GCS write failures blocking tasks. | Any non-zero count. |
See Monitoring and Scaling Workers for built-in monitoring tools.