Task timeouts and retries

Practical recipes for making workers resilient. Each recipe is a complete task definition you can register with POST /api/metadata/taskdefs.
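
As a rough sketch of the registration step, the snippet below builds the POST request with Python's standard library. The server address `http://localhost:8080`, the `example_task` definition, and the helper name `build_register_request` are assumptions for illustration; the sketch also assumes the endpoint accepts a JSON array of task definitions, so check this against your Conductor version.

```python
import json
import urllib.request

CONDUCTOR_URL = "http://localhost:8080"  # assumed address of a local Conductor server


def build_register_request(task_defs, base_url=CONDUCTOR_URL):
    """Build (but do not send) a POST request that registers task definitions."""
    body = json.dumps(task_defs).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/metadata/taskdefs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical minimal definition; substitute any recipe from this page.
example_def = {"name": "example_task", "ownerEmail": "team@example.com", "retryCount": 3}
request = build_register_request([example_def])

# To actually send it against a running server:
# urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before it reaches the server.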

Exponential backoff with a cap

This recipe retries a task that calls an external API, using exponential backoff. The cap prevents the delay from growing indefinitely; jitter prevents multiple failing workers from hammering the API at the same time.

{
  "name": "call_payment_api",
  "ownerEmail": "payments@example.com",
  "retryCount": 6,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 2,
  "maxRetryDelaySeconds": 60,
  "backoffJitterMs": 3000,
  "responseTimeoutSeconds": 30,
  "timeoutSeconds": 600,
  "timeoutPolicy": "RETRY"
}

Delay before each retry (retryDelaySeconds=2, maxRetryDelaySeconds=60, backoffJitterMs=3000):

Retry	Base delay	After cap	Actual range (with jitter)
1	2s	2s	2.0 – 5.0s
2	4s	4s	4.0 – 7.0s
3	8s	8s	8.0 – 11.0s
4	16s	16s	16.0 – 19.0s
5	32s	32s	32.0 – 35.0s
6	64s	60s	60.0 – 63.0s
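
The schedule above can be reproduced with a few lines of Python. This is a sketch of the arithmetic implied by the table, not Conductor's actual implementation: it assumes the base delay doubles on every retry, the cap is applied before jitter, and jitter adds between 0 and backoffJitterMs on top. The function name `retry_delay_range` is made up for this example.

```python
def retry_delay_range(retry_number, retry_delay_seconds,
                      max_retry_delay_seconds, backoff_jitter_ms):
    """Return (min_delay, max_delay) in seconds for a 1-based retry number."""
    # Base delay doubles with each retry: 2s, 4s, 8s, ...
    base = retry_delay_seconds * 2 ** (retry_number - 1)
    # The cap is applied before jitter is added (assumption taken from the table).
    capped = min(base, max_retry_delay_seconds)
    # Jitter adds anywhere from 0 to backoff_jitter_ms milliseconds.
    return float(capped), capped + backoff_jitter_ms / 1000.0


schedule = [retry_delay_range(n, 2, 60, 3000) for n in range(1, 7)]
for n, (low, high) in enumerate(schedule, start=1):
    print(f"retry {n}: {low:.1f} - {high:.1f}s")
```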

Lease extension for long-running workers

responseTimeoutSeconds is the heartbeat window: if the worker doesn't report back within this duration, Conductor marks the task TIMED_OUT and retries it. For tasks that take longer than the heartbeat window, workers extend the lease by posting an IN_PROGRESS update with callbackAfterSeconds.

Task definition

{
  "name": "transcode_video",
  "ownerEmail": "media@example.com",
  "retryCount": 2,
  "retryLogic": "FIXED",
  "retryDelaySeconds": 10,
  "responseTimeoutSeconds": 30,
  "timeoutSeconds": 3600,
  "timeoutPolicy": "RETRY"
}

responseTimeoutSeconds: 30 — Conductor will reschedule the task if the worker is silent for 30 seconds. timeoutSeconds: 3600 — the task itself can take up to 1 hour across all heartbeats.
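
Before deploying a worker, it can help to check that the three durations nest correctly: heartbeat interval, then responseTimeoutSeconds, then timeoutSeconds. The helper below is a hypothetical sanity check written for this page, not part of any Conductor SDK.

```python
def lease_problems(task_def, heartbeat_interval_seconds):
    """Return a list of human-readable problems with the timeout settings."""
    problems = []
    response_timeout = task_def["responseTimeoutSeconds"]
    overall_timeout = task_def.get("timeoutSeconds", 0)  # 0 is treated as "not set"

    if heartbeat_interval_seconds >= response_timeout:
        problems.append("heartbeat interval must be shorter than responseTimeoutSeconds")
    if overall_timeout and response_timeout >= overall_timeout:
        problems.append("responseTimeoutSeconds must be shorter than timeoutSeconds")
    return problems


transcode_def = {
    "name": "transcode_video",
    "responseTimeoutSeconds": 30,
    "timeoutSeconds": 3600,
}
print(lease_problems(transcode_def, heartbeat_interval_seconds=25))
```

A 25-second heartbeat against a 30-second window leaves 5 seconds of slack for network latency; a heartbeat interval equal to the window would be reported as a problem.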

Worker: extend the lease every 25 seconds

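from conductor.client.http.models import TaskResult

# Assumed to be defined elsewhere in the worker module:
#   conductor_client       - a configured task client exposing update_task()
#   video_chunks(url)      - yields the chunks of the source video
#   transcode_chunk(chunk) - transcodes a single chunk
#   upload_output(url)     - hypothetical helper that uploads the transcoded
#                            file and returns an object with a .url attribute

def transcode_video(task):
    task_id = task.task_id
    workflow_id = task.workflow_instance_id

    # Materialize the chunks so the total is known for progress reporting.
    chunks = list(video_chunks(task.input_data["file_url"]))

    for done, chunk in enumerate(chunks, start=1):
        transcode_chunk(chunk)

        # Extend the lease before responseTimeoutSeconds (30s) expires.
        # callbackAfterSeconds tells Conductor to leave this task invisible
        # in the queue for another 25s, resetting the response clock.
        # This assumes each chunk transcodes in well under 30s.
        heartbeat = TaskResult(
            task_id=task_id,
            workflow_instance_id=workflow_id,
            status="IN_PROGRESS",
            callback_after_seconds=25,
            output_data={"progress": done / len(chunks)}
        )
        conductor_client.update_task(heartbeat)

    upload_result = upload_output(task.input_data["file_url"])

    return TaskResult(
        task_id=task_id,
        workflow_instance_id=workflow_id,
        status="COMPLETED",
        output_data={"output_url": upload_result.url}
    )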

What happens without a heartbeat:

t=0s   Worker polls task → IN_PROGRESS
t=30s  responseTimeoutSeconds expires → TIMED_OUT → retry scheduled
t=40s  Worker finishes (too late, task already terminated)

What happens with a heartbeat every 25s:

t=0s   Worker polls task → IN_PROGRESS
t=25s  Worker: POST IN_PROGRESS, callbackAfterSeconds=25 → clock resets
t=50s  Worker: POST IN_PROGRESS, callbackAfterSeconds=25 → clock resets
...
t=90s  Worker: POST COMPLETED → task done
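
Both timelines can be checked with a small simulation. The sketch below models only the response clock described in this section (each heartbeat resets it) and ignores retries and timeoutSeconds; the function name and the rule that a heartbeat arriving exactly at the deadline still counts are assumptions.

```python
def first_response_timeout(response_timeout_seconds, heartbeat_times, finish_time):
    """Return the time the response clock expires, or None if the worker finishes first."""
    deadline = response_timeout_seconds  # clock starts when the worker polls at t=0
    for t in sorted(heartbeat_times):
        if t >= finish_time:
            break  # heartbeats after completion are irrelevant
        if t > deadline:
            return deadline  # the clock expired before this heartbeat arrived
        deadline = t + response_timeout_seconds  # heartbeat resets the clock
    return deadline if finish_time > deadline else None


without_heartbeat = first_response_timeout(30, [], finish_time=40)
with_heartbeat = first_response_timeout(30, [25, 50, 75], finish_time=90)
print(without_heartbeat, with_heartbeat)
```

Without heartbeats the clock expires at t=30s, before the worker finishes at t=40s. With a heartbeat every 25 seconds the deadline keeps moving ahead of the worker, so the function returns None.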

Hard SLA with `totalTimeoutSeconds`

Use totalTimeoutSeconds when you need a guaranteed upper bound on how long a task can take across all of its retries. This is independent of retryCount — whichever limit is hit first wins.

{
  "name": "sync_crm_record",
  "ownerEmail": "crm@example.com",
  "retryCount": 20,
  "retryLogic": "FIXED",
  "retryDelaySeconds": 5,
  "totalTimeoutSeconds": 120,
  "responseTimeoutSeconds": 15,
  "timeoutPolicy": "TIME_OUT_WF"
}

retryCount: 20 — would normally allow 20 retries. totalTimeoutSeconds: 120 — but if the 2-minute wall-clock budget is consumed first, no more retries are queued and the workflow is failed.

This is useful for SLA-sensitive tasks where you need to know that, regardless of transient failures, the workflow will either succeed or surface as failed within a bounded time window.

Timeline example (retryDelaySeconds=5, totalTimeoutSeconds=30):

t=0s   Attempt 1 → FAILED
t=5s   Attempt 2 → FAILED
t=10s  Attempt 3 → FAILED
t=15s  Attempt 4 → FAILED
t=20s  Attempt 5 → FAILED
t=25s  Attempt 6 → FAILED
t=30s  totalTimeoutSeconds exceeded → workflow FAILED, no more retries
        (with retryCount=20, 15 retries were still unused)
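
The interaction between retryCount and totalTimeoutSeconds can be sketched as a loop that stops at whichever limit is reached first. This is an illustrative model, not Conductor's scheduler: it assumes each attempt fails immediately unless `attempt_duration_seconds` is given, and that no attempt starts once the budget is fully consumed.

```python
def attempt_start_times(retry_count, retry_delay_seconds, total_timeout_seconds,
                        attempt_duration_seconds=0):
    """Return the start time of every attempt that fits within both limits."""
    starts = []
    t = 0
    max_attempts = retry_count + 1  # the initial attempt plus retry_count retries
    while len(starts) < max_attempts and t < total_timeout_seconds:
        starts.append(t)
        # The next attempt starts after this one fails and the fixed delay passes.
        t += attempt_duration_seconds + retry_delay_seconds
    return starts


starts = attempt_start_times(retry_count=20, retry_delay_seconds=5,
                             total_timeout_seconds=30)
retries_used = len(starts) - 1
print(starts, retries_used)
```

With the timeline's values the model yields six attempts, so only 5 of the 20 permitted retries run before the time budget ends. Passing a non-zero `attempt_duration_seconds` shows how slow failures reduce the number of attempts further.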

Thundering herd prevention

When hundreds of tasks fail simultaneously (e.g., a downstream service goes down), all retries are scheduled at the same time. Without jitter, they all hit the recovering service at once. backoffJitterMs spreads them across a time window.

{
  "name": "send_webhook",
  "ownerEmail": "platform@example.com",
  "retryCount": 5,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 1,
  "maxRetryDelaySeconds": 30,
  "backoffJitterMs": 5000,
  "responseTimeoutSeconds": 10,
  "concurrentExecLimit": 200
}

With backoffJitterMs: 5000, 500 tasks that all fail at t=0 schedule their first retry at uniformly random times between t=1s and t=6s (the 1s base delay plus up to 5s of jitter). The retry load is spread across 5 seconds instead of hitting the service in a single burst.
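
To get a feel for the spread, the sketch below simulates the first retry of 500 tasks and counts how many land in each one-second bucket. It assumes jitter is drawn uniformly between 0 and backoffJitterMs, as described above; the function name and the fixed seed are choices made for this example.

```python
import random
from collections import Counter


def first_retry_times(task_count, retry_delay_seconds, backoff_jitter_ms, seed=0):
    """Simulate when each task's first retry fires, in seconds after the failure."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    jitter_seconds = backoff_jitter_ms / 1000.0
    return [retry_delay_seconds + rng.uniform(0, jitter_seconds)
            for _ in range(task_count)]


times = first_retry_times(task_count=500, retry_delay_seconds=1, backoff_jitter_ms=5000)
per_second = Counter(int(t) for t in times)
for second in sorted(per_second):
    print(f"t={second}s-{second + 1}s: {per_second[second]} retries")
```

Setting `backoff_jitter_ms=0` in the same simulation puts all 500 retries at exactly t=1s, which is the burst that jitter is meant to avoid.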

Choosing the right combination

Scenario	Recommended config
External API with rate limits	`EXPONENTIAL_BACKOFF` + `maxRetryDelaySeconds` + `backoffJitterMs`
Long-running processing job	`responseTimeoutSeconds` (short) + heartbeats from worker + `timeoutSeconds` (long)
SLA-bounded task	`totalTimeoutSeconds` + `FIXED` or `EXPONENTIAL_BACKOFF`
High fan-out with many concurrent failures	`backoffJitterMs` + `concurrentExecLimit`
Non-retryable error	Return `FAILED_WITH_TERMINAL_ERROR` from the worker
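
For the last row, a worker has to decide which failures are worth retrying. The sketch below shows one possible policy for a task that calls an HTTP API; the set of retryable status codes and the function name are assumptions for illustration, and the right split depends on the downstream service.

```python
# HTTP statuses treated as transient in this sketch (an assumption, not a Conductor rule).
RETRYABLE_HTTP_STATUSES = {408, 429, 500, 502, 503, 504}


def task_status_for_http_error(http_status):
    """Map a failed HTTP response to the task status the worker should report."""
    if http_status in RETRYABLE_HTTP_STATUSES:
        return "FAILED"  # Conductor applies the task's retry settings
    if 400 <= http_status < 500:
        # Client errors such as 400 or 404 will not succeed on retry.
        return "FAILED_WITH_TERMINAL_ERROR"
    return "FAILED"  # unknown failures default to retryable


for code in (404, 429, 503):
    print(code, task_status_for_http_error(code))
```

Defaulting unknown failures to FAILED keeps the worker on the safe side: an unnecessary retry is usually cheaper than abandoning a task that would have succeeded.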

See the Task Definition reference for all available parameters.