Task timeouts and retries
Practical recipes for making workers resilient. Each recipe is a complete task definition you can register with POST /api/metadata/taskdefs.
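A minimal sketch of registering a definition over HTTP, using only the Python standard library. The server address `http://localhost:8080` and the helper name `build_register_request` are assumptions for illustration. The body is sent as a JSON array of task definitions, which is how this endpoint is commonly called; verify that against your server version.

```python
import json
import urllib.request


def build_register_request(base_url, task_defs):
    """Build (but do not send) a POST request that registers task definitions."""
    body = json.dumps(task_defs).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/metadata/taskdefs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_register_request(
    "http://localhost:8080",  # assumed local server address
    [{"name": "call_payment_api", "ownerEmail": "payments@example.com", "retryCount": 6}],
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the payload easy to inspect before it reaches the server.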
Exponential backoff with a cap
Retries with exponential backoff for a task that calls an external API. The cap prevents the delay from growing indefinitely; jitter prevents multiple failing workers from hammering the API at the same time.
{
  "name": "call_payment_api",
  "ownerEmail": "payments@example.com",
  "retryCount": 6,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 2,
  "maxRetryDelaySeconds": 60,
  "backoffJitterMs": 3000,
  "responseTimeoutSeconds": 30,
  "timeoutSeconds": 600,
  "timeoutPolicy": "RETRY"
}
Delay schedule (retryDelaySeconds=2, maxRetryDelaySeconds=60, backoffJitterMs=3000):
| Retry | Base delay | After cap | Actual delay with jitter |
|---|---|---|---|
| 1 | 2s | 2s | 2.0 – 5.0s |
| 2 | 4s | 4s | 4.0 – 7.0s |
| 3 | 8s | 8s | 8.0 – 11.0s |
| 4 | 16s | 16s | 16.0 – 19.0s |
| 5 | 32s | 32s | 32.0 – 35.0s |
| 6 | 64s | 60s | 60.0 – 63.0s |
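The table can be reproduced with a few lines of arithmetic. This is a sketch of the formula implied by the rows above (double the base delay on each row, apply the cap, then add between 0 and `backoffJitterMs` of jitter), not Conductor's actual implementation; the function name `retry_delay_range` is hypothetical.

```python
def retry_delay_range(retry_number, retry_delay_seconds, max_retry_delay_seconds, backoff_jitter_ms):
    """Return (min_delay, max_delay) in seconds for a 1-based row number."""
    base = retry_delay_seconds * 2 ** (retry_number - 1)  # 2, 4, 8, ...
    capped = min(base, max_retry_delay_seconds)           # apply the cap
    # Assumption: jitter is added on top of the capped delay.
    return float(capped), capped + backoff_jitter_ms / 1000.0


schedule = [retry_delay_range(n, 2, 60, 3000) for n in range(1, 7)]
for n, (low, high) in enumerate(schedule, start=1):
    print(f"row {n}: {low:.1f} to {high:.1f}s")
```

Changing the three parameters in the list comprehension shows how a different cap or jitter window would reshape the schedule before you commit it to a task definition.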
Lease extension for long-running workers
responseTimeoutSeconds is the heartbeat window: if the worker doesn't report back within this duration, Conductor marks the task TIMED_OUT and retries it. For tasks that take longer than the heartbeat window, workers extend the lease by posting an IN_PROGRESS update with callbackAfterSeconds.
Task definition
{
  "name": "transcode_video",
  "ownerEmail": "media@example.com",
  "retryCount": 2,
  "retryLogic": "FIXED",
  "retryDelaySeconds": 10,
  "responseTimeoutSeconds": 30,
  "timeoutSeconds": 3600,
  "timeoutPolicy": "RETRY"
}
responseTimeoutSeconds: 30 — Conductor will reschedule the task if the worker is silent for 30 seconds.
timeoutSeconds: 3600 — the task itself can take up to 1 hour across all heartbeats.
Worker: extend the lease every 25 seconds
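from conductor.client.http.models import TaskResult

# Assumed to be defined elsewhere: conductor_client (a task API client exposing
# update_task) and the helpers video_chunks(), transcode_chunk(), upload_output().


def transcode_video(task):
    task_id = task.task_id
    workflow_id = task.workflow_instance_id
    # Materialize the chunks so progress can be reported as a fraction.
    chunks = list(video_chunks(task.input_data["file_url"]))

    for done, chunk in enumerate(chunks, start=1):
        transcode_chunk(chunk)

        # Extend the lease before responseTimeoutSeconds (30s) expires.
        # callbackAfterSeconds tells Conductor to leave this task invisible
        # in the queue for another 25s, resetting the response clock.
        # One heartbeat per chunk assumes each chunk finishes in under 30s.
        heartbeat = TaskResult(
            task_id=task_id,
            workflow_instance_id=workflow_id,
            status="IN_PROGRESS",
            callback_after_seconds=25,
            output_data={"progress": done / len(chunks)},
        )
        conductor_client.update_task(heartbeat)

    # Hypothetical helper: uploads the result and returns an object with .url
    upload_result = upload_output(task_id)
    return TaskResult(
        task_id=task_id,
        workflow_instance_id=workflow_id,
        status="COMPLETED",
        output_data={"output_url": upload_result.url},
    )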
What happens without a heartbeat:
t=0s Worker polls task → IN_PROGRESS
t=30s responseTimeoutSeconds expires → TIMED_OUT → retry scheduled
t=40s Worker finishes (too late, task already terminated)
What happens with a heartbeat every 25s:
t=0s Worker polls task → IN_PROGRESS
t=25s Worker: POST IN_PROGRESS, callbackAfterSeconds=25 → clock resets
t=50s Worker: POST IN_PROGRESS, callbackAfterSeconds=25 → clock resets
...
t=90s Worker: POST COMPLETED → task done
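The two timelines can be checked with a small model. This sketch encodes only the rule described in this section (the response clock starts when the task is polled and restarts on every heartbeat); it is not Conductor's scheduler, and the function name `first_timeout` is hypothetical. It expects a positive heartbeat interval.

```python
def first_timeout(work_seconds, response_timeout_seconds, heartbeat_interval_seconds=None):
    """Return the time at which the task would time out, or None if it never does."""
    last_report = 0  # the poll at t=0 starts the response clock
    if heartbeat_interval_seconds is not None:
        t = heartbeat_interval_seconds
        while t < work_seconds:
            if t - last_report >= response_timeout_seconds:
                # The heartbeat arrives too late; the clock already expired.
                return last_report + response_timeout_seconds
            last_report = t  # heartbeat accepted, clock restarts
            t += heartbeat_interval_seconds
    if work_seconds - last_report >= response_timeout_seconds:
        return last_report + response_timeout_seconds
    return None


print(first_timeout(40, 30))      # no heartbeat, 40s of work
print(first_timeout(90, 30, 25))  # heartbeat every 25s, 90s of work
```

The model also makes the main constraint visible: a heartbeat interval equal to or longer than `responseTimeoutSeconds` times out just as if no heartbeat were sent.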
Hard SLA with totalTimeoutSeconds
Use totalTimeoutSeconds when you need a guaranteed upper bound on how long a task can take across all of its retries. This is independent of retryCount — whichever limit is hit first wins.
{
  "name": "sync_crm_record",
  "ownerEmail": "crm@example.com",
  "retryCount": 20,
  "retryLogic": "FIXED",
  "retryDelaySeconds": 5,
  "totalTimeoutSeconds": 120,
  "responseTimeoutSeconds": 15,
  "timeoutPolicy": "TIME_OUT_WF"
}
retryCount: 20 — would normally allow 20 retries.
totalTimeoutSeconds: 120 — but if the 2-minute wall-clock budget is consumed first, no more retries are queued and the workflow is failed.
This is useful for SLA-sensitive tasks where you need to know that, regardless of transient failures, the workflow will either succeed or surface as failed within a bounded time window.
Timeline example, using a shorter budget for brevity (retryDelaySeconds=5, totalTimeoutSeconds=30) and assuming every attempt fails immediately:
t=0s Attempt 1 → FAILED
t=5s Attempt 2 → FAILED
t=10s Attempt 3 → FAILED
t=15s Attempt 4 → FAILED
t=20s Attempt 5 → FAILED
t=25s Attempt 6 → FAILED
t=30s totalTimeoutSeconds exceeded → workflow FAILED, no more retries
(attempts 2 to 6 used 5 retries, so 15 of the 20 in retryCount were still unused)
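To see which limit binds for a given configuration, the timeline can be generated rather than drawn by hand. This is a sketch under simplifying assumptions: fixed retry delay, every attempt fails after `attempt_seconds`, and no attempt starts once the budget is reached. The function name `attempt_start_times` is hypothetical.

```python
def attempt_start_times(retry_count, retry_delay_seconds, total_timeout_seconds, attempt_seconds=0):
    """Return the start time of every attempt that begins before either limit is hit."""
    starts = []
    t = 0
    max_attempts = retry_count + 1  # the first execution plus retry_count retries
    while len(starts) < max_attempts and t < total_timeout_seconds:
        starts.append(t)
        t += attempt_seconds + retry_delay_seconds  # fail, then wait out the delay
    return starts


print(attempt_start_times(20, 5, 30))        # the timeline above
print(len(attempt_start_times(20, 5, 120)))  # the recipe's own budget
```

With the recipe's own values and instantly failing attempts, this model exhausts `retryCount` at t=100s, before the 120-second budget. If each attempt runs for 5 seconds before failing (`attempt_seconds=5`), the budget becomes the binding limit after 12 attempts.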
Thundering herd prevention
When hundreds of tasks fail simultaneously (e.g., a downstream service goes down), all retries are scheduled at the same time. Without jitter, they all hit the recovering service at once. backoffJitterMs spreads them across a time window.
{
  "name": "send_webhook",
  "ownerEmail": "platform@example.com",
  "retryCount": 5,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 1,
  "maxRetryDelaySeconds": 30,
  "backoffJitterMs": 5000,
  "responseTimeoutSeconds": 10,
  "concurrentExecLimit": 200
}
With backoffJitterMs: 5000, 500 tasks that all fail at t=0 will make their first retry at uniformly random times between t=1s and t=6s, spreading the retry load across 5 seconds instead of hitting the service in a single burst. concurrentExecLimit: 200 additionally caps how many of these tasks can execute at once.
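The spreading effect is easy to simulate. The sketch below assumes the jitter is drawn uniformly between 0 and `backoffJitterMs` and added to the base delay, as described above; the function name `jittered_retry_times` and the fixed seed are illustrative choices, not part of Conductor.

```python
import random
from collections import Counter


def jittered_retry_times(task_count, base_delay_seconds, backoff_jitter_ms, seed=0):
    """Return one simulated first-retry time (in seconds) per failed task."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    jitter_seconds = backoff_jitter_ms / 1000.0
    return [base_delay_seconds + rng.uniform(0, jitter_seconds) for _ in range(task_count)]


times = jittered_retry_times(500, 1, 5000)
per_second = Counter(int(t) for t in times)  # retries landing in each 1s bucket
for second in sorted(per_second):
    print(f"t={second}s: {per_second[second]} retries")
```

Without jitter all 500 retries would land in the same bucket; with it, each one-second bucket receives only a fraction of them.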
Choosing the right combination
| Scenario | Recommended config |
|---|---|
| External API with rate limits | EXPONENTIAL_BACKOFF + maxRetryDelaySeconds + backoffJitterMs |
| Long-running processing job | responseTimeoutSeconds (short) + heartbeats from worker + timeoutSeconds (long) |
| SLA-bounded task | totalTimeoutSeconds + FIXED or EXPONENTIAL_BACKOFF |
| High fan-out with many concurrent failures | backoffJitterMs + concurrentExecLimit |
| Non-retryable error | Return FAILED_WITH_TERMINAL_ERROR from the worker |
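For the last row, the worker has to decide which failures are worth retrying. A minimal sketch of that decision follows; the exception class `InvalidInputError` and the function `status_for_error` are hypothetical, and which errors count as permanent is an application-level choice.

```python
class InvalidInputError(Exception):
    """Hypothetical error type: the request can never succeed as submitted."""


def status_for_error(error):
    """Map an exception to the task status the worker should report."""
    if isinstance(error, (InvalidInputError, PermissionError)):
        # Retrying cannot help, so skip the remaining retries.
        return "FAILED_WITH_TERMINAL_ERROR"
    # Anything else is treated as transient and left to the retry policy.
    return "FAILED"


print(status_for_error(InvalidInputError("missing customer id")))
print(status_for_error(TimeoutError("upstream timed out")))
```

Defaulting unknown errors to FAILED keeps the retry policy in charge unless the worker is sure the failure is permanent.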
See the Task Definition reference for all available parameters.