Retry policies

By default a failed run is final. A retry policy changes that: it is a named, reusable, account-global resource that decides whether a failed run is retried, how many times, how long apart, and on which failures. Reference a policy from a job and any matching failure spawns a fresh attempt; the retried runs carry trigger = RETRY (distinct from the manual rerun action). A job that references no policy is never retried, so a job that sets nothing behaves exactly as it did before retries existed.

A policy is identified by an id that you supply when you create it. Its attributes are:

| Field | Notes |
| --- | --- |
| name | Required, 1–200 characters. |
| max_retries | How many times a failed run is retried after the initial attempt: max_retries of 3 means up to 4 attempts in total. 0 disables retries. Range 0–10. |
| backoff | How the wait between retries grows: fixed waits delay_seconds before every retry; exponential doubles the wait each time (delay_seconds, then 2 × delay_seconds, 4 × delay_seconds, …), capped at max_delay_seconds. |
| delay_seconds | The wait before a retry, in seconds (≥ 1). For fixed it is the constant wait; for exponential it is the base wait that doubles each retry. |
| max_delay_seconds | The ceiling on the wait between retries, in seconds. Only valid with exponential backoff; omit it for fixed. |
| retry_on_timeout | Retry a run that did not complete within the job's timeout. Boolean; defaults to false. |
| retry_on_connection_error | Retry a run whose destination could not be reached (DNS, refused connection, TLS, or transport error). Boolean; defaults to false. |
| retry_statuses | Allowlist of response status patterns to retry when a run failed because the response did not match the job's success status. Each element is an exact 3-digit code ("429") or a class ("5xx"). Empty (the default) matches nothing. |
| retry_statuses_except | Patterns subtracted from retry_statuses, using the same syntax; except wins on overlap. Empty (the default) subtracts nothing. |
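
To make the backoff arithmetic concrete, here is a minimal Python sketch that lists the wait before each retry. The function name retry_delays is hypothetical, and the sketch assumes no jitter and no other adjustment by the service, neither of which this page specifies. It also treats a missing max_delay_seconds as "no cap", which is an assumption.

```python
def retry_delays(max_retries, backoff, delay_seconds, max_delay_seconds=None):
    """Return the wait in seconds before each retry, in order.

    Illustrative only: assumes the service applies exactly the documented
    rule (no jitter) and that an absent max_delay_seconds means no cap.
    """
    if backoff not in ("fixed", "exponential"):
        raise ValueError(f"unknown backoff: {backoff!r}")
    delays = []
    for retry_index in range(max_retries):
        if backoff == "fixed":
            # Fixed backoff: the same wait before every retry.
            wait = delay_seconds
        else:
            # Exponential backoff: the base wait doubles on each retry.
            wait = delay_seconds * 2 ** retry_index
            if max_delay_seconds is not None:
                wait = min(wait, max_delay_seconds)
        delays.append(wait)
    return delays


print(retry_delays(5, "exponential", 2, 60))  # [2, 4, 8, 16, 32]
print(retry_delays(3, "fixed", 10))           # [10, 10, 10]
```

With max_retries of 5, a base of 2 seconds, and a 60 second ceiling, the ceiling is never reached; a base of 5 seconds with 6 retries would give 5, 10, 20, 40, 60, 60.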

Each match field defaults to a neutral value (false or an empty list), so a field you omit does nothing and a policy retries exactly the failures you opt into:

  • retry_on_timeout and retry_on_connection_error toggle retries for a timed-out run and an unreachable destination, respectively.
  • retry_statuses is an allowlist of statuses to retry on a non-success response. Each element is either an exact 3-digit HTTP code ("429") or a status class written Nxx — one of "1xx", "2xx", "3xx", "4xx", "5xx". Statuses are strings, and class tokens are case-insensitive ("5XX" is stored as "5xx"). An empty list matches nothing, so nothing is retried on a non-success response.
  • retry_statuses_except subtracts from retry_statuses using the same exact-code-or-class syntax. When a status matches both lists, except wins and the run is not retried — so retry_statuses of ["5xx"] with retry_statuses_except of ["501"] retries every server error except 501. An element that does not overlap retry_statuses is allowed and simply has no effect.

A status is retried only when it matches retry_statuses and is not excluded by retry_statuses_except:

| Status pattern membership | Result |
| --- | --- |
| In retry_statuses, not in retry_statuses_except | Retried |
| In both lists | Not retried (except wins) |
| In neither list | Not retried (an allowlist match is required) |
| retry_statuses empty | No non-success response is retried |
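
The allowlist-minus-except rule can be expressed as a small predicate. The Python sketch below is only an illustration of the documented matching semantics, not the service's implementation; the helper names (normalize, matches, should_retry_status) are hypothetical.

```python
import re

# An exact 3-digit code ("429") or a status class ("1xx" through "5xx").
_PATTERN = re.compile(r"[1-5]xx|[0-9]{3}")


def normalize(pattern):
    """Lower-case a pattern and reject anything that is not a code or class."""
    token = pattern.lower()
    if not _PATTERN.fullmatch(token):
        raise ValueError(f"invalid status pattern: {pattern!r}")
    return token


def matches(status, patterns):
    """True if the numeric status matches any pattern in the list."""
    code = str(status)
    for pattern in patterns:
        token = normalize(pattern)
        if token == code:
            return True
        if token.endswith("xx") and token[0] == code[0]:
            return True
    return False


def should_retry_status(status, retry_statuses=(), retry_statuses_except=()):
    """Retry only if the status is allowlisted and not excluded."""
    return matches(status, retry_statuses) and not matches(
        status, retry_statuses_except
    )


print(should_retry_status(503, ["5xx"], ["501"]))  # True
print(should_retry_status(501, ["5xx"], ["501"]))  # False
print(should_retry_status(404, ["429", "5xx"]))    # False
```

This predicate covers only the status-based part of a policy; timeouts and connection errors are governed by their own boolean fields.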

Some failures are never retried regardless of the policy.

A policy is created with POST /api/v1/retry-policies, supplying your chosen id:

POST /api/v1/retry-policies
Content-Type: application/vnd.api+json
Authorization: Bearer <api-key>

{
  "data": {
    "id": "retry-on-5xx",
    "type": "retry_policy",
    "attributes": {
      "name": "Retry on server errors",
      "max_retries": 5,
      "backoff": "exponential",
      "delay_seconds": 2,
      "max_delay_seconds": 60,
      "retry_on_timeout": true,
      "retry_on_connection_error": true,
      "retry_statuses": ["429", "5xx"],
      "retry_statuses_except": ["501"]
    }
  }
}
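
For readers scripting this call, the same request can be assembled with Python's standard library. This is a sketch under assumptions: BASE_URL is a placeholder host (this page does not state one) and build_create_request is a hypothetical helper name. The function only builds the request; nothing is sent until it is passed to urllib.request.urlopen.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not from the docs


def build_create_request(api_key, policy_id, attributes):
    """Build (but do not send) the POST that creates a retry policy."""
    body = {
        "data": {
            "id": policy_id,          # the id you choose for the policy
            "type": "retry_policy",
            "attributes": attributes,
        }
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/retry-policies",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/vnd.api+json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_create_request(
    api_key="<api-key>",
    policy_id="retry-on-5xx",
    attributes={
        "name": "Retry on server errors",
        "max_retries": 5,
        "backoff": "exponential",
        "delay_seconds": 2,
        "max_delay_seconds": 60,
        "retry_on_timeout": True,
        "retry_on_connection_error": True,
        "retry_statuses": ["429", "5xx"],
        "retry_statuses_except": ["501"],
    },
)
print(request.get_method(), request.full_url)
```

The shape of the response is not described on this page, so the sketch stops at request construction.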

A suggested starting policy

If you're not sure where to begin, this policy retries the failures that are usually worth retrying (timeouts, unreachable destinations, rate limits (429), and server errors (5xx)) with exponential backoff. Copy it, give it an id and the required name, and adjust to taste; nothing here is a default, so you stay in full control.

{
  "max_retries": 5,
  "backoff": "exponential",
  "delay_seconds": 2,
  "max_delay_seconds": 60,
  "retry_on_timeout": true,
  "retry_on_connection_error": true,
  "retry_statuses": ["429", "5xx"],
  "retry_statuses_except": []
}

Migrating from retry_on

Previously, the match rule was a single retry_on object: { "statuses": [int], "reasons": [...] }. It has been replaced by the four match fields above. To translate an existing policy:

  • Integer statuses become string retry_statuses, which now also accept Nxx classes.
  • reasons: ["TIMEOUT"] becomes retry_on_timeout: true.
  • reasons: ["CONNECTION_ERROR"] becomes retry_on_connection_error: true.
  • The old catch-all reasons: ["NON_SUCCESS_STATUS"] (retry every non-success response) becomes the classes you want, e.g. retry_statuses: ["1xx", "3xx", "4xx", "5xx"].
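
The translation rules above could be scripted along these lines. This is a sketch, not an official migration tool: migrate_retry_on is a hypothetical name, the default set of classes for the old catch-all is only the example given above, and how the old model combined explicit statuses with NON_SUCCESS_STATUS is not described here, so the sketch simply keeps both.

```python
def migrate_retry_on(retry_on, non_success_classes=("1xx", "3xx", "4xx", "5xx")):
    """Translate a legacy retry_on object into the four match fields.

    non_success_classes is an assumption: pick the classes that count as
    non-success for your job.
    """
    reasons = set(retry_on.get("reasons", []))
    # Integer statuses become strings.
    statuses = [str(code) for code in retry_on.get("statuses", [])]
    if "NON_SUCCESS_STATUS" in reasons:
        # The old catch-all becomes explicit status classes.
        for status_class in non_success_classes:
            if status_class not in statuses:
                statuses.append(status_class)
    return {
        "retry_on_timeout": "TIMEOUT" in reasons,
        "retry_on_connection_error": "CONNECTION_ERROR" in reasons,
        "retry_statuses": statuses,
        "retry_statuses_except": [],
    }


legacy = {"statuses": [429, 503], "reasons": ["TIMEOUT"]}
print(migrate_retry_on(legacy))
```

The result contains only the match fields; name, max_retries, and the backoff settings carry over unchanged from the existing policy.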

| Endpoint | Purpose |
| --- | --- |
| POST /api/v1/retry-policies | Create a policy (caller-supplied id). |
| GET /api/v1/retry-policies | List policies (page paginated; filter[name] for a case-insensitive substring match on the name). |
| GET /api/v1/retry-policies/{id} | Read a policy. |
| PUT /api/v1/retry-policies/{id} | Replace a policy. |
| DELETE /api/v1/retry-policies/{id} | Soft-delete a policy. |

Updates follow the standard get-mutate-put pattern: read the policy, change the fields you want, and PUT the full representation back.
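
As a rough illustration of get-mutate-put, the Python sketch below reads a policy, merges a change into its attributes, and writes the full representation back. Several things here are assumptions: the http object is a hypothetical client exposing get(path) and put(path, body), InMemoryClient is a stand-in so the sketch runs without a network, and the read response is assumed to mirror the create body (data.attributes), which this page does not confirm.

```python
import copy


class InMemoryClient:
    """Hypothetical stand-in for an HTTP client, so the sketch runs offline."""

    def __init__(self, documents):
        self.documents = documents

    def get(self, path):
        return copy.deepcopy(self.documents[path])

    def put(self, path, body):
        self.documents[path] = copy.deepcopy(body)
        return body


def update_policy(http, policy_id, changes):
    """Read the policy, change selected attributes, PUT the whole thing back."""
    path = f"/api/v1/retry-policies/{policy_id}"
    document = http.get(path)                       # get
    attributes = dict(document["data"]["attributes"])
    attributes.update(changes)                      # mutate
    body = {
        "data": {
            "id": policy_id,
            "type": "retry_policy",
            "attributes": attributes,               # full representation
        }
    }
    return http.put(path, body)                     # put


client = InMemoryClient({
    "/api/v1/retry-policies/retry-on-5xx": {
        "data": {
            "id": "retry-on-5xx",
            "type": "retry_policy",
            "attributes": {"name": "Retry on server errors", "max_retries": 5},
        }
    }
})
updated = update_policy(client, "retry-on-5xx", {"max_retries": 3})
print(updated["data"]["attributes"])
```

Because PUT replaces the policy, sending only the changed field would drop the others; merging into the attributes that were read avoids that.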

Attaching a policy to a job

Set the job's retry_policy attribute to a policy id, and override it per environment with the same key inside an environments entry — exactly parallel to the per-environment schedule, timezone, and request-leaf overrides:

{
  "retry_policy": "retry-on-5xx",
  "environments": {
    "production": { "enabled": true },
    "staging":    { "enabled": true, "retry_policy": "no-retries" }
  }
}

An environment that omits retry_policy inherits the job's base retry_policy; if the base is also absent, the job references no policy and is never retried. A per-environment override must name a real policy id: there is no null override to opt a single environment out of retries. To make one environment not retry while another does, create a zero-retry policy (max_retries: 0) and reference it. In the example above, production inherits retry-on-5xx while staging uses the zero-retry no-retries policy.
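
The inheritance rule can be summarized in a few lines of Python. This is only a sketch of the resolution order described here; effective_retry_policy is a hypothetical helper, and the job is assumed to be a plain dict shaped like the JSON example.

```python
def effective_retry_policy(job, environment):
    """Return the policy id that applies in an environment, or None.

    A per-environment override wins; an omitted override falls back to the
    job's base retry_policy; None means the job is never retried there.
    """
    env_config = job.get("environments", {}).get(environment, {})
    # A null override is not a valid opt-out, so it is treated as omitted.
    return env_config.get("retry_policy") or job.get("retry_policy")


job = {
    "retry_policy": "retry-on-5xx",
    "environments": {
        "production": {"enabled": True},
        "staging": {"enabled": True, "retry_policy": "no-retries"},
    },
}
print(effective_retry_policy(job, "production"))  # retry-on-5xx
print(effective_retry_policy(job, "staging"))     # no-retries
```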