Automatically retry failed steps with configurable backoff strategies and exit code filtering.

Overview

AWF provides built-in retry functionality for steps and agent calls. When a step fails, you can configure AWF to automatically retry with exponential, linear, or constant backoff delays.

Common use cases:

  • Transient network errors (429, 502, 503 responses)
  • Intermittent service failures
  • Rate-limited API calls
  • Flaky shell commands

Basic Retry

The simplest retry configuration retries a step multiple times with default settings:

states:
  initial: fetch_data

  fetch_data:
    type: step
    command: curl https://api.example.com/data
    retry:
      max_attempts: 3  # Try 3 times total (default: 1 = no retry)
    on_success: done

  done:
    type: terminal
    status: success

With this configuration:

  1. curl executes
  2. If it fails (non-zero exit), AWF retries up to 2 more times
  3. Each retry executes immediately (no delay)
  4. If all 3 attempts fail, the step is considered failed
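
The four steps above can be sketched as a small loop. This is only an illustration of the described behaviour, not AWF's implementation; `run_with_retry` and the fake command are hypothetical names introduced for the example.

```python
from typing import Callable, Tuple


def run_with_retry(run: Callable[[], int], max_attempts: int = 1) -> Tuple[int, int]:
    """Call `run` until it returns exit code 0 or the attempts are used up.

    Returns (last_exit_code, attempts_used). There is no delay between
    attempts, mirroring a retry block that sets only max_attempts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exit_code = 0
    for attempt in range(1, max_attempts + 1):
        exit_code = run()
        if exit_code == 0:
            return exit_code, attempt  # success: stop retrying
    return exit_code, max_attempts      # every attempt failed


# Hypothetical command that fails twice, then succeeds on the third attempt.
results = iter([7, 7, 0])
print(run_with_retry(lambda: next(results), max_attempts=3))  # (0, 3)
```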

Adding Delays

Use initial_delay to add a delay before the first retry:

fetch_data:
  type: step
  command: curl https://api.example.com/data
  retry:
    max_attempts: 3
    initial_delay: 1s      # Wait 1 second before first retry
    backoff: constant      # Always wait 1 second between attempts
  on_success: done

Durations are written as Go duration strings:

  • 100ms — milliseconds
  • 1s — 1 second
  • 30s — 30 seconds
  • 1m30s — 1.5 minutes
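
To check what a duration string means before putting it in a workflow, a simplified parser along these lines may help. It is a sketch that handles only the ms, s, m, and h units (Go's own parser also accepts ns and us); `parse_duration` is a hypothetical helper, not part of AWF.

```python
import re

# Milliseconds per unit; integer factors keep the arithmetic exact.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a duration string such as '1m30s' to seconds."""
    pos, total_ms = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total_ms += float(match.group(1)) * _UNIT_MS[match.group(2)]
        pos = match.end()
    if pos == 0:  # empty string
        raise ValueError(f"invalid duration: {text!r}")
    return total_ms / 1000


print(parse_duration("1m30s"))  # 90.0
```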

Backoff Strategies

Constant Backoff

Retry with a fixed delay:

retry:
  max_attempts: 5
  initial_delay: 2s
  backoff: constant

Delays: 2s, 2s, 2s, 2s (always the same)

Linear Backoff

Delay increases linearly with each attempt:

retry:
  max_attempts: 5
  initial_delay: 1s
  backoff: linear

Delays: 1s, 2s, 3s, 4s (multiplied by attempt number)

Exponential Backoff

Delay increases exponentially (recommended for most use cases):

retry:
  max_attempts: 5
  initial_delay: 1s
  backoff: exponential
  multiplier: 2        # Double the delay each time (default: 2.0)

Delays: 1s, 2s, 4s, 8s (multiplied by 2 each time)

Using a different multiplier:

retry:
  max_attempts: 5
  initial_delay: 500ms
  backoff: exponential
  multiplier: 1.5      # Increase delay by 50% each time

Delays: 500ms, 750ms, 1.125s, 1.687s

Capping Maximum Delay

Prevent delays from growing too large with max_delay:

retry:
  max_attempts: 10
  initial_delay: 1s
  backoff: exponential
  multiplier: 2
  max_delay: 30s       # Never wait longer than 30 seconds

This configuration:

  • Starts with a 1 second delay
  • Doubles each time: 2s, 4s, 8s, 16s, then 30s (capped) for each of the remaining four retries

Important: Always specify max_delay to prevent excessively long delays in production.
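
The effect of the cap can be checked with a few lines of arithmetic. The sketch below assumes a delay is applied only between attempts, so 10 attempts produce 9 delays; `capped_delays` is a hypothetical helper used for illustration.

```python
def capped_delays(max_attempts, initial_delay, multiplier, max_delay):
    """Delays (in seconds) before each retry; there are max_attempts - 1 of them."""
    return [
        min(initial_delay * multiplier ** i, max_delay)
        for i in range(max_attempts - 1)
    ]


delays = capped_delays(max_attempts=10, initial_delay=1, multiplier=2, max_delay=30)
print(delays)       # [1, 2, 4, 8, 16, 30, 30, 30, 30]
print(sum(delays))  # 151
```

Summing the list gives the worst-case time spent waiting (here 151 seconds), which is a quick way to sanity-check a configuration before deploying it.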

Filtering Retryable Exit Codes

By default, AWF retries on any non-zero exit code. Use retryable_exit_codes to retry only specific failures:

deploy:
  type: step
  command: ./deploy.sh
  retry:
    max_attempts: 3
    initial_delay: 5s
    backoff: exponential
    retryable_exit_codes: [1, 22]  # Only retry on exit codes 1 and 22
  on_success: verify

With this configuration:

  • Exit code 1 (transient error) → retry
  • Exit code 22 (connection error) → retry
  • Exit code 5 (invalid config) → fail immediately, don’t retry

An empty array (the default) retries on any non-zero exit code:

retry:
  max_attempts: 3
  retryable_exit_codes: []        # Retry on any non-zero exit
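
The filtering rule described in this section amounts to a small predicate. The following is a sketch of that rule as stated here, not AWF's code; `should_retry` is a hypothetical name.

```python
from typing import Sequence


def should_retry(exit_code: int, retryable_exit_codes: Sequence[int] = ()) -> bool:
    """Decide whether a finished attempt qualifies for another try."""
    if exit_code == 0:
        return False  # success, nothing to retry
    if not retryable_exit_codes:
        return True   # empty list: every non-zero exit code is retryable
    return exit_code in retryable_exit_codes


print(should_retry(22, [1, 22]))  # True
print(should_retry(5, [1, 22]))   # False
print(should_retry(5, []))        # True
```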

Agent Step Retry

Retry agent steps the same way you retry command steps:

analyze:
  type: agent
  provider: claude
  prompt: "Analyze: {{.inputs.code}}"
  timeout: 120
  retry:
    max_attempts: 3
    initial_delay: 2s
    backoff: exponential
  on_success: done

HTTP Operation Retry

For HTTP operations (REST API calls), AWF retries based on status codes:

api_call:
  type: operation
  operation: http.request
  inputs:
    method: POST
    url: https://api.example.com/process
    body: "{{.inputs.data}}"
    retryable_status_codes: [429, 502, 503]  # Retry on rate limit or server error
  retry:
    max_attempts: 5
    initial_delay: 1s
    backoff: exponential
    multiplier: 2
    max_delay: 60s
  on_success: next

Complete Example: Reliable API Integration

This example shows a robust API integration with retries on both the HTTP call and the agent step, plus explicit failure handling:

name: reliable-api
version: "1.0.0"

inputs:
  - name: endpoint
    type: string
    required: true
    default: "https://api.example.com"

states:
  initial: fetch_with_retry

  fetch_with_retry:
    type: operation
    operation: http.request
    inputs:
      method: GET
      url: "{{.inputs.endpoint}}/data"
      timeout: 30
      retryable_status_codes: [429, 502, 503, 504]
    retry:
      max_attempts: 5
      initial_delay: 1s
      backoff: exponential
      multiplier: 2
      max_delay: 32s
    on_success: process
    on_failure:
      message: "API call failed after 5 attempts: {{.error.message}}"
      status: 3

  process:
    type: agent
    provider: claude
    prompt: "Process this JSON: {{.states.fetch_with_retry.output}}"
    retry:
      max_attempts: 2
      initial_delay: 2s
      backoff: constant
    on_success: done

  done:
    type: terminal
    status: success

Validation Rules

AWF validates retry configurations to catch mistakes early:

Rule                    Error
max_attempts < 1        max_attempts must be at least 1
initial_delay invalid   invalid initial_delay: expected duration string
max_delay invalid       invalid max_delay: expected duration string
Unknown backoff         invalid backoff strategy: use constant, linear, or exponential
jitter outside [0, 1]   jitter must be between 0.0 and 1.0
multiplier < 0          multiplier must be non-negative

Example error:

$ awf run my-workflow
ERROR validating workflow: step 'fetch': invalid max_attempts: 0
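
To pre-check a retry block before running a workflow, the rules in the table can be approximated as below. This is a rough sketch, not AWF's validator: `validate_retry` is a hypothetical function, the duration check is a simplified pattern, and the messages are copied from the table.

```python
import re

# Simplified duration pattern: "0" or one or more number+unit pairs.
_DURATION = re.compile(r"0|(\d+(\.\d+)?(ns|us|ms|s|m|h))+")
_STRATEGIES = {"constant", "linear", "exponential"}


def validate_retry(cfg: dict) -> list:
    """Return the error messages that apply to a retry mapping (empty if valid)."""
    errors = []
    if cfg.get("max_attempts", 1) < 1:
        errors.append("max_attempts must be at least 1")
    for field in ("initial_delay", "max_delay"):
        value = cfg.get(field)
        if value is not None and not _DURATION.fullmatch(str(value)):
            errors.append(f"invalid {field}: expected duration string")
    if cfg.get("backoff", "constant") not in _STRATEGIES:
        errors.append("invalid backoff strategy: use constant, linear, or exponential")
    if not 0.0 <= cfg.get("jitter", 0.0) <= 1.0:
        errors.append("jitter must be between 0.0 and 1.0")
    if cfg.get("multiplier", 2.0) < 0:
        errors.append("multiplier must be non-negative")
    return errors


print(validate_retry({"max_attempts": 0}))  # ['max_attempts must be at least 1']
```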

Common Patterns

Circuit Breaker (Give Up After Repeated Failures)

Use the on_failure transition to route to a fallback state once all retry attempts are exhausted:

deploy:
  type: step
  command: ./deploy.sh
  retry:
    max_attempts: 3
    initial_delay: 5s
    backoff: exponential
  on_success: verify
  on_failure: alert_ops

alert_ops:
  type: terminal
  message: "Deployment failed after 3 attempts. Manual intervention required."
  status: 2

Jitter (Randomize Delays to Avoid Thundering Herd)

For distributed systems where many clients retry simultaneously, add randomization:

retry:
  max_attempts: 5
  initial_delay: 1s
  backoff: exponential
  multiplier: 2
  jitter: 0.5              # Add ±50% randomness to each delay

This prevents multiple clients from retrying at exactly the same time, which can overwhelm the service.
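
One way to picture the jitter setting is shown below. The sketch assumes the randomness is drawn uniformly and symmetrically around the computed delay, which matches the "±50%" comment but is otherwise an assumption about AWF's behaviour; `jittered` is a hypothetical helper.

```python
import random


def jittered(delay: float, jitter: float, rng: random.Random) -> float:
    """Spread `delay` uniformly over [delay * (1 - jitter), delay * (1 + jitter)]."""
    return delay * (1.0 + rng.uniform(-jitter, jitter))


rng = random.Random(42)       # seeded only to make the demo repeatable
base = [1.0, 2.0, 4.0, 8.0]   # exponential delays from the config above
print([round(jittered(d, 0.5, rng), 2) for d in base])
```

With jitter: 0.5, a nominal 4s delay ends up somewhere between 2s and 6s, so clients that failed at the same moment spread their retries out.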

Escalating Delays

For critical operations, increase delays over multiple retries:

critical_task:
  type: step
  command: ./critical-operation.sh
  retry:
    max_attempts: 10
    initial_delay: 500ms
    backoff: exponential
    multiplier: 1.5
    max_delay: 5m           # Cap at 5 minutes
  on_success: done

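With multiplier: 1.5, the delays before each retry are:

  1. 500ms
  2. 750ms
  3. 1.125s
  4. 1.687s
  5. 2.531s … rising to about 12.8s before the tenth attempt

With only 10 attempts the delay never reaches the 5m cap; max_delay acts as a safety net in case max_attempts or multiplier is raised later.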

Troubleshooting

Retries Not Happening

Problem: Your step never retries even though it fails.

Causes:

  1. max_attempts not specified (defaults to 1 = no retry)
  2. Exit code not in retryable_exit_codes list

Solution:

# Add explicit retry configuration
retry:
  max_attempts: 3
  initial_delay: 1s

Delays Too Long

Problem: Retries take forever.

Causes:

  1. max_delay not specified on exponential backoff
  2. max_attempts set too high

Solution:

retry:
  max_attempts: 5          # Reasonable limit
  initial_delay: 1s
  backoff: exponential
  max_delay: 30s           # Always cap exponential backoff

Some Failures Not Retrying

Problem: Step fails on certain errors but doesn’t retry.

Causes:

  1. Exit code not in retryable_exit_codes list
  2. retryable_exit_codes too restrictive

Solution:

# Check which exit code your command produces
$ ./my-script.sh; echo "Exit code: $?"

# Then add it to retryable_exit_codes
retry:
  retryable_exit_codes: [1, 22, 35]

Reference

Retry Configuration

Field                 Type      Default    Description
max_attempts          int       1          Maximum number of attempts (1 = no retry)
initial_delay         duration  0          Delay before first retry
max_delay             duration  unlimited  Maximum delay between retries
backoff               string    constant   Strategy: constant, linear, exponential
multiplier            float     2.0        Multiplier for exponential backoff
jitter                float     0.0        Randomness factor (0.0-1.0)
retryable_exit_codes  array     all        Exit codes to retry (empty = all non-zero)

Backoff Formulas

Strategy     Formula
Constant     initial_delay
Linear       initial_delay × attempt_number
Exponential  initial_delay × multiplier^(attempt_number-1)
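
The three formulas translate directly into code. The sketch below is a plain transcription of the table, taking attempt_number to be the 1-based number of the attempt that just failed; `backoff_delay` is a hypothetical name, and jitter and max_delay are left out.

```python
def backoff_delay(strategy, initial_delay, attempt_number, multiplier=2.0):
    """Delay in seconds after failed attempt `attempt_number` (1-based)."""
    if strategy == "constant":
        return initial_delay
    if strategy == "linear":
        return initial_delay * attempt_number
    if strategy == "exponential":
        return initial_delay * multiplier ** (attempt_number - 1)
    raise ValueError(f"unknown backoff strategy: {strategy}")


for strategy in ("constant", "linear", "exponential"):
    print(strategy, [backoff_delay(strategy, 1, n) for n in range(1, 5)])
# constant [1, 1, 1, 1]
# linear [1, 2, 3, 4]
# exponential [1.0, 2.0, 4.0, 8.0]
```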

See Also