Retry Configuration Guide
Automatically retry failed steps with configurable backoff strategies and exit code filtering.
Overview
AWF provides built-in retry functionality for steps and agent calls. When a step fails, you can configure AWF to automatically retry with exponential, linear, or constant backoff delays.
Common use cases:
- Transient network errors (429, 502, 503 responses)
- Intermittent service failures
- Rate-limited API calls
- Flaky shell commands
Basic Retry
The simplest retry configuration retries a step multiple times with default settings:
    states:
      initial: fetch_data
      fetch_data:
        type: step
        command: curl https://api.example.com/data
        retry:
          max_attempts: 3  # Try 3 times total (default: 1 = no retry)
        on_success: done
      done:
        type: terminal
        status: success

With this configuration:
- curl executes
- If it fails (non-zero exit), AWF retries up to 2 more times
- Each retry executes immediately (no delay)
- If all 3 attempts fail, the step is considered failed
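The attempt counting described above can be sketched in Python. This is an illustration of the documented behavior, not AWF's implementation: `run_with_retry` and its `run` callback are hypothetical names, and the callback stands in for executing the step's command and returning its exit code.

```python
from typing import Callable, Tuple


def run_with_retry(run: Callable[[], int], max_attempts: int = 1) -> Tuple[int, int]:
    """Call `run` until it returns 0 or `max_attempts` is exhausted.

    Returns (last_exit_code, attempts_used). No delay is applied between
    attempts, mirroring a retry block that sets only max_attempts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exit_code = 0
    for attempt in range(1, max_attempts + 1):
        exit_code = run()
        if exit_code == 0:
            return exit_code, attempt  # success: stop retrying
    return exit_code, max_attempts  # every attempt failed


# Hypothetical command that fails twice, then succeeds on the third attempt.
results = iter([7, 7, 0])
print(run_with_retry(lambda: next(results), max_attempts=3))  # prints (0, 3)
```

Note that `max_attempts` counts the first execution, so `max_attempts: 3` means one initial run plus at most two retries.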
Adding Delays
Use initial_delay to add a delay before the first retry:
    fetch_data:
      type: step
      command: curl https://api.example.com/data
      retry:
        max_attempts: 3
        initial_delay: 1s  # Wait 1 second before first retry
        backoff: constant  # Always wait 1 second between attempts
      on_success: done

Durations are written as Go duration strings:
- 100ms: 100 milliseconds
- 1s: 1 second
- 30s: 30 seconds
- 1m30s: 1.5 minutes
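To make the duration format concrete, here is a rough Python sketch that converts such strings to seconds. It assumes the unit suffixes accepted by Go's `time.ParseDuration` (ns, us, µs, ms, s, m, h) and handles non-negative values only; `parse_duration` is a hypothetical helper, not part of AWF.

```python
import re

# Seconds per unit, for the suffixes Go's time.ParseDuration accepts.
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# One number (optionally fractional) followed by one unit suffix.
_TOKEN = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|µs|ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a Go-style duration string such as '1m30s' to seconds."""
    if text == "0":
        return 0.0  # a bare zero is the only value allowed without a unit
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        number, unit = match.groups()
        total += float(number) * _UNIT_SECONDS[unit]
        pos = match.end()
    if pos == 0:
        raise ValueError(f"invalid duration: {text!r}")  # empty string
    return total


for sample in ("100ms", "1s", "30s", "1m30s"):
    print(sample, "->", parse_duration(sample), "seconds")
```

A value such as `5` with no unit is rejected, which matches Go's behavior of requiring a unit on every non-zero component.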
Backoff Strategies
Constant Backoff
Retry with a fixed delay:
    retry:
      max_attempts: 5
      initial_delay: 2s
      backoff: constant

Delays: 2s, 2s, 2s, 2s (always the same)
Linear Backoff
Delay increases linearly with each attempt:
    retry:
      max_attempts: 5
      initial_delay: 1s
      backoff: linear

Delays: 1s, 2s, 3s, 4s (multiplied by attempt number)
Exponential Backoff
Delay increases exponentially (recommended for most use cases):
    retry:
      max_attempts: 5
      initial_delay: 1s
      backoff: exponential
      multiplier: 2  # Double the delay each time (default: 2.0)

Delays: 1s, 2s, 4s, 8s (multiplied by 2 each time)
Using a different multiplier:
    retry:
      max_attempts: 5
      initial_delay: 500ms
      backoff: exponential
      multiplier: 1.5  # Increase delay by 50% each time

Delays: 500ms, 750ms, 1.125s, 1.6875s
Capping Maximum Delay
Prevent delays from growing too large with max_delay:
    retry:
      max_attempts: 10
      initial_delay: 1s
      backoff: exponential
      multiplier: 2
      max_delay: 30s  # Never wait longer than 30 seconds

This configuration:
- Starts with a 1 second delay
- Doubles each time: 2s, 4s, 8s, 16s, then 30s (capped) for each of the remaining four retries
Important: Always specify max_delay to prevent excessively long delays in production.
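To see what the cap buys you, the following Python sketch computes the worst-case wait for a retry block with and without `max_delay`. It assumes the exponential formula from the reference table in this guide and one delay before each retry (so `max_attempts - 1` delays in total); `capped_exponential_delays` is a hypothetical helper, not an AWF function.

```python
from typing import List, Optional


def capped_exponential_delays(
    max_attempts: int,
    initial_delay: float,
    multiplier: float = 2.0,
    max_delay: Optional[float] = None,
) -> List[float]:
    """Return the delay in seconds before each retry (max_attempts - 1 values)."""
    delays = []
    for retry in range(1, max_attempts):
        delay = initial_delay * multiplier ** (retry - 1)
        if max_delay is not None:
            delay = min(delay, max_delay)  # apply the cap
        delays.append(delay)
    return delays


capped = capped_exponential_delays(10, 1.0, 2.0, max_delay=30.0)
uncapped = capped_exponential_delays(10, 1.0, 2.0)
print(capped)         # nine delays, none above 30.0
print(sum(capped))    # worst-case total wait with the cap
print(sum(uncapped))  # worst-case total wait without it
```

Under these assumptions the capped schedule waits 151 seconds in total across nine retries, against 511 seconds uncapped. The gap widens quickly as `max_attempts` or `multiplier` grows.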
Filtering Retryable Exit Codes
By default, AWF retries on any non-zero exit code. Use retryable_exit_codes to retry only specific failures:
    deploy:
      type: step
      command: ./deploy.sh
      retry:
        max_attempts: 3
        initial_delay: 5s
        backoff: exponential
        retryable_exit_codes: [1, 22]  # Only retry on exit codes 1 and 22
      on_success: verify

With this configuration:
- Exit code 1 (transient error) → retry
- Exit code 22 (connection error) → retry
- Exit code 5 (invalid config) → fail immediately, don’t retry
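The retry decision for exit codes can be summarized in a few lines of Python. This is a sketch of the rules as this guide describes them, not AWF's source; `should_retry` and its parameters are hypothetical names, and an empty `retryable_exit_codes` is treated as "retry any non-zero code".

```python
from typing import Sequence


def should_retry(
    exit_code: int,
    attempt: int,
    max_attempts: int,
    retryable_exit_codes: Sequence[int] = (),
) -> bool:
    """Decide whether a step that just finished attempt `attempt` runs again."""
    if exit_code == 0:
        return False  # success, nothing to retry
    if attempt >= max_attempts:
        return False  # attempts exhausted
    if not retryable_exit_codes:
        return True   # no filter: every non-zero exit is retryable
    return exit_code in retryable_exit_codes


# Same filter as the deploy step: only 1 and 22 are retryable.
for code in (1, 22, 5):
    print(code, should_retry(code, attempt=1, max_attempts=3,
                             retryable_exit_codes=[1, 22]))
```

Checking the attempt budget before the filter means a retryable exit code on the final attempt still fails the step.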
Empty array (the default) retries all non-zero codes:
    retry:
      max_attempts: 3
      retryable_exit_codes: []  # Retry on any non-zero exit

Agent Step Retry
Retry agent steps the same way you retry command steps:
    analyze:
      type: agent
      provider: claude
      prompt: "Analyze: {{.inputs.code}}"
      timeout: 120
      retry:
        max_attempts: 3
        initial_delay: 2s
        backoff: exponential
      on_success: done

HTTP Operation Retry
For HTTP operations (REST API calls), AWF retries based on status codes:
    api_call:
      type: operation
      operation: http.request
      inputs:
        method: POST
        url: https://api.example.com/process
        body: "{{.inputs.data}}"
        retryable_status_codes: [429, 502, 503]  # Retry on rate limit or server error
      retry:
        max_attempts: 5
        initial_delay: 1s
        backoff: exponential
        multiplier: 2
        max_delay: 60s
      on_success: next

Complete Example: Reliable API Integration
This example shows a robust API integration with retry, error handling, and logging:
    name: reliable-api
    version: "1.0.0"
    inputs:
      - name: endpoint
        type: string
        required: true
        default: "https://api.example.com"
    states:
      initial: fetch_with_retry
      fetch_with_retry:
        type: operation
        operation: http.request
        inputs:
          method: GET
          url: "{{.inputs.endpoint}}/data"
          timeout: 30
          retryable_status_codes: [429, 502, 503, 504]
        retry:
          max_attempts: 5
          initial_delay: 1s
          backoff: exponential
          multiplier: 2
          max_delay: 32s
        on_success: process
        on_failure:
          message: "API call failed after 5 attempts: {{.error.message}}"
          status: 3
      process:
        type: agent
        provider: claude
        prompt: "Process this JSON: {{.states.fetch_with_retry.output}}"
        retry:
          max_attempts: 2
          initial_delay: 2s
          backoff: constant
        on_success: done
      done:
        type: terminal
        status: success

Validation Rules
AWF validates retry configurations to catch mistakes early:
| Rule | Error |
|---|---|
| max_attempts < 1 | max_attempts must be at least 1 |
| initial_delay invalid | invalid initial_delay: expected duration string |
| max_delay invalid | invalid max_delay: expected duration string |
| Unknown backoff | invalid backoff strategy: use constant, linear, or exponential |
| jitter outside [0, 1] | jitter must be between 0.0 and 1.0 |
| multiplier < 0 | multiplier must be non-negative |
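The rules in the table can be mirrored by a small pre-flight check, for example in a CI script that lints workflow files before they reach AWF. The Python sketch below is an approximation under stated assumptions: `validate_retry` is a hypothetical helper, the retry block is assumed to be already loaded into a dict, durations are checked against Go duration syntax, and AWF's real validator may differ in detail.

```python
import re

# Go-style duration: a bare "0", or one or more number+unit components.
_DURATION = re.compile(r"^(0|((\d+(\.\d*)?|\.\d+)(ns|us|µs|ms|s|m|h))+)$")
_BACKOFFS = {"constant", "linear", "exponential"}


def validate_retry(cfg: dict) -> list:
    """Return the error messages from the table that apply to `cfg`."""
    errors = []
    if cfg.get("max_attempts", 1) < 1:
        errors.append("max_attempts must be at least 1")
    for field in ("initial_delay", "max_delay"):
        value = cfg.get(field)
        if value is not None and not _DURATION.match(str(value)):
            errors.append(f"invalid {field}: expected duration string")
    if cfg.get("backoff", "constant") not in _BACKOFFS:
        errors.append("invalid backoff strategy: use constant, linear, or exponential")
    if not 0.0 <= cfg.get("jitter", 0.0) <= 1.0:
        errors.append("jitter must be between 0.0 and 1.0")
    if cfg.get("multiplier", 2.0) < 0:
        errors.append("multiplier must be non-negative")
    return errors


print(validate_retry({"max_attempts": 0, "backoff": "fibonacci"}))
print(validate_retry({"max_attempts": 3, "initial_delay": "1s"}))  # prints []
```

Missing fields fall back to the defaults listed in the reference table, so an empty retry block passes validation.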
Example error:
    $ awf run my-workflow
    ERROR validating workflow: step 'fetch': invalid max_attempts: 0

Common Patterns
Circuit Breaker (Give Up After Repeated Failures)
Use step transitions to skip retries after a threshold:
    deploy:
      type: step
      command: ./deploy.sh
      retry:
        max_attempts: 3
        initial_delay: 5s
        backoff: exponential
      on_success: verify
      on_failure: alert_ops
    alert_ops:
      type: terminal
      message: "Deployment failed after 3 attempts. Manual intervention required."
      status: 2

Jitter (Randomize Delays to Avoid Thundering Herd)
For distributed systems where many clients retry simultaneously, add randomization:
    retry:
      max_attempts: 5
      initial_delay: 1s
      backoff: exponential
      multiplier: 2
      jitter: 0.5  # Add ±50% randomness to each delay

This prevents multiple clients from retrying at exactly the same time, which can overwhelm the service.
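One way to picture jitter is as a random scale factor applied to each computed delay. The Python sketch below assumes a uniform, symmetric factor in [1 - jitter, 1 + jitter], which is one reading of "±50%"; the exact distribution AWF uses is not stated in this guide, and `apply_jitter` is a hypothetical name.

```python
import random


def apply_jitter(delay: float, jitter: float, rng: random.Random) -> float:
    """Scale `delay` by a random factor in [1 - jitter, 1 + jitter]."""
    if not 0.0 <= jitter <= 1.0:
        raise ValueError("jitter must be between 0.0 and 1.0")
    return delay * (1.0 + rng.uniform(-jitter, jitter))


rng = random.Random(42)  # seeded only so the demo is repeatable
base_delays = [1.0, 2.0, 4.0, 8.0]  # exponential schedule before jitter
print([round(apply_jitter(d, 0.5, rng), 3) for d in base_delays])
```

With `jitter: 0.5`, a nominal 4s delay lands somewhere between 2s and 6s, so clients that failed at the same moment spread their retries over a window instead of retrying in lockstep.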
Escalating Delays
For critical operations, increase delays over multiple retries:
    critical_task:
      type: step
      command: ./critical-operation.sh
      retry:
        max_attempts: 10
        initial_delay: 500ms
        backoff: exponential
        multiplier: 1.5
        max_delay: 5m  # Cap at 5 minutes
      on_success: done

With multiplier: 1.5, the delays are:
- 500ms
- 750ms
- 1.125s
- 1.6875s
- 2.531s, and so on up to about 12.8s before the ninth and final retry (the 5m cap only takes effect with a larger max_attempts or multiplier)
Troubleshooting
Retries Not Happening
Problem: Your step never retries even though it fails.
Causes:
- max_attempts not specified (defaults to 1 = no retry)
- Exit code not in retryable_exit_codes list

Solution:

    # Add explicit retry configuration
    retry:
      max_attempts: 3
      initial_delay: 1s

Delays Too Long
Problem: Retries take forever.
Causes:
- max_delay not specified on exponential backoff
- max_attempts set too high

Solution:

    retry:
      max_attempts: 5  # Reasonable limit
      initial_delay: 1s
      backoff: exponential
      max_delay: 30s  # Always cap exponential backoff

Some Failures Not Retrying
Problem: Step fails on certain errors but doesn’t retry.
Causes:
- Exit code not in retryable_exit_codes list
- retryable_exit_codes too restrictive

Solution:

    # Check which exit code your command produces
    $ ./my-script.sh; echo "Exit code: $?"

    # Then add it to retryable_exit_codes
    retry:
      retryable_exit_codes: [1, 22, 35]

Reference
Retry Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| max_attempts | int | 1 | Maximum number of attempts (1 = no retry) |
| initial_delay | duration | 0 | Delay before first retry |
| max_delay | duration | unlimited | Maximum delay between retries |
| backoff | string | constant | Strategy: constant, linear, exponential |
| multiplier | float | 2.0 | Multiplier for exponential backoff |
| jitter | float | 0.0 | Randomness factor (0.0-1.0) |
| retryable_exit_codes | array | all | Exit codes to retry (empty = all non-zero) |
Backoff Formulas
| Strategy | Formula |
|---|---|
| Constant | initial_delay |
| Linear | initial_delay × attempt_number |
| Exponential | initial_delay × multiplier^(attempt_number-1) |
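The three formulas translate directly into code. The Python sketch below is a transcription of the table, not AWF's implementation; `backoff_delay` is a hypothetical name, delays are in seconds, and `attempt_number` is taken to be 1 for the delay before the first retry, which is the reading that reproduces the delay sequences shown earlier in this guide.

```python
def backoff_delay(
    strategy: str,
    initial_delay: float,
    attempt_number: int,
    multiplier: float = 2.0,
) -> float:
    """Delay in seconds before retry number `attempt_number` (1-based)."""
    if strategy == "constant":
        return initial_delay
    if strategy == "linear":
        return initial_delay * attempt_number
    if strategy == "exponential":
        return initial_delay * multiplier ** (attempt_number - 1)
    raise ValueError("invalid backoff strategy: use constant, linear, or exponential")


# max_attempts: 5 leaves room for four retries, hence four delays.
for strategy in ("constant", "linear", "exponential"):
    print(strategy, [backoff_delay(strategy, 1.0, n) for n in range(1, 5)])
```

With `initial_delay: 1s` this prints 1, 1, 1, 1 for constant, 1, 2, 3, 4 for linear, and 1, 2, 4, 8 for exponential. Passing `initial_delay=0.5` and `multiplier=1.5` gives 0.5, 0.75, 1.125, 1.6875 for the exponential case.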
See Also
- Workflow Syntax Reference — Complete YAML syntax
- Agent Steps Guide — Retry for AI operations
- HTTP Operations — Retry for REST APIs