Retries & Failures
Overview
Section titled “Overview”This section defines how failed publish attempts are handled and how retry behavior is applied.
Retry behavior is part of the processing lifecycle but is governed by implementation-defined policies.
Retry Principle
Section titled “Retry Principle”Failed publish attempts MUST result in a retry unless termination conditions are met.
Retries occur within the normal processing lifecycle.
Retry Scheduling
Section titled “Retry Scheduling”Implementations MAY apply retry scheduling strategies, including:
- immediate retry
- delayed retry using
available_at - backoff strategies (e.g., exponential, linear)
When delay is applied:
available_atSHOULD be updated to control when the event becomes eligible again
Retry Policies
Section titled “Retry Policies”Retry policies MUST be defined by the implementation.
Policies MAY include:
- maximum number of attempts
- time-based retry limits
- backoff strategies
- prioritization rules
Retry policies MUST NOT violate:
- delivery semantics
- processing lifecycle rules
Failure Classification
Section titled “Failure Classification”Implementations MAY classify failures to determine retry behavior.
Examples include:
- transient failures (e.g., network issues, temporary broker unavailability)
- permanent failures (e.g., invalid payload, schema mismatch)
Classification MAY influence:
- retry timing
- termination decisions
Termination Conditions
Section titled “Termination Conditions”Termination conditions define when an event is no longer eligible for automatic retry.
When termination conditions are met:
- the event MUST transition to
DEAD - the event MUST NOT be retried automatically
Termination conditions are implementation-defined.
Dead Events
Section titled “Dead Events”Events in the DEAD state:
- MUST NOT be retried automatically
- MAY require operator intervention
- MAY be replayed
Operator Intervention
Section titled “Operator Intervention”Operators MAY:
- inspect failed or dead events
- trigger replay of events
- override retry behavior
Operator actions MUST respect:
- state transition rules
- delivery semantics
Backoff and Scheduling
Section titled “Backoff and Scheduling”Implementations MAY use backoff strategies to avoid overwhelming the target system.
Backoff strategies SHOULD:
- reduce retry frequency over time
- prevent tight retry loops
Failure Visibility
Section titled “Failure Visibility”Implementations SHOULD provide visibility into failures, including:
- error details (
last_error) - retry attempts (
attempts) - timing (
available_at,claimed_at)
This information SHOULD support debugging and operational monitoring.
Idempotency Considerations
Section titled “Idempotency Considerations”Since retries may result in duplicate delivery:
- consumers SHOULD be idempotent
- implementations MAY use
event_idfor deduplication
Guarantees
Section titled “Guarantees”Retry behavior MUST preserve:
- delivery semantics
- state transition correctness
- event durability