Retries & Failures

This section defines how failed publish attempts are handled and how retry behavior is applied.

Retry behavior is part of the processing lifecycle but is governed by implementation-defined policies.


Failed publish attempts MUST result in a retry unless termination conditions are met.

Retries occur within the normal processing lifecycle.


Implementations MAY apply retry scheduling strategies, including:

  • immediate retry
  • delayed retry using available_at
  • backoff strategies (e.g., exponential, linear)

When delay is applied:

  • available_at SHOULD be updated to control when the event becomes eligible again
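
As an illustration of delayed retry, the following Python sketch pushes available_at into the future and checks eligibility against it. The Event class and the schedule_retry and is_eligible helpers are hypothetical names introduced for this example; only available_at comes from this specification. A zero delay corresponds to immediate retry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Event:
    """Hypothetical in-memory event record used only for this sketch."""
    event_id: str
    available_at: Optional[datetime] = None  # None means "eligible immediately"


def schedule_retry(event: Event, delay: timedelta, now: datetime) -> Event:
    """Delay the next attempt by moving available_at forward."""
    event.available_at = now + delay
    return event


def is_eligible(event: Event, now: datetime) -> bool:
    """An event is eligible once available_at has passed (or was never set)."""
    return event.available_at is None or event.available_at <= now


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    event = schedule_retry(Event("e1"), timedelta(seconds=30), now)
    print(is_eligible(event, now))                          # not yet eligible
    print(is_eligible(event, now + timedelta(seconds=30)))  # eligible again
```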

Retry policies MUST be defined by the implementation.

Policies MAY include:

  • maximum number of attempts
  • time-based retry limits
  • backoff strategies
  • prioritization rules

Retry policies MUST NOT violate:

  • delivery semantics
  • processing lifecycle rules
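
One way an implementation might express such a policy is as a small immutable configuration object. The sketch below is an assumption, not a required shape: RetryPolicy, max_attempts, max_age, allows_retry, and the created_at timestamp are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy covering an attempt limit and a time-based limit."""
    max_attempts: Optional[int] = 5                      # None disables the attempt limit
    max_age: Optional[timedelta] = timedelta(hours=24)   # None disables the time limit

    def allows_retry(self, attempts: int, created_at: datetime, now: datetime) -> bool:
        """Return True while neither configured limit has been reached."""
        if self.max_attempts is not None and attempts >= self.max_attempts:
            return False
        if self.max_age is not None and now - created_at >= self.max_age:
            return False
        return True


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    policy = RetryPolicy(max_attempts=3, max_age=timedelta(hours=1))
    print(policy.allows_retry(2, created, created + timedelta(minutes=5)))
    print(policy.allows_retry(3, created, created + timedelta(minutes=5)))
```

Keeping the policy as plain data makes it easy to validate against delivery semantics and lifecycle rules before it is applied.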

Implementations MAY classify failures to determine retry behavior.

Example failure classes include:

  • transient failures (e.g., network issues, temporary broker unavailability)
  • permanent failures (e.g., invalid payload, schema mismatch)

Classification MAY influence:

  • retry timing
  • termination decisions
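
A minimal sketch of failure classification, assuming failures surface as Python exceptions. FailureKind, InvalidPayloadError, and classify are hypothetical names; the mapping of exception types to classes, and the choice to treat unknown errors as transient, are assumptions an implementation may reasonably make differently.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


class InvalidPayloadError(ValueError):
    """Hypothetical error for a payload the target will never accept."""


def classify(error: Exception) -> FailureKind:
    """Map an exception to a failure class used by retry decisions."""
    if isinstance(error, InvalidPayloadError):
        return FailureKind.PERMANENT
    if isinstance(error, (ConnectionError, TimeoutError)):
        return FailureKind.TRANSIENT
    # Assumption: unknown errors are treated as transient so they are retried.
    return FailureKind.TRANSIENT


if __name__ == "__main__":
    print(classify(ConnectionError("broker unavailable")))
    print(classify(InvalidPayloadError("schema mismatch")))
```

A permanent classification could feed directly into a termination decision, while a transient one could select a longer or shorter delay.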

Termination conditions define when an event is no longer eligible for automatic retry.

When termination conditions are met:

  • the event MUST transition to DEAD
  • the event MUST NOT be retried automatically

Termination conditions are implementation-defined.
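
The transition to DEAD can be sketched as follows. This example assumes a maximum attempt count as the only termination condition and a PENDING state for events awaiting retry; PENDING, the Event class, on_publish_failure, and retry_candidates are hypothetical and not defined by this section.

```python
from dataclasses import dataclass
from typing import List, Optional

DEAD = "DEAD"
PENDING = "PENDING"  # hypothetical name for "waiting to be retried"


@dataclass
class Event:
    """Hypothetical in-memory event record used only for this sketch."""
    event_id: str
    state: str = PENDING
    attempts: int = 0
    last_error: Optional[str] = None


def on_publish_failure(event: Event, error: Exception, max_attempts: int = 5) -> Event:
    """Record the failure, then either leave the event retryable or mark it DEAD."""
    event.attempts += 1
    event.last_error = str(error)
    # Assumed termination condition: the attempt limit has been reached.
    event.state = DEAD if event.attempts >= max_attempts else PENDING
    return event


def retry_candidates(events: List[Event]) -> List[Event]:
    """DEAD events are never selected for automatic retry."""
    return [e for e in events if e.state != DEAD]


if __name__ == "__main__":
    event = Event("e1")
    for _ in range(3):
        on_publish_failure(event, TimeoutError("timed out"), max_attempts=3)
    print(event.state, retry_candidates([event]))
```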


Events in the DEAD state:

  • MUST NOT be retried automatically
  • MAY require operator intervention before any further processing
  • MAY be replayed

Operators MAY:

  • inspect failed or dead events
  • trigger replay of events
  • override retry behavior

Operator actions MUST respect:

  • state transition rules
  • delivery semantics
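
An operator-triggered replay might look like the sketch below. It assumes a single permitted transition, from DEAD back to a hypothetical PENDING state, and rejects anything else so that state transition rules are respected. Resetting attempts on replay is also an assumption; an implementation could keep the counter instead.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

DEAD = "DEAD"
PENDING = "PENDING"  # hypothetical name for "eligible for processing"


@dataclass
class Event:
    """Hypothetical in-memory event record used only for this sketch."""
    event_id: str
    state: str = PENDING
    attempts: int = 0
    available_at: Optional[datetime] = None


def replay(event: Event, now: datetime) -> Event:
    """Make a DEAD event eligible again at an operator's request."""
    if event.state != DEAD:
        # Only DEAD -> PENDING is allowed in this sketch.
        raise ValueError(f"cannot replay event in state {event.state}")
    event.state = PENDING
    event.attempts = 0         # assumption: replay starts a fresh retry budget
    event.available_at = now   # eligible immediately
    return event


if __name__ == "__main__":
    dead = Event("e1", state=DEAD, attempts=5)
    print(replay(dead, datetime.now(timezone.utc)).state)
```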

Implementations MAY use backoff strategies to avoid overwhelming the target system.

Backoff strategies SHOULD:

  • reduce retry frequency over time
  • prevent tight retry loops
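
The two strategies named earlier, exponential and linear, can be sketched as plain delay functions. The function names and the default values for base, factor, step, and cap are illustrative assumptions, not recommended settings. A positive base or step keeps a minimum gap between attempts, and the cap bounds the delay for events that fail many times.

```python
def exponential_backoff(attempt: int, base: float = 1.0,
                        factor: float = 2.0, cap: float = 300.0) -> float:
    """Delay in seconds before retry number `attempt` (1-based), capped at `cap`."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    try:
        delay = base * factor ** (attempt - 1)
    except OverflowError:
        # Very large attempt counts overflow a float; fall back to the cap.
        delay = cap
    return min(cap, delay)


def linear_backoff(attempt: int, step: float = 5.0, cap: float = 300.0) -> float:
    """Delay in seconds that grows by `step` with each attempt, capped at `cap`."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(cap, step * attempt)


if __name__ == "__main__":
    print([exponential_backoff(n) for n in range(1, 6)])
    print([linear_backoff(n) for n in range(1, 6)])
```

The returned delay would typically be added to the current time and stored in available_at. Many implementations also add random jitter so that events that failed together do not all retry at the same instant.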

Implementations SHOULD provide visibility into failures, including:

  • error details (last_error)
  • retry attempts (attempts)
  • timing (available_at, claimed_at)

This information SHOULD support debugging and operational monitoring.
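
As one possible way to expose this information, the sketch below turns the fields listed above into a JSON-serializable summary suitable for logs or a monitoring endpoint. EventRecord and failure_summary are hypothetical names; only the field names come from this specification.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class EventRecord:
    """Hypothetical view of the stored fields relevant to failure visibility."""
    event_id: str
    attempts: int
    last_error: Optional[str]
    available_at: Optional[datetime]
    claimed_at: Optional[datetime]


def _iso(ts: Optional[datetime]) -> Optional[str]:
    """Render a timestamp as ISO 8601, leaving unset values as None."""
    return ts.isoformat() if ts is not None else None


def failure_summary(event: EventRecord) -> Dict[str, Any]:
    """Collect the failure-related fields into a plain dictionary."""
    return {
        "event_id": event.event_id,
        "attempts": event.attempts,
        "last_error": event.last_error,
        "available_at": _iso(event.available_at),
        "claimed_at": _iso(event.claimed_at),
    }


if __name__ == "__main__":
    record = EventRecord("e1", 3, "broker unavailable",
                         datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), None)
    print(json.dumps(failure_summary(record)))
```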


Because retries can result in duplicate delivery:

  • consumers SHOULD be idempotent
  • implementations MAY use event_id for deduplication
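
Deduplication by event_id on the consumer side can be sketched as below. IdempotentConsumer is a hypothetical class, and the in-memory set is an assumption made to keep the example self-contained; a real consumer would usually need a durable store so that deduplication survives restarts.

```python
from typing import Any, Callable, Set


class IdempotentConsumer:
    """Wraps a handler so that each event_id is processed at most once."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # assumption: in-memory only, not durable

    def handle(self, event_id: str, payload: Any) -> bool:
        """Process the payload unless event_id was already handled.

        Returns True if the handler ran, False if the delivery was a duplicate.
        """
        if event_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeded, so a failed
        # delivery can still be retried.
        self._seen.add(event_id)
        return True


if __name__ == "__main__":
    received = []
    consumer = IdempotentConsumer(received.append)
    consumer.handle("e1", {"total": 10})
    consumer.handle("e1", {"total": 10})  # duplicate delivery is ignored
    print(received)
```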

Retry behavior MUST preserve:

  • delivery semantics
  • state transition correctness
  • event durability