Error Handling & Retry Patterns — Cross-Cutting Theme#

Before diving into individual services, it’s worth internalizing a set of patterns that appear repeatedly across SQS, SNS, Lambda, API Gateway, DynamoDB, Kinesis, Step Functions, and EventBridge. The exam tests these concepts in context — meaning you’ll see them embedded in scenario questions about specific services — so understanding them as a unified mental model first will save you significant effort later.

The core problem these patterns solve: distributed systems fail partially and unpredictably. A network timeout doesn’t tell you whether the operation succeeded or not. A downstream service may be temporarily overloaded. Your code needs a principled strategy for retrying, giving up gracefully, and not making things worse.

Exponential Backoff with Jitter#

When a call fails, the naive response is to retry immediately. This is usually wrong: if the downstream service is struggling under load, hammering it with immediate retries from thousands of clients simultaneously creates a thundering herd — a spike of traffic that prevents recovery.

Exponential backoff addresses this by increasing the wait time between retries geometrically: 1s → 2s → 4s → 8s → … The AWS SDKs implement this automatically for retryable errors (throttling, transient failures) 🔗.

Jitter adds a random offset to each backoff interval. Without it, clients that all started failing at the same moment will still retry in synchronized waves even with backoff. With jitter, retries are spread across time, smoothing the load. AWS explicitly recommends the “Full Jitter” or “Decorrelated Jitter” strategies 🔗.

In practice: the AWS SDK handles this for you at the HTTP client level, but when building retry logic inside your own application code (e.g., a Lambda retrying a DynamoDB write in a loop), you are responsible for implementing it correctly.
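A minimal sketch of such a loop, using the Full Jitter strategy. The function names and the injectable `sleep` parameter are illustrative, not an AWS API:

```python
import random
import time

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full Jitter: wait a random amount between 0 and the exponentially
    growing ceiling (base * 2^attempt), capped so waits stay bounded."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Retry `operation` on failure, backing off with Full Jitter.
    Re-raises the last exception once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(full_jitter_delay(attempt))
```

In real code you would catch only errors you know are retryable (throttling, transient 5xx responses), never a bare `Exception` — retrying a validation error just wastes attempts.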

Idempotency#

An operation is idempotent if performing it multiple times produces the same result as performing it once. This is the property that makes retry-safe systems possible — if you can’t be certain whether a previous attempt succeeded, you need to be able to safely try again.

  • A PUT /items/{id} that replaces the item is idempotent. A POST /items that creates a new item is not — retrying it creates duplicates.
  • A DynamoDB PutItem is idempotent if the item content is deterministic, because it simply overwrites the item. An UpdateItem that increments a counter is not — each retry increments again.

Designing for idempotency typically means using client-generated idempotency keys (a UUID generated by the caller and stored with the request) so that the receiver can detect and ignore duplicate submissions. Several AWS services have built-in idempotency support: Lambda 🔗, API Gateway 🔗, and SQS FIFO queues (via deduplication IDs) all expose this concept directly.
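To make the mechanism concrete, here is a toy sketch of an idempotency-key check. All names are illustrative; a real receiver would store seen keys durably (e.g., in DynamoDB with a conditional write), not in an in-process dict:

```python
import uuid

class IdempotentReceiver:
    """Toy receiver: detects duplicate submissions by idempotency key
    and replays the stored result instead of repeating the side effect."""

    def __init__(self):
        self._results = {}  # key -> result of the first successful attempt

    def handle(self, idempotency_key: str, payload: str) -> str:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # duplicate: no new side effect
        result = f"created:{payload}"              # stand-in for the real work
        self._results[idempotency_key] = result
        return result

key = str(uuid.uuid4())            # client-generated, reused on retry
receiver = IdempotentReceiver()
first = receiver.handle(key, "order-42")
second = receiver.handle(key, "order-42")  # retry after an ambiguous timeout
assert first == second                     # the work happened exactly once
```

The key point is that the caller generates the key once and reuses it across retries; a fresh UUID per attempt would defeat the deduplication entirely.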

Dead-Letter Queues (DLQ)#

A Dead-Letter Queue is a secondary queue (or destination) where a message or event is sent after it has failed processing a configured number of times. Rather than losing the message or blocking the queue indefinitely, the system isolates the failure so it can be inspected, reprocessed, or alerted on.

DLQs appear across multiple services, each with slightly different configuration:

  • SQS — A DLQ is itself an SQS queue. After a message has been received but not deleted maxReceiveCount times, it is moved to the DLQ. The DLQ must match the source queue’s type: a Standard queue needs a Standard DLQ, a FIFO queue needs a FIFO DLQ. 🔗
  • SNS — DLQs are configured per-subscription, not per-topic. If SNS can’t deliver to a subscriber (e.g., an HTTP endpoint is down), after exhausting retries it sends the message to the subscription’s DLQ. 🔗
  • Lambda — Failure handling differs by invocation model. For asynchronous invocations (e.g., triggered by SNS or S3), Lambda retries and can then route the failed event to a function-level DLQ or an OnFailure destination (an SQS queue or SNS topic) that captures the failed payload. For event source mappings (SQS, Kinesis), failed batches fall back to the source queue’s own DLQ or, for streams, a configured on-failure destination. 🔗
  • EventBridge — Rules and Pipes support DLQ configuration for events that fail to reach their target after retries. 🔗
  • Step Functions — Does not use DLQs directly; error handling is built into the state machine via Catch and Retry blocks (covered in the Step Functions section).
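For SQS specifically, the link between source queue and DLQ is the RedrivePolicy queue attribute: a JSON string naming the DLQ’s ARN and the receive threshold. A sketch, with a hypothetical account and queue name:

```python
import json

# Hypothetical ARN. maxReceiveCount is the number of times a message may be
# received without being deleted before SQS moves it to the DLQ.
redrive_policy = json.dumps({
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    "maxReceiveCount": "5",
})

# Applied to the *source* queue (not the DLQ), e.g. with boto3:
#   sqs.set_queue_attributes(
#       QueueUrl=source_queue_url,
#       Attributes={"RedrivePolicy": redrive_policy},
#   )
```

A common pitfall: the policy is set on the source queue, and the DLQ is just an ordinary queue you create yourself — SQS does not create it for you.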

A common exam pattern: a question describes messages disappearing silently — the correct answer almost always involves configuring a DLQ so failures are visible and recoverable rather than lost.

At-Least-Once vs Exactly-Once Delivery#

This distinction matters because it determines what guarantees you can rely on without building additional logic yourself.

At-least-once delivery means the system guarantees a message will be delivered, but it may be delivered more than once (e.g., due to a retry after a timeout). SQS Standard queues operate this way. Your consumer must handle duplicate messages gracefully — which brings you back to idempotency.

Exactly-once delivery means each message is delivered precisely one time, eliminating duplicates at the infrastructure level. SQS FIFO queues provide exactly-once processing within a 5-minute deduplication window, using a MessageDeduplicationId. 🔗
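One way to supply that deduplication ID is to hash the message body, so a retried send of the same logical message carries the same ID. A sketch (FIFO queues can also derive this automatically when content-based deduplication is enabled on the queue):

```python
import hashlib

def dedup_id(body: str) -> str:
    """Deterministic MessageDeduplicationId for an SQS FIFO send: the
    hex SHA-256 of the body (64 chars, within the 128-char limit)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

# A retry of the same send produces the identical ID, so SQS discards
# the duplicate within the 5-minute deduplication window:
#   sqs.send_message(QueueUrl=url, MessageBody=body,
#                    MessageGroupId="orders",
#                    MessageDeduplicationId=dedup_id(body))
```

Note that deduplication is by ID, not by content: two messages with different bodies but the same ID would also be deduplicated, which is why a content hash is a safe default.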

Kinesis Data Streams, by contrast, operates at-least-once — the Kinesis Client Library (KCL) checkpoints progress, but network issues can cause records to be delivered more than once. Step Functions Standard Workflows guarantee exactly-once execution of each state. Express Workflows are at-least-once.

The practical takeaway: default to designing consumers as idempotent, regardless of the delivery guarantee. Even services that claim exactly-once delivery have edge cases, and idempotent consumers are robust by definition.