Webhook Retry Logic and Idempotency: A Developer's Guide

Published Feb 21, 2026 · 11 min read
[Diagram: webhook retry flow showing exponential backoff and idempotency key deduplication]

Webhook delivery is not guaranteed to succeed on the first attempt. Networks fail, servers restart, deployments cause brief downtime, and bugs cause unexpected errors. That is why every serious webhook system implements retry logic — and why every webhook consumer must implement idempotency. This guide covers both sides of the equation: how providers retry failed deliveries and how you should handle the duplicate events that retries inevitably create.

Why Retries Are Necessary

In a perfect world, every webhook would be delivered successfully on the first attempt. In the real world, temporary failures are common:

  • Your server is restarting during a deployment (30-60 seconds of downtime)
  • A network blip drops the connection between the provider and your endpoint
  • Your endpoint times out because it is under heavy load or performing slow operations
  • DNS resolution fails temporarily because of a stale or expired cache entry
  • TLS certificate renewal causes a brief window where connections fail
  • Your application throws an error due to a bug or unexpected payload

Without retries, every one of these scenarios would result in permanently lost events. For critical workflows like payment processing, that is unacceptable. A customer pays for a product, but your system never learns about it because a single HTTP request failed during a 2-second deployment window.

Retry logic transforms webhook delivery from "hope it works" to "guaranteed delivery, eventually."

How Webhook Retry Logic Works

The Retry Flow

When a webhook delivery fails, the provider's retry system follows this general flow:

1. Initial delivery attempt — the provider sends the webhook HTTP POST request to your endpoint and waits for a response, typically with a 5-30 second timeout.

2. Failure detection — the delivery is considered failed if the connection times out, the connection is refused, your endpoint returns a non-2xx status code (like 500 or 503), or the response is malformed.

3. Schedule retry with backoff — the provider calculates the next retry time using an exponential backoff algorithm and places the delivery back in the queue.

4. Repeat until success or exhaustion — the retry cycle continues until either your endpoint returns a 2xx response (success) or the maximum number of retries is reached, at which point the event goes to a dead letter queue or is dropped.

Exponential Backoff Explained

Exponential backoff increases the delay between retry attempts, preventing a flood of requests from overwhelming a recovering server. Here is a typical schedule:

Attempt 1:  Immediate         (original delivery)
Attempt 2:  +1 minute         (total elapsed: 1 min)
Attempt 3:  +5 minutes        (total elapsed: 6 min)
Attempt 4:  +30 minutes       (total elapsed: 36 min)
Attempt 5:  +2 hours          (total elapsed: 2h 36m)
Attempt 6:  +8 hours          (total elapsed: 10h 36m)
Attempt 7:  +24 hours         (total elapsed: 34h 36m)

The formula is typically: delay = min(base_delay * 2^attempt, max_delay) + random_jitter

The random jitter prevents the "thundering herd" problem — if thousands of webhooks fail at the same time (say, during an outage), you do not want them all retrying at exactly the same moments. Adding random jitter (a few seconds of randomness) spreads the retries across time.
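The schedule and formula above can be sketched in a few lines of JavaScript. The base delay, cap, and jitter window below are illustrative values, not any particular provider's policy:

```javascript
// Compute the delay before the next retry using exponential backoff plus jitter.
// attempt is zero-based: attempt 0 schedules the second delivery attempt.
function nextRetryDelayMs(attempt, baseDelayMs = 60_000, maxDelayMs = 86_400_000) {
  const exponential = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
  const jitterMs = Math.random() * 5_000; // a few seconds of randomness
  return exponential + jitterMs;
}

// Delays roughly double each attempt until they hit the 24-hour cap
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(`Retry ${attempt + 1}: ~${Math.round(nextRetryDelayMs(attempt) / 60_000)} min`);
}
```

Because the jitter is random, two failing deliveries scheduled in the same instant still land at slightly different retry times.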

Provider-Specific Retry Policies

Different providers have different retry policies:

Stripe: Retries up to 3 days with exponential backoff. After failures, Stripe shows failed events in the dashboard and sends email notifications. You can also retrieve missed events via the Events API.

GitHub: Retries within 24 hours. After 25 total failed attempts, the webhook is automatically disabled. GitHub shows delivery status in the webhook settings page.

Shopify: Retries 19 times over approximately 48 hours. After that, the webhook subscription is removed and must be re-registered.

Twilio: Retries for up to 24 hours with increasing intervals.

Understanding your provider's retry policy helps you design appropriate error handling and monitoring. If a provider gives up after 24 hours, you need to detect failures within that window.

Idempotency: The Consumer's Responsibility

Retries solve the delivery reliability problem but create a new challenge: duplicate events. When a network timeout occurs after your endpoint has already processed the webhook but before the provider receives your 200 response, the provider will retry — and you will receive the same event again.

If your handler charges a customer, sends an email, or creates a database record, processing the same event twice produces incorrect results. This is where idempotency comes in.

What Makes a Handler Idempotent

An idempotent handler produces the same outcome whether the same event is processed once, twice, or a hundred times. Here are the key principles:

Check before acting: Before processing an event, check whether it has already been processed.

Use event IDs as idempotency keys: Every webhook event has a unique identifier. Use it.

Make database operations idempotent: Use upserts instead of inserts where possible.

Implementing Idempotency with Redis

Redis provides fast, in-memory storage ideal for tracking processed event IDs:

const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

async function handleWebhook(req, res) {
  const eventId = req.body.id || req.headers['x-webhook-id'];

  if (!eventId) {
    console.error('Webhook missing event ID');
    return res.status(400).json({ error: 'Missing event ID' });
  }

  // Try to set the key — NX means "only if it does not exist"
  // EX sets a TTL of 7 days (604800 seconds)
  const isNew = await redis.set(
    `webhook:processed:${eventId}`,
    Date.now().toString(),
    'EX', 604800,
    'NX'
  );

  if (!isNew) {
    // Event already processed — acknowledge but skip processing
    console.log(`Duplicate webhook skipped: ${eventId}`);
    return res.status(200).json({ received: true, duplicate: true });
  }

  try {
    // Process the event
    await processEvent(req.body);
    res.status(200).json({ received: true });
  } catch (error) {
    // Processing failed — remove the key so the retry can be processed
    await redis.del(`webhook:processed:${eventId}`);
    console.error(`Webhook processing failed: ${eventId}`, error);
    res.status(500).json({ error: 'Processing failed' });
  }
}

Implementing Idempotency with a Database

If you prefer using your primary database instead of Redis:

// PostgreSQL example with a processed_events table
async function handleWebhookWithDB(req, res) {
  const eventId = req.body.id;
  const eventType = req.body.type;

  try {
    // Use a transaction to check and process atomically
    await db.transaction(async (tx) => {
      // Try to insert the event record
      // ON CONFLICT DO NOTHING prevents duplicate inserts
      const result = await tx.query(
        `INSERT INTO processed_events (event_id, event_type, processed_at)
         VALUES ($1, $2, NOW())
         ON CONFLICT (event_id) DO NOTHING
         RETURNING event_id`,
        [eventId, eventType]
      );

      if (result.rows.length === 0) {
        // Event already exists — duplicate
        console.log(`Duplicate webhook: ${eventId}`);
        return;
      }

      // Event is new — process it within the same transaction
      await processEventInTransaction(tx, req.body);
    });

    res.status(200).json({ received: true });
  } catch (error) {
    console.error('Webhook processing error:', error);
    res.status(500).json({ error: 'Processing failed' });
  }
}
-- Table for tracking processed events
CREATE TABLE processed_events (
  event_id VARCHAR(255) PRIMARY KEY,
  event_type VARCHAR(255) NOT NULL,
  processed_at TIMESTAMP NOT NULL DEFAULT NOW(),
  payload JSONB
);

-- Index for cleanup queries
CREATE INDEX idx_processed_events_date ON processed_events (processed_at);

-- Periodic cleanup: remove records older than 30 days
DELETE FROM processed_events WHERE processed_at < NOW() - INTERVAL '30 days';

The atomic check-and-process pattern is critical. If you check for duplicates and process the event in separate steps, a race condition can occur: two identical webhook deliveries arrive simultaneously, both pass the duplicate check (because neither has been processed yet), and both get processed. Use database transactions or Redis atomic operations to prevent this.
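To make the race concrete, here is a self-contained simulation: an in-memory Map stands in for Redis, and the event processing is reduced to a counter. The non-atomic handler checks and then sets with an await in between, so two concurrent deliveries of the same event both slip through; the atomic handler performs check-and-set as one step, as Redis SET NX or INSERT ... ON CONFLICT would:

```javascript
// In-memory stand-in for Redis; a real system would use SET NX or a DB constraint.
const store = new Map();
let processedRacy = 0;
let processedAtomic = 0;

// Racy: the check and the set are separate steps with an await between them.
async function handleRacy(eventId) {
  if (!store.has(`racy:${eventId}`)) {
    await Promise.resolve(); // simulates a round-trip to Redis or the database
    store.set(`racy:${eventId}`, true);
    processedRacy++; // both concurrent deliveries reach this line
  }
}

// Atomic: check-and-set happens as one indivisible step.
async function handleAtomic(eventId) {
  const key = `atomic:${eventId}`;
  if (!store.has(key)) {
    store.set(key, true); // no await between check and set
    processedAtomic++;
  }
}

async function demo() {
  await Promise.all([handleRacy('evt_1'), handleRacy('evt_1')]);
  await Promise.all([handleAtomic('evt_2'), handleAtomic('evt_2')]);
  return { processedRacy, processedAtomic }; // { processedRacy: 2, processedAtomic: 1 }
}
```

Running `demo()` shows the racy handler processing the duplicate pair twice while the atomic handler processes it exactly once.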

Deduplication Strategies

Beyond simple event ID tracking, there are more sophisticated deduplication approaches:

Content-Based Deduplication

When webhooks lack unique event IDs (rare but possible with some providers), you can compute a hash of the payload content:

const crypto = require('crypto');

function computePayloadHash(payload) {
  // Sort keys recursively so logically identical payloads hash the same.
  // (A key array passed to JSON.stringify would sort, and filter, only top-level keys.)
  const sortKeys = (v) =>
    Array.isArray(v) ? v.map(sortKeys)
      : v !== null && typeof v === 'object'
        ? Object.fromEntries(Object.keys(v).sort().map((k) => [k, sortKeys(v[k])]))
        : v;
  return crypto.createHash('sha256').update(JSON.stringify(sortKeys(payload))).digest('hex');
}

async function handleWebhook(req, res) {
  const payloadHash = computePayloadHash(req.body);

  // SET ... NX returns null when the key already exists, i.e. a duplicate
  const isNew = await redis.set(
    `webhook:hash:${payloadHash}`,
    '1',
    'EX', 3600, // 1 hour TTL
    'NX'
  );

  if (!isNew) {
    return res.status(200).json({ received: true, duplicate: true });
  }

  await processEvent(req.body);
  res.status(200).json({ received: true });
}

Sequence-Based Deduplication

Some providers include sequence numbers or version fields. You can use these to ensure you always process the latest version and skip stale events:

async function handleOrderUpdate(event) {
  const { order_id, version } = event.data;

  // Only process if this version is newer than what we have
  const result = await db.query(
    `UPDATE orders
     SET status = $1, updated_at = NOW(), event_version = $2
     WHERE id = $3 AND (event_version IS NULL OR event_version < $2)
     RETURNING id`,
    [event.data.status, version, order_id]
  );

  if (result.rows.length === 0) {
    console.log(`Skipping stale event for order ${order_id}: version ${version}`);
  }
}

Idempotent Database Operations

Design your database operations to be naturally idempotent using upserts:

// Instead of INSERT (which fails on duplicate)
await db.query(
  'INSERT INTO orders (id, status, amount) VALUES ($1, $2, $3)',
  [orderId, 'paid', amount]
);

// Use UPSERT (which handles duplicates gracefully)
await db.query(
  `INSERT INTO orders (id, status, amount, updated_at)
   VALUES ($1, $2, $3, NOW())
   ON CONFLICT (id) DO UPDATE
   SET status = EXCLUDED.status,
       amount = EXCLUDED.amount,
       updated_at = NOW()`,
  [orderId, 'paid', amount]
);

Dead Letter Queues

When all retry attempts fail, events need somewhere to go. A dead letter queue (DLQ) captures these failed deliveries for manual review and potential replay.

Implementing a Dead Letter Queue

async function handleWebhookWithDLQ(req, res) {
  try {
    await processEvent(req.body);
    res.status(200).json({ received: true });
  } catch (error) {
    const retryCount = parseInt(req.headers['x-retry-count'] || '0', 10);

    if (retryCount >= 5) {
      // Max retries exceeded — send to dead letter queue
      await sendToDeadLetterQueue({
        event: req.body,
        error: error.message,
        retryCount,
        timestamp: new Date().toISOString(),
        headers: req.headers
      });

      // Return 200 to stop the provider from retrying
      // (we have captured the event in our DLQ)
      res.status(200).json({ received: true, queued_for_review: true });
    } else {
      // Return 500 to trigger provider retry
      res.status(500).json({ error: 'Processing failed, please retry' });
    }
  }
}

async function sendToDeadLetterQueue(failedDelivery) {
  // Option 1: a database table (in practice, pick one of these two sinks)
  await db.query(
    `INSERT INTO webhook_dead_letter_queue
     (event_data, error_message, retry_count, failed_at)
     VALUES ($1, $2, $3, NOW())`,
    [JSON.stringify(failedDelivery.event), failedDelivery.error, failedDelivery.retryCount]
  );

  // Option 2: Message queue (SQS, Redis, etc.)
  await messageQueue.send('webhook-dlq', failedDelivery);

  // Alert the team
  console.error('Event sent to dead letter queue:', failedDelivery.event.id);
}

Processing the Dead Letter Queue

Build a mechanism to review and replay failed events:

// Admin endpoint to replay events from the DLQ
app.post('/admin/dlq/replay/:id', async (req, res) => {
  const dlqEntry = await db.query(
    'SELECT * FROM webhook_dead_letter_queue WHERE id = $1',
    [req.params.id]
  );

  if (!dlqEntry.rows.length) {
    return res.status(404).json({ error: 'DLQ entry not found' });
  }

  const event = JSON.parse(dlqEntry.rows[0].event_data);

  try {
    await processEvent(event);

    // Mark as replayed successfully
    await db.query(
      `UPDATE webhook_dead_letter_queue
       SET replayed_at = NOW(), replay_status = 'success'
       WHERE id = $1`,
      [req.params.id]
    );

    res.json({ status: 'replayed successfully' });
  } catch (error) {
    await db.query(
      `UPDATE webhook_dead_letter_queue
       SET replay_status = 'failed', replay_error = $1
       WHERE id = $2`,
      [error.message, req.params.id]
    );

    res.status(500).json({ error: 'Replay failed', details: error.message });
  }
});

When replaying events from a dead letter queue, remember that time has passed since the original event. The state of your system may have changed. Always design replay logic to check current state and handle situations where the event is no longer relevant — for example, an order may have been manually fulfilled while the webhook was stuck in the DLQ.
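A lightweight way to encode that check is a pure relevance function consulted before each replay. The statuses, `eventVersion` field, and event shape here are hypothetical; map them onto your own domain model:

```javascript
// Statuses meaning the order was already settled (possibly by hand), so a
// stale webhook replay should be skipped. Hypothetical example values.
const TERMINAL_STATUSES = new Set(['fulfilled', 'refunded', 'cancelled']);

function shouldReplay(event, currentOrder) {
  if (!currentOrder) return true; // nothing exists yet: the replay creates it
  if (TERMINAL_STATUSES.has(currentOrder.status)) return false; // already settled
  if (currentOrder.eventVersion != null && event.version != null &&
      event.version <= currentOrder.eventVersion) {
    return false; // a newer event has already been applied
  }
  return true;
}
```

A replay endpoint would call `shouldReplay` after loading the current record and mark the DLQ entry as skipped, rather than failed, when it returns false.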

Monitoring Delivery Health

Proactive monitoring catches delivery issues before they result in lost events.

Key Metrics to Track

  • First-attempt success rate — what percentage of webhooks succeed on the first try?
  • Retry rate — how often are retries needed? A rising retry rate signals endpoint issues.
  • Time to successful delivery — how long does it take for retried events to eventually succeed?
  • DLQ volume — how many events end up in the dead letter queue?
  • Processing errors by type — which event types cause the most failures?

Setting Up Alerts

// Track webhook delivery metrics
const metrics = {
  async recordDelivery(eventId, eventType, status, attemptNumber) {
    await db.query(
      `INSERT INTO webhook_metrics
       (event_id, event_type, status, attempt_number, recorded_at)
       VALUES ($1, $2, $3, $4, NOW())`,
      [eventId, eventType, status, attemptNumber]
    );

    // Alert on concerning patterns
    if (status === 'failed' && attemptNumber >= 3) {
      await alertTeam(`Webhook ${eventId} has failed ${attemptNumber} times`);
    }
  }
};

app.post('/webhook', async (req, res) => {
  const eventId = req.body.id;
  const attemptNumber = parseInt(req.headers['x-retry-count'] || '1', 10);

  try {
    await processEvent(req.body);
    await metrics.recordDelivery(eventId, req.body.type, 'success', attemptNumber);
    res.status(200).json({ received: true });
  } catch (error) {
    await metrics.recordDelivery(eventId, req.body.type, 'failed', attemptNumber);
    res.status(500).json({ error: 'Processing failed' });
  }
});

Building custom monitoring infrastructure is time-consuming. Webhookify provides comprehensive delivery monitoring out of the box — every webhook is logged with full request details, timing, and status. You get real-time alerts via Telegram, Discord, Slack, email, or push notifications when deliveries fail, so you can fix issues within the retry window. The AI-powered analysis even identifies patterns in failures, helping you pinpoint root causes faster.

Responding Correctly to Control Retries

Your HTTP response status code directly controls whether the provider retries:

| Status Code | Provider Action |
|---|---|
| 200-299 | Success — no retry |
| 301, 302 | Some providers follow redirects; others treat as failure |
| 400 | Bad request — most providers do not retry (your fault, not theirs) |
| 401, 403 | Auth error — some retry, some disable the subscription |
| 404 | Not found — typically no retry |
| 410 | Gone — provider disables the webhook subscription |
| 429 | Rate limited — provider retries with backoff |
| 500-599 | Server error — provider retries |
| Timeout | No response — provider retries |

Understanding these behaviors helps you control retry flow. Return 200 to stop retries (even if you are queuing the event internally), return 500 to request a retry, and return 410 if you want the provider to stop sending webhooks entirely.
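One practical consequence is the acknowledge-then-process pattern: validate the payload, persist it to an internal queue, and return 200 before doing any slow work. A minimal, framework-free sketch (the in-memory array stands in for a durable queue such as BullMQ or SQS):

```javascript
const queue = []; // stand-in for a durable job queue

// Decide the HTTP response without doing any slow processing inline.
function acceptWebhook(body) {
  if (!body || !body.id) {
    // Malformed payload: 400 tells most providers not to retry
    return { status: 400, body: { error: 'Missing event ID' } };
  }
  queue.push(body); // in production, a durable enqueue with its own retry logic
  return { status: 200, body: { received: true } }; // stops provider retries
}
```

A background worker then drains the queue on its own schedule, so a slow `processEvent` can never cause a provider-side timeout and the retry storm that follows.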


Summary

Building reliable webhook integrations requires understanding both sides of the delivery equation:

  1. Providers retry failed deliveries using exponential backoff over hours or days.
  2. Retries create duplicates that your handler must detect and skip.
  3. Idempotency keys (event IDs stored in Redis or your database) prevent double-processing.
  4. Atomic check-and-process prevents race conditions with concurrent deliveries.
  5. Dead letter queues capture events that fail all retry attempts.
  6. Monitoring catches delivery issues within the retry window.
  7. Correct HTTP status codes control whether the provider retries or gives up.
