Webhook Error Handling Best Practices

Published Feb 21, 2026 · 11 min read
Figure: Webhook error handling architecture showing the queue-then-acknowledge pattern and failure recovery

Handling webhook errors correctly is the difference between a reliable integration and a fragile one. When your webhook endpoint mishandles errors — returning wrong status codes, processing too slowly, or failing silently — you lose events, trigger unnecessary retries, process duplicates, and miss critical business data. This guide covers every error handling pattern you need to build webhook consumers that are resilient, observable, and production-ready.

The Foundation: Respond First, Process Later

The single most important principle of webhook error handling is: acknowledge receipt immediately, then process asynchronously. Every other pattern in this guide builds on this foundation.

Why Synchronous Processing Fails

When you process a webhook synchronously (handling all business logic before returning a response), several things can go wrong:

// BAD: Synchronous processing — many failure modes
app.post('/webhook', async (req, res) => {
  try {
    const event = req.body;

    // Each of these steps can fail or be slow:
    await verifySignature(req);           // 10ms
    await lookupCustomer(event.data);     // 200ms
    await updateDatabase(event.data);     // 500ms
    await sendConfirmationEmail(event);   // 2000ms
    await notifySlackChannel(event);      // 1000ms
    await updateAnalytics(event);         // 300ms

    // Total: 4+ seconds — dangerously close to timeout
    res.status(200).json({ received: true });
  } catch (error) {
    // ANY failure returns 500, triggering a full retry
    res.status(500).json({ error: 'Processing failed' });
  }
});

Problems with this approach:

  • If the email service is slow, the entire webhook times out
  • If analytics fails, a payment webhook gets retried, and the steps that already succeeded (database update, confirmation email) run a second time
  • The provider's delivery worker is blocked for 4+ seconds
  • Any single failure causes the entire processing chain to fail

The Async Alternative

// GOOD: Acknowledge immediately, process asynchronously
app.post('/webhook',
  express.raw({ type: 'application/json' }),
  async (req, res) => {
    try {
      // Step 1: Verify signature (fast — must be synchronous)
      const event = verifyAndParse(req);

      // Step 2: Queue for processing (fast — just a write)
      await eventQueue.add('webhook-processing', {
        event,
        receivedAt: Date.now()
      });

      // Step 3: Respond immediately
      res.status(200).json({ received: true });
    } catch (error) {
      if (error.type === 'signature_invalid') {
        res.status(401).json({ error: 'Invalid signature' });
      } else {
        res.status(500).json({ error: 'Failed to queue event' });
      }
    }
  }
);

The endpoint does only two things: verify the signature and write the event to a queue. Everything else happens in a background worker.
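The handler above relies on a `verifyAndParse` helper that is not shown. The following is a minimal sketch of what it could look like, assuming the provider signs the raw request body with HMAC-SHA256 and sends the hex digest in an `x-webhook-signature` header. The header name, encoding, and signed content are assumptions here; providers differ, so check the documentation for yours.

```javascript
const crypto = require('crypto');

// Assumed scheme: hex-encoded HMAC-SHA256 of the raw request body.
function verifySignature(rawBody, signatureHex, secret) {
  if (typeof signatureHex !== 'string' || !secret) return false;

  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');

  // timingSafeEqual throws on a length mismatch, so compare lengths first
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}

// Throws an error tagged with the `type` the handler above checks for
function verifyAndParse(req, secret = process.env.WEBHOOK_SECRET) {
  const signature = req.headers['x-webhook-signature'];
  if (!verifySignature(req.body, signature, secret)) {
    const error = new Error('Invalid signature');
    error.type = 'signature_invalid';
    throw error;
  }
  return JSON.parse(req.body.toString('utf8'));
}
```

The comparison uses `crypto.timingSafeEqual` rather than `===` so that the time taken does not reveal how many leading bytes of a forged signature were correct.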

The Queue-Then-Acknowledge Pattern

The queue-then-ack pattern is the gold standard for webhook processing. Here is a complete implementation:

Step 1: Receive and Queue

const { Queue, Worker } = require('bullmq');

const webhookQueue = new Queue('webhooks', {
  connection: { host: 'localhost', port: 6379 }
});

app.post('/webhook',
  express.raw({ type: 'application/json' }),
  async (req, res) => {
    // Verify signature
    const signature = req.headers['x-webhook-signature'];
    if (!verifySignature(req.body, signature, process.env.WEBHOOK_SECRET)) {
      return res.status(401).json({ error: 'Invalid signature' });
    }

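    let event;
    try {
      event = JSON.parse(req.body);
    } catch (error) {
      // Malformed JSON will never succeed on retry
      return res.status(400).json({ error: 'Invalid JSON payload' });
    }

    try {
      // Write to durable queue
      await webhookQueue.add(event.type, {
        id: event.id,
        type: event.type,
        data: event.data,
        headers: {
          signature: req.headers['x-webhook-signature'],
          timestamp: req.headers['x-webhook-timestamp']
        },
        receivedAt: new Date().toISOString()
      }, {
        // BullMQ job options
        attempts: 5,
        backoff: { type: 'exponential', delay: 5000 },
        removeOnComplete: { age: 86400 }, // Keep completed jobs for 24 hours
        removeOnFail: false // Keep failed jobs for inspection
      });
    } catch (error) {
      // Queue write failed: return 503 so the provider retries delivery
      return res.status(503).json({ error: 'Failed to queue event' });
    }

    res.status(200).json({ received: true });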
  }
);

Step 2: Process in Background Worker

const worker = new Worker('webhooks', async (job) => {
  const { id, type, data } = job.data;

  console.log(`Processing webhook ${id} (${type}), attempt ${job.attemptsMade + 1}`);

  switch (type) {
    case 'payment_intent.succeeded':
      await handlePaymentSuccess(data);
      break;
    case 'customer.subscription.deleted':
      await handleSubscriptionCancellation(data);
      break;
    case 'invoice.payment_failed':
      await handlePaymentFailure(data);
      break;
    default:
      console.log(`Unhandled event type: ${type}`);
  }
}, {
  connection: { host: 'localhost', port: 6379 },
  concurrency: 10 // Process up to 10 events simultaneously
});

// Handle worker-level events
worker.on('completed', (job) => {
  console.log(`Webhook ${job.data.id} processed successfully`);
});

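worker.on('failed', (job, err) => {
  // job can be undefined when the failure is not tied to a specific job
  if (!job) {
    console.error('Webhook job failed:', err.message);
    return;
  }
  console.error(`Webhook ${job.data.id} failed:`, err.message);
  if (job.attemptsMade >= job.opts.attempts) {
    // All retries exhausted: send to dead letter queue, including the attempt count
    moveToDeadLetterQueue({ ...job.data, attempts: job.attemptsMade }, err.message)
      .catch((dlqErr) => console.error('Dead letter queue write failed:', dlqErr.message));
  }
});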

Step 3: Dead Letter Queue for Final Failures

async function moveToDeadLetterQueue(eventData, errorMessage) {
  await db.query(
    `INSERT INTO webhook_dead_letter_queue
     (event_id, event_type, payload, error_message, failed_at, attempts)
     VALUES ($1, $2, $3, $4, NOW(), $5)`,
    [
      eventData.id,
      eventData.type,
      JSON.stringify(eventData),
      errorMessage,
      eventData.attempts || 5
    ]
  );

  // Alert the team
  await sendAlert({
    channel: 'webhook-failures',
    message: `Webhook event ${eventData.id} (${eventData.type}) moved to dead letter queue after ${eventData.attempts || 5} failed attempts. Error: ${errorMessage}`
  });
}

The queue-then-ack pattern handles the tension between two competing requirements: the provider wants a fast response (within seconds), and your business logic might need to do slow operations (database queries, API calls, emails). By separating receipt from processing, you satisfy both requirements without compromise.
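To get a feel for how long a failing event stays in retry before it reaches the dead letter queue, it helps to work out the schedule implied by the job options in Step 1 (`attempts: 5`, exponential backoff with `delay: 5000`). The sketch below assumes the delay doubles on each retry, starting from the base delay; `retrySchedule` is a hypothetical helper, and the exact formula should be confirmed against the documentation for your BullMQ version.

```javascript
// Assumption: delay before retry n (1-based) = baseDelayMs * 2^(n - 1).
// `attempts` counts the first try, so there are attempts - 1 retries.
function retrySchedule(attempts, baseDelayMs) {
  const delays = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

const delays = retrySchedule(5, 5000);
const totalMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays);  // [ 5000, 10000, 20000, 40000 ]
console.log(totalMs); // 75000
```

Under that assumption, an event that keeps failing is dead-lettered roughly 75 seconds after its first attempt, plus processing time. If your downstream outages typically last longer than that, consider raising `attempts` or `delay`.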

Returning the Right HTTP Status Codes

Your HTTP response directly controls the provider's behavior. Return the wrong code, and you either lose events or create unnecessary retries.

Status Codes and Their Effects

// Reference only: a real handler sends exactly ONE of these responses per request
app.post('/webhook', async (req, res) => {
  // 200: Success — event received and accepted
  // Provider will NOT retry
  res.status(200).json({ received: true });

  // 202: Accepted — event received, processing deferred
  // Provider will NOT retry (2xx = success)
  res.status(202).json({ accepted: true, processing: 'queued' });

  // 400: Bad Request — invalid payload (your fault or theirs)
  // Most providers will NOT retry (client error)
  res.status(400).json({ error: 'Invalid payload format' });

  // 401: Unauthorized — signature verification failed
  // Behavior varies: some retry, some disable the webhook
  res.status(401).json({ error: 'Invalid signature' });

  // 410: Gone — endpoint permanently removed
  // Some providers will DISABLE the webhook subscription
  res.status(410).json({ error: 'Endpoint no longer exists' });

  // 429: Too Many Requests — rate limited
  // Provider will retry with backoff
  res.status(429).json({ error: 'Rate limit exceeded' });

  // 500: Internal Server Error — temporary failure
  // Provider WILL retry
  res.status(500).json({ error: 'Internal error, please retry' });

  // 503: Service Unavailable — temporarily down
  // Provider WILL retry
  res.status(503).json({ error: 'Service temporarily unavailable' });
});

Strategic Status Code Usage

Use status codes strategically to control retry behavior:

app.post('/webhook', async (req, res) => {
  try {
    // Signature verification — return 401 if invalid
    if (!verifySignature(req)) {
      return res.status(401).json({ error: 'Invalid signature' });
    }

    // Payload validation — return 400 for permanently invalid payloads
    if (!isValidPayload(req.body)) {
      return res.status(400).json({ error: 'Invalid payload' });
    }

    // Queue the event
    await queue.add(req.body);

    // Return 200 — event is safely queued
    return res.status(200).json({ received: true });
  } catch (error) {
    if (error.code === 'QUEUE_UNAVAILABLE') {
      // Queue is down — return 503 to trigger retry
      return res.status(503).json({ error: 'Queue unavailable' });
    }

    // Unknown error — return 500 to trigger retry
    return res.status(500).json({ error: 'Internal error' });
  }
});

Be very careful with 410 (Gone). Providers that honor this status code treat it as a request to permanently disable your webhook subscription. Return it only when you intentionally want to stop receiving webhooks; an accidental 410 during a deployment can force you to re-register the webhook manually.
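One way to keep status code decisions in a single place is to map error types to codes in a small helper. This is a sketch only: the error class names below are hypothetical and do not come from any library, and 410 is deliberately absent so it can never be returned by accident.

```javascript
// Hypothetical error classes; the names are illustrative
class InvalidSignatureError extends Error {}
class InvalidPayloadError extends Error {}
class QueueUnavailableError extends Error {}

// Map a failure to the status code that produces the retry behavior you want
function statusForError(error) {
  if (error instanceof InvalidSignatureError) return 401; // reject, do not process
  if (error instanceof InvalidPayloadError) return 400;   // permanent: a retry will not help
  if (error instanceof QueueUnavailableError) return 503; // transient: ask for a retry
  return 500;                                             // unknown: safer to ask for a retry
}
```

In a handler's catch block this becomes a single call such as `res.status(statusForError(error))`, which makes the retry policy easy to review and test in isolation.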

Circuit Breaker Pattern

When your webhook handler depends on external services (databases, APIs, email providers), a failure in one service can cause all webhook processing to fail. The circuit breaker pattern prevents cascading failures.

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 60000; // 1 minute
    this.state = 'CLOSED'; // CLOSED = normal, OPEN = failing, HALF_OPEN = testing
    this.failureCount = 0;
    this.lastFailureTime = null;
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      // Check if enough time has passed to try again
      if (Date.now() - this.lastFailureTime >= this.resetTimeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN — service unavailable');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      console.error(`Circuit breaker OPENED after ${this.failureCount} failures`);
    }
  }
}

// Usage in webhook processing
const dbCircuitBreaker = new CircuitBreaker({ failureThreshold: 3, resetTimeout: 30000 });
const emailCircuitBreaker = new CircuitBreaker({ failureThreshold: 5, resetTimeout: 60000 });

async function processPaymentWebhook(event) {
  // Database update — critical, fail the job if circuit is open
  await dbCircuitBreaker.execute(async () => {
    await db.query('UPDATE orders SET status = $1 WHERE id = $2',
      ['paid', event.data.order_id]);
  });

  // Email notification — non-critical, skip gracefully if circuit is open
  try {
    await emailCircuitBreaker.execute(async () => {
      await sendReceiptEmail(event.data.customer_email, event.data);
    });
  } catch (error) {
    // Reached when the circuit is open OR when the send itself failed
    console.warn('Receipt email not sent, queueing for later:', error.message);
    await queueEmailForLater(event.data);
  }
}

Graceful Degradation

Not all processing steps are equally important. Design your webhook handler to degrade gracefully when non-critical operations fail.

async function handleOrderWebhook(event) {
  const results = {
    critical: [],
    nonCritical: []
  };

  // CRITICAL: These MUST succeed or the job should retry
  try {
    await updateOrderDatabase(event.data);
    results.critical.push({ step: 'database', status: 'success' });
  } catch (error) {
    results.critical.push({ step: 'database', status: 'failed', error: error.message });
    throw error; // Re-throw to trigger job retry
  }

  // NON-CRITICAL: These SHOULD succeed but failure is acceptable
  const nonCriticalTasks = [
    { name: 'email', fn: () => sendConfirmationEmail(event.data) },
    { name: 'analytics', fn: () => trackAnalyticsEvent(event.data) },
    { name: 'slack', fn: () => notifySlackChannel(event.data) },
    { name: 'crm', fn: () => updateCRMRecord(event.data) }
  ];

  for (const task of nonCriticalTasks) {
    try {
      await task.fn();
      results.nonCritical.push({ step: task.name, status: 'success' });
    } catch (error) {
      console.warn(`Non-critical task '${task.name}' failed:`, error.message);
      results.nonCritical.push({ step: task.name, status: 'failed', error: error.message });

      // Queue failed non-critical tasks for later retry
      await retryQueue.add('retry-task', {
        task: task.name,
        eventId: event.id,
        data: event.data,
        failedAt: new Date().toISOString()
      });
    }
  }

  return results;
}

This approach ensures that a failure in your Slack notification does not prevent a payment from being recorded. Critical operations cause retries; non-critical operations fail gracefully and are queued for later processing.
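The handler above runs non-critical tasks one after another, so a slow email send delays the Slack notification. When the tasks are independent, a variation is to run them concurrently with `Promise.allSettled`, which waits for every task and never rejects. The sketch below assumes the same `{ name, fn }` task shape; `runNonCritical` is a hypothetical helper name.

```javascript
// Run independent non-critical tasks concurrently and report per-task outcomes.
// `tasks` is an array of { name, fn } objects.
async function runNonCritical(tasks) {
  // The async wrapper turns synchronous throws into rejections as well
  const settled = await Promise.allSettled(tasks.map(async (task) => task.fn()));

  return settled.map((outcome, i) => (
    outcome.status === 'fulfilled'
      ? { step: tasks[i].name, status: 'success' }
      : {
          step: tasks[i].name,
          status: 'failed',
          error: String(outcome.reason?.message ?? outcome.reason)
        }
  ));
}
```

Entries with `status: 'failed'` can then be pushed onto the retry queue exactly as in the sequential version.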

Alerting on Webhook Failures

You cannot fix what you do not know about. Set up comprehensive alerting for webhook failures:

Alert Levels

const alertConfig = {
  // Level 1: Single failure — log it
  singleFailure: (event, error) => {
    console.error(`Webhook processing failed: ${event.id}`, error.message);
  },

  // Level 2: Repeated failures — alert the team
  repeatedFailures: async (event, error, attemptCount) => {
    if (attemptCount >= 3) {
      await sendSlackAlert({
        channel: '#webhook-alerts',
        text: `Webhook ${event.id} (${event.type}) has failed ${attemptCount} times. Latest error: ${error.message}`
      });
    }
  },

  // Level 3: Dead letter queue — page someone
  deadLetterQueue: async (event, error) => {
    await sendPagerDutyAlert({
      severity: 'high',
      summary: `Webhook ${event.id} moved to dead letter queue after all retries exhausted`,
      details: {
        eventType: event.type,
        eventId: event.id,
        error: error.message
      }
    });
  },

  // Level 4: High failure rate — incident
  highFailureRate: async (failureRate, window) => {
    if (failureRate > 0.1) { // More than 10% failure rate
      await sendPagerDutyAlert({
        severity: 'critical',
        summary: `Webhook failure rate is ${(failureRate * 100).toFixed(1)}% over the last ${window} minutes`
      });
    }
  }
};
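The Level 4 alert takes a `failureRate`, but the configuration does not show where that number comes from. A minimal sketch of one way to compute it is a sliding-window tracker. `FailureRateTracker` is a hypothetical, in-memory helper: its counts reset on restart and are not shared across worker processes, so a multi-instance deployment would need a shared store instead.

```javascript
// Tracks job outcomes in a sliding time window
class FailureRateTracker {
  constructor(windowMs = 5 * 60 * 1000) {
    this.windowMs = windowMs;
    this.outcomes = []; // { at: epoch ms, failed: boolean }, oldest first
  }

  record(failed, now = Date.now()) {
    this.outcomes.push({ at: now, failed });
  }

  rate(now = Date.now()) {
    // Drop entries that have aged out of the window
    const cutoff = now - this.windowMs;
    while (this.outcomes.length && this.outcomes[0].at < cutoff) {
      this.outcomes.shift();
    }
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((o) => o.failed).length;
    return failures / this.outcomes.length;
  }
}
```

A worker could call `tracker.record(false)` in its `completed` handler and `tracker.record(true)` in its `failed` handler, then pass `tracker.rate()` to `highFailureRate` on a timer. With very few samples the rate is noisy, so a minimum event count before alerting is worth considering.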

Monitoring with Webhookify

Rather than building custom alerting infrastructure, Webhookify provides real-time alerts for webhook delivery and processing issues. When webhooks to your Webhookify endpoints fail or show unusual patterns, you receive immediate notifications via Telegram, Discord, Slack, email, or push notifications. The mobile app plays a distinctive cash register sound for payment events, giving you audible confirmation that revenue is flowing.

Set up a two-tier alerting system: use Webhookify for real-time delivery monitoring (catching issues at the network level), and use your own application-level alerts for processing failures (catching issues in your business logic). This gives you complete visibility across both layers of the webhook pipeline.

Handling Specific Error Scenarios

Database Unavailable

async function handleWithDatabaseFallback(event) {
  try {
    await db.query('INSERT INTO events (id, data) VALUES ($1, $2)',
      [event.id, JSON.stringify(event.data)]);
  } catch (error) {
    if (error.code === 'ECONNREFUSED' || error.code === 'ETIMEDOUT') {
      // Database is down — write to local fallback
      await writeToLocalFallback(event);
      console.error('Database unavailable, event written to local fallback');
      return; // Do not throw — event is safely persisted
    }
    throw error; // Re-throw other database errors
  }
}

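const fs = require('fs/promises');

async function writeToLocalFallback(event) {
  // /tmp is often cleared on reboot; use a persistent directory in production
  const fallbackDir = '/tmp/webhook-fallback';
  await fs.mkdir(fallbackDir, { recursive: true }); // writeFile fails if the directory is missing
  await fs.writeFile(`${fallbackDir}/${event.id}.json`, JSON.stringify(event));
  // A recovery job periodically reads from this directory and writes to the database
}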
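The recovery job mentioned in the comment above is not shown. A minimal sketch of one possible shape follows, assuming one JSON file per event in the fallback directory. `recoverFallbackEvents` and its `persist` callback are hypothetical names; `persist` stands in for whatever database write you use.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Replays fallback files through `persist`. A file is deleted only after
// `persist` resolves, so a crash mid-run re-delivers the event instead of
// losing it. `persist` should therefore be idempotent.
async function recoverFallbackEvents(dir, persist) {
  const recovered = [];
  for (const name of await fs.readdir(dir)) {
    if (!name.endsWith('.json')) continue;
    const file = path.join(dir, name);
    try {
      const event = JSON.parse(await fs.readFile(file, 'utf8'));
      await persist(event);
      await fs.unlink(file);
      recovered.push(event.id);
    } catch (error) {
      // Leave the file in place; the next run will try again
      console.error(`Recovery failed for ${name}:`, error.message);
    }
  }
  return recovered;
}
```

Running this on an interval (for example with `setInterval` or a scheduled job) drains the directory once the database is reachable again.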

External API Timeout

async function callExternalAPIWithTimeout(url, data, timeoutMs = 5000) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetch(url, {
      method: 'POST',
      body: JSON.stringify(data),
      headers: { 'Content-Type': 'application/json' },
      signal: controller.signal
    });

    if (!response.ok) {
      throw new Error(`API returned ${response.status}`);
    }

    return await response.json();
  } catch (error) {
    if (error.name === 'AbortError') {
      throw new Error(`External API timed out after ${timeoutMs}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeout);
  }
}

Payload Schema Changes

async function handleWithSchemaValidation(event) {
  // Validate against expected schema
  const validation = validateSchema(event);

  if (!validation.valid) {
    // Log the schema violation for investigation
    console.warn('Webhook payload schema violation:', {
      eventId: event.id,
      eventType: event.type,
      errors: validation.errors,
      payload: JSON.stringify(event.data).slice(0, 500)
    });

    // Try to process with available fields (forward compatibility)
    try {
      await processWithFlexibleSchema(event);
    } catch (processingError) {
      // Cannot process — alert team about potential breaking change
      await sendAlert(`Webhook schema change detected for ${event.type}. Review required.`);
      throw processingError;
    }
  } else {
    await processEvent(event);
  }
}
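The `validateSchema` function used above is left undefined. As a rough sketch, a validator that supports forward compatibility checks only the fields the worker actually relies on and ignores anything extra, so additive changes by the provider do not fail validation. The required fields below are assumptions based on the event shape used throughout this guide; a schema library would be a reasonable substitute.

```javascript
// Hypothetical minimal validator: required top-level fields and their types
const REQUIRED_FIELDS = {
  id: 'string',
  type: 'string',
  data: 'object'
};

function validateSchema(event) {
  if (event === null || typeof event !== 'object') {
    return { valid: false, errors: ['event must be an object'] };
  }

  const errors = [];
  for (const [field, expectedType] of Object.entries(REQUIRED_FIELDS)) {
    const value = event[field];
    if (value === undefined || value === null) {
      errors.push(`missing field: ${field}`);
    } else if (typeof value !== expectedType) {
      errors.push(`field ${field} should be ${expectedType}, got ${typeof value}`);
    }
  }
  return { valid: errors.length === 0, errors };
}
```

The return value matches the `{ valid, errors }` shape that `handleWithSchemaValidation` reads.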

Error Handling Architecture Summary

Here is the complete error handling architecture in one view:

Webhook Received
    │
    ├── Signature Invalid? ──> Return 401
    │
    ├── Payload Invalid? ──> Return 400
    │
    ├── Queue Write Fails? ──> Return 503 (trigger provider retry)
    │
    └── Queue Write Succeeds ──> Return 200
                │
                ▼
        Background Worker
                │
                ├── Processing Succeeds ──> Mark Complete
                │
                ├── Critical Step Fails ──> Retry with Backoff
                │       │
                │       ├── Retry Succeeds ──> Mark Complete
                │       │
                │       └── All Retries Fail ──> Dead Letter Queue + Alert
                │
                └── Non-Critical Step Fails ──> Log + Queue for Later

