
Writing Webhooks That Don't Ruin Your Weekend

Webhooks look simple until you put them in production. Here are the specific things that go wrong and the patterns that actually make them reliable.

EvolRed Team · 7 min read

Webhooks are deceptively simple. Service A sends a POST to service B, B does something with it, done. That is the tutorial version. Then you put them in production and learn that the happy path was twenty percent of the work.

The other eighty percent is the stuff that only shows up when something is actually depending on the webhook: retries, replays, ordering, authentication, and the class of failure modes that only manifest on a Saturday night.

Here is what we have learned, having written webhook producers and consumers for enough systems to have strong opinions.

Webhooks Fail. Design for That First.

The single most important thing to understand about webhooks is that delivery is not guaranteed, even with the best-behaved producer. Networks drop requests. Your consumer will be down when the webhook fires. The producer's retry will happen at the wrong time. A TLS handshake will time out halfway through.

Starting from "deliveries are best-effort and may arrive zero, one, or many times" changes how you design everything downstream. You need idempotent consumers. You need replay tooling. You need a way to reconcile state when webhooks are lost.

If you cannot articulate what happens in your system when a webhook is delivered three times, you have a bug waiting to happen.

The Idempotency Key Is Non-Negotiable

Every webhook payload should carry a unique event ID, and your consumer should record which event IDs it has already processed.

This sounds obvious and is routinely skipped. The typical shape we see: a consumer that processes every incoming webhook, creates a record in the database, and assumes the producer will not send the same event twice. Then the producer retries because the response was slow, and you get duplicates.

The fix is one database column. When a webhook arrives, check the event ID against a processed_events table. If it is there, return 200 and do nothing. If not, process it and insert the ID. Use a unique constraint so the insert fails rather than allowing a race condition.
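As a sketch of that pattern (SQLite stands in for your real database, and the `process` callback for your business logic; both names are illustrative):

```python
import sqlite3

def make_db():
    # In-memory store for illustration; in production this is your main database.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    return db

def handle_event(db, event_id, process):
    """Run `process()` at most once per event_id. Returns True if it ran."""
    try:
        with db:  # one transaction: the insert and the work commit or roll back together
            # The PRIMARY KEY constraint makes a duplicate insert raise,
            # so two concurrent deliveries cannot both get past this line.
            db.execute("INSERT INTO processed_events VALUES (?)", (event_id,))
            process()
    except sqlite3.IntegrityError:
        return False  # already processed: return 200 upstream and do nothing
    return True
```

Doing the insert and the processing in one transaction means a failed handler rolls back the event ID too, so the producer's retry gets a clean second attempt.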

Stripe's webhook documentation has been the reference for this pattern for years, and it is still the right approach. The specifics vary by provider, but the pattern does not.

Signature Verification, Every Time

Any webhook endpoint exposed to the internet should verify a signature on every request. Without this, anyone who learns your URL can send you arbitrary events.

Most producers provide a signing mechanism — Stripe uses HMAC-SHA256 with a shared secret, GitHub uses a similar pattern, Slack uses a timestamp plus signature scheme. The details vary; the commitment is the same. Verify the signature, reject anything that fails, and treat the signing secret like any other secret.

Two specific things to get right: verify against the raw request body, not the parsed JSON (many signatures are sensitive to formatting), and check the timestamp to reject replayed old events.
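A generic verifier in this style might look like the following. The `timestamp.body` signing layout and the five-minute tolerance are modeled loosely on Stripe's scheme, not copied from any one provider; check your provider's documentation for the exact format.

```python
import hashlib
import hmac
import time

TOLERANCE = 300  # reject events older than 5 minutes to block replays

def verify(raw_body, timestamp, signature, secret, now=None):
    """Verify an HMAC-SHA256 signature computed over `timestamp.raw_body`."""
    if abs((now if now is not None else time.time()) - int(timestamp)) > TOLERANCE:
        return False  # too old (or from the future): likely a replay
    # Sign the RAW request bytes, never re-serialized JSON.
    signed = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the signature via timing.
    return hmac.compare_digest(expected, signature)
```

Note the two details from above baked in: the raw body goes into the HMAC, and the timestamp check rejects stale deliveries.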

Respond Fast, Process Async

The other source of webhook pain is slow consumers. The producer has a timeout — usually somewhere between 10 and 30 seconds — and will retry if your response does not arrive in time. This means that a webhook handler that does significant work synchronously is setting itself up for duplicate deliveries.

The correct shape is: accept the webhook, record it, respond with 200 immediately, and process asynchronously. A queue, a background worker, whatever fits your stack.

The consumer's job is to say "I got it" quickly. Everything else is downstream work that can take as long as it needs to, with its own retry logic, error handling, and observability.
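A minimal sketch of that shape, using an in-process queue where production code would use a durable one (the 200 return stands in for whatever response object your framework uses):

```python
import queue
import threading

events = queue.Queue()

def webhook_handler(event):
    """The synchronous part: record and acknowledge. No business logic here."""
    events.put(event)  # in production: a durable queue, not an in-process one
    return 200         # respond immediately; the producer's timeout clock stops here

def worker():
    """The asynchronous part: does the real work, with its own retry logic."""
    while True:
        event = events.get()
        if event is None:
            break  # shutdown sentinel
        process(event)
        events.task_done()

def process(event):
    event["handled"] = True  # stand-in for the real downstream work
```

The producer only ever sees the fast acknowledgement; everything slow lives behind the queue.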

This also means that if the downstream work fails, that failure should not cause the webhook to be re-delivered. The event has already been acknowledged; from there, the consumer's own retry logic is responsible.

Build Replay Tooling Before You Need It

The question "can we replay webhooks from the last 24 hours?" will be asked. It will be asked at 11pm on a Saturday, when something in your system has gone wrong and you need to catch up on missed state.

You have two options. You can have thought about this in advance and have a tool that does it. Or you can not have thought about it, in which case you will be writing the tool at 11pm on a Saturday.

The implementation depends on what you are integrating with. Some providers offer replay endpoints — Stripe, GitHub, and most mature webhook producers have some equivalent. For producers that do not, you need to persist everything you receive (with the idempotency key and signature intact) so you can reconstruct state from your own records.
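For the persist-everything path, a sketch might look like this. The table layout and function names are illustrative; the point is storing the exact raw bytes and signature so you can re-verify and re-dispatch later.

```python
import json
import sqlite3
import time

def make_store():
    db = sqlite3.connect(":memory:")  # in production: your real database
    db.execute("""CREATE TABLE raw_events (
        event_id    TEXT PRIMARY KEY,
        received_at REAL,
        raw_body    BLOB,   -- exact bytes, so signatures can be re-verified
        signature   TEXT)""")
    return db

def record(db, event_id, raw_body, signature):
    # INSERT OR IGNORE: a re-delivery never overwrites the first copy.
    with db:
        db.execute("INSERT OR IGNORE INTO raw_events VALUES (?,?,?,?)",
                   (event_id, time.time(), raw_body, signature))

def replay_since(db, seconds, handler):
    """Re-run every stored event from the last `seconds` through `handler`.
    Safe to run repeatedly because the handler is idempotent."""
    cutoff = time.time() - seconds
    rows = db.execute("SELECT event_id, raw_body FROM raw_events "
                      "WHERE received_at >= ? ORDER BY received_at, event_id",
                      (cutoff,))
    for event_id, raw_body in rows:
        handler(event_id, json.loads(raw_body))
```

Because replay goes through the same idempotent handler as live traffic, re-running it is boring rather than dangerous, which is exactly what you want at 11pm on a Saturday.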

Either way: decide in advance how you would recover from a missed-webhooks incident, and build the minimum tooling to do so before it happens.

The Ordering Problem

Webhooks are not guaranteed to arrive in the order the events occurred. This surprises people regularly.

The producer sends event A at 10:00:00 and event B at 10:00:01. A takes longer to deliver because of a network hiccup. Your consumer sees B first, then A. If your processing logic assumes order, you now have inconsistent state.

The fix depends on what the events represent. If they carry enough information for your consumer to reconstruct the correct state regardless of order, order does not matter. If they are deltas — "add this thing", "remove this thing" — you need to handle them idempotently and reconcile.

Where possible, prefer webhooks that carry the full current state rather than deltas, or at minimum carry a version or timestamp you can use to resolve out-of-order arrivals.
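When events do carry a version, the resolution logic can be very small. A sketch, assuming each event carries the full current state plus a `version` field:

```python
def apply_update(current, incoming):
    """Keep whichever snapshot carries the higher version, regardless of
    arrival order. Assumes full-state events with a monotonic version."""
    if current is None or incoming["version"] > current["version"]:
        return incoming
    return current  # a stale event arrived late: drop it
```

With this shape, the out-of-order delivery from the example above resolves itself: the late-arriving older event is simply discarded.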

Observability Is Not Optional

Every webhook should be logged on arrival with its event ID, its signature verification result, and its processing outcome. Every failure should alert you in a way you will actually see.

The specific thing we check when reviewing webhook implementations: can you answer "how many webhooks did we receive in the last hour, and how many failed?" in under a minute? If not, the observability is not there yet.
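With a simple arrival log, that question becomes one query. A sketch (the table layout and outcome labels are illustrative):

```python
import sqlite3
import time

def make_log():
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE webhook_log (
        event_id     TEXT,
        received_at  REAL,
        signature_ok INTEGER,
        outcome      TEXT)""")
    return db

def log_event(db, event_id, signature_ok, outcome):
    with db:
        db.execute("INSERT INTO webhook_log VALUES (?,?,?,?)",
                   (event_id, time.time(), int(signature_ok), outcome))

def last_hour_stats(db):
    """The under-a-minute answer: (received, failed) over the last hour."""
    cutoff = time.time() - 3600
    received, failed = db.execute(
        "SELECT COUNT(*), SUM(outcome != 'ok') FROM webhook_log "
        "WHERE received_at >= ?", (cutoff,)).fetchone()
    return received, failed or 0
```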

This matters more for webhooks than for most other system components because webhook failures are silent. There is no user on the other end refreshing the page. The only way you will know things are broken is if you are watching, which means you need to set up the watching before the breakage.

The Producing Side

We have mostly talked about consuming webhooks, but if you are producing them, the responsibilities are mirrored.

Include a unique event ID on every event. Sign the payload. Retry on failure with exponential backoff. Do not retry forever — give up after some point and surface the failure. Provide a replay mechanism. Document your timeout, your retry schedule, and the order guarantees (or lack thereof) that consumers can rely on.
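The retry-with-backoff item on that list can be sketched as follows (the attempt count and delays are illustrative defaults, and `send` stands in for your actual HTTP delivery):

```python
import time

def deliver_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Try `send()` with exponential backoff. Gives up after max_attempts
    and reports failure instead of retrying forever."""
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False  # surface this: dead-letter queue, alert, dashboard
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...
```

Injecting `sleep` keeps the schedule testable; real producers usually also add jitter so retries from many consumers do not land in lockstep.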

Your consumers will thank you for any of this. All of it together is rare, and it is why webhook integrations with well-run providers feel so much less painful than integrations with poorly-run ones. The same principles apply when designing APIs more broadly — the parts consumers can see matter more than the parts you find clever.

The Uncomfortable Truth

Most webhook implementations we see are fine until they are not, and when they are not, the failure is ugly. The gap between "works in the happy path" and "reliable in production" is not technically hard. It is just work that does not feel like progress when you are building the feature.

Spending a day up front on idempotency, signature verification, async processing, and replay tooling is one of the highest-leverage things you can do on any webhook integration. Skipping it is the reliable route to weekend pager incidents.


Building something that depends on a webhook integration working reliably? Get in touch — this is one of our favourite topics, and it comes up regularly in our custom development work.