Webhook delivery degradation
Incident Report for Onfido
Postmortem

Summary

Between 03:00 UTC and 05:20 UTC on 28th March, a large increase in duplicate webhook notifications led to webhook queues starting to grow.

Webhook events created between 05:20 and 09:37 stalled because the message queue was completely congested, so they were not delivered at the time. Delivery of these queued webhooks resumed at 12:10, and the backlog was fully cleared by 13:30.

All new events from 09:37 were processed as normal.

Root Causes

We hit the in-flight message quota that our cloud provider imposes on a queue. This was caused by a combination of:

  • An unusually high volume of duplicate webhook messages; and
  • The configuration of our retry strategy.

All webhook messages were being processed via a single queue; to get around the quota limit, a secondary queue was required so that new messages could be processed.

The retry process was preventing previously queued messages from being cleared; hence, retries were temporarily suspended to resume processing.
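
The interaction between the in-flight quota and retries can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the production setup: `ToyQueue`, `drain`, and `simulate` are hypothetical names, the quota of 3 is arbitrary, and the model assumes that a message awaiting retry keeps occupying an in-flight slot, which this report implies but does not spell out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class ToyQueue:
    """Toy queue with a cap on in-flight (received but unacknowledged) messages."""

    in_flight_quota: int
    waiting: deque = field(default_factory=deque)
    in_flight: Set[str] = field(default_factory=set)

    def send(self, message_id: str) -> None:
        self.waiting.append(message_id)

    def receive(self) -> Optional[str]:
        # Once the quota is reached, nothing further is handed to consumers.
        if len(self.in_flight) >= self.in_flight_quota or not self.waiting:
            return None
        message_id = self.waiting.popleft()
        self.in_flight.add(message_id)
        return message_id

    def ack(self, message_id: str) -> None:
        self.in_flight.discard(message_id)


def drain(queue: ToyQueue, deliver: Callable[[str], bool]) -> List[str]:
    """Receive until the queue hands out nothing; ack only successful deliveries."""
    delivered = []
    while (message_id := queue.receive()) is not None:
        if deliver(message_id):
            queue.ack(message_id)
            delivered.append(message_id)
        # Assumption: a failed delivery stays in flight while it awaits a retry.
    return delivered


def simulate():
    queue = ToyQueue(in_flight_quota=3)
    for message_id in ["dup-1", "dup-2", "dup-3", "new-1"]:
        queue.send(message_id)

    def deliver(message_id: str) -> bool:
        # In this toy model, deliveries of the "dup-" messages always fail.
        return not message_id.startswith("dup")

    before = drain(queue, deliver)  # quota fills up before "new-1" is reached

    # Suspending retries is modelled by releasing the stuck messages;
    # how this was done in production is not described in this report.
    for message_id in list(queue.in_flight):
        queue.ack(message_id)

    after = drain(queue, deliver)
    return before, after
```

In this model the first `drain` delivers nothing because the three failing messages fill the quota, and the new event is delivered only after the stuck messages are released.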

Timeline

  • 28/03/2024 03:00 UTC: Webhook events began to back up in the queue, leading to a gradual degradation in webhook processing.
  • 28/03/2024 04:08 UTC: Monitoring alerted on-call to a build-up in queued webhook events, and an engineer began investigating.
  • 28/03/2024 04:20-04:55 UTC: The webhook service was scaled up, but this did not resolve the problem.
  • 28/03/2024 05:20 UTC: The in-flight message quota was reached, and newly created events were no longer delivered.
  • 28/03/2024 08:25 UTC: AWS were contacted to increase the in-flight message quota.
  • 28/03/2024 08:30 UTC: It was determined that a secondary queue, inheriting an increased quota, would be required to unblock new webhook deliveries.
  • 28/03/2024 09:37 UTC: Secondary queue deployed, unblocking new webhook deliveries; in the meantime, engineers continued to work on a solution for events generated between 05:20 and 09:37.
  • 28/03/2024 12:10 UTC: Retries were temporarily suspended so that the queued webhook events generated between 05:20 and 09:37 could be delivered.
  • 28/03/2024 13:30 UTC: All queued webhooks cleared.

Remedies

  • Additional monitoring to alert on-call when approaching queue quota limits (DONE).
  • Move webhook retries to a dedicated secondary queue to avoid blocking new events from being processed (ETA: April 2024).
  • Introduce filtering to discard duplicate messages, to avoid redundant queue expansion (ETA: April 2024).
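
The three remedies above can be sketched together as follows. This is an illustrative sketch only, not the implemented fix: `quota_alert_needed`, `DuplicateFilter`, `route`, the queue names, the 80% alert threshold, and the idea of a per-event key are all assumptions introduced for the example.

```python
import time
from typing import Callable, Dict, Optional


def quota_alert_needed(in_flight: int, quota: int, threshold: float = 0.8) -> bool:
    """Return True once in-flight usage reaches the given fraction of the quota."""
    return in_flight >= quota * threshold


class DuplicateFilter:
    """Remembers recently seen event keys and rejects repeats within a time window."""

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._seen: Dict[str, float] = {}

    def is_new(self, event_key: str) -> bool:
        now = self.clock()
        # Forget keys older than the window so memory stays bounded.
        self._seen = {key: seen_at for key, seen_at in self._seen.items()
                      if now - seen_at < self.window_seconds}
        if event_key in self._seen:
            return False
        self._seen[event_key] = now
        return True


def route(event_key: str, attempt: int, dedup: DuplicateFilter) -> Optional[str]:
    """Pick a queue name: retries get their own queue, duplicates are dropped."""
    if attempt > 0:
        # Retries bypass the duplicate filter and never share the primary queue.
        return "retry-queue"
    return "primary-queue" if dedup.is_new(event_key) else None
```

For example, with `dedup = DuplicateFilter(window_seconds=60)`, a first call to `route("evt-1", 0, dedup)` selects the primary queue, an immediate repeat of the same call returns `None`, and `route("evt-1", 1, dedup)` selects the retry queue.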
Posted Apr 09, 2024 - 08:33 UTC

Resolved
This incident has been resolved.
Posted Mar 28, 2024 - 13:42 UTC
Monitoring
We are delivering the missing webhooks from 03:00 to 10:30 UTC.
Posted Mar 28, 2024 - 13:12 UTC
Update
We're working on a script to resend webhooks from 03:00 to 10:30 UTC. In the meantime, if you rely on this webhook data, we invite you to use API calls to retrieve it.
Posted Mar 28, 2024 - 11:24 UTC
Update
We've deployed a fix. New webhooks should now be delivered correctly. Older webhooks will still experience latency; we'll post an update about them soon.
Posted Mar 28, 2024 - 09:35 UTC
Identified
We've identified the issue and are applying a fix. We'll post an update later.
Posted Mar 28, 2024 - 08:46 UTC
Update
We are still investigating the issue.
Posted Mar 28, 2024 - 08:09 UTC
Update
Clients may see duplicate webhook events or delayed delivery. We're still investigating the root cause.
Posted Mar 28, 2024 - 07:11 UTC
Investigating
We are currently investigating the issue.
Posted Mar 28, 2024 - 06:28 UTC
This incident affected: Europe (onfido.com) (Webhooks).