Slow delivery of webhook messages
Incident Report for Onfido
Postmortem

Summary

A customer with misconfigured webhooks started sending an anomalously high number of requests, causing our webhook retry queue to back up. This resulted in high webhook latency across all customers for just over an hour. Webhook retries (webhooks that failed on the first or a subsequent delivery attempt) were the worst affected.

Timeline

  • 20th April 13:28 BST: our on-call engineers were alerted that the webhook queue size was above normal levels and webhook notifications were delayed
  • 20th April 13:45 BST: after triaging, we updated status.onfido.com to alert customers of the issue
  • 20th April 14:06 BST: after investigation, the on-call engineers purged the webhook retry queue
  • 20th April 14:07 BST: webhook latency returned to normal and our alerts resolved
  • 20th April 14:40 BST: we identified that the growth of the webhook retry queue was due to a single customer with misconfigured webhooks. We adjusted the rate limit for that customer and contacted them to help fix their configuration.
  • 20th April 14:47 BST: we updated status.onfido.com to mark the incident as resolved

Root Cause

Our webhook service has an exponential back-off retry mechanism to ensure messages are delivered in the case of network errors or downtime in our customers’ servers.
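
As an illustration only, an exponential back-off schedule can be sketched as below. The base delay, the cap and the optional jitter are assumptions made for the example; this report does not state the values or the exact algorithm our webhook service uses.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 3600.0,
                  jitter: bool = False) -> float:
    """Return the delay in seconds before retry number `attempt` (1-based).

    The delay doubles with every attempt and is capped at `cap`.
    All default values here are hypothetical.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Bound the exponent so very large attempt numbers cannot overflow a float.
    exponent = min(attempt - 1, 32)
    delay = min(cap, base * (2 ** exponent))
    if jitter:
        # "Full jitter": pick a random delay in [0, delay] to spread retries out.
        delay = random.uniform(0.0, delay)
    return delay


# Delays for the first nine retries with a 30 second base and a 1 hour cap.
schedule = [backoff_delay(n, base=30.0, cap=3600.0) for n in range(1, 10)]
```

A schedule like this spaces retries further and further apart, but on its own it never stops retrying an endpoint that cannot succeed.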

This mechanism could not recognise that misconfigured webhooks would never succeed, so it kept resending failed messages that should not have been retried.

High load from the affected customer caused messages to build up on the retry queue. This had a two-fold impact:

  1. This load started to delay retries for other customers.
  2. The high number of “in flight” messages reached the maximum allowable limit on the backing queue service, which caused webhook message processors to fail and restart.

Both issues increased latency between issuing and delivering webhook events.

Remedies

We took a number of immediate steps:

  1. We adjusted the rate limit for the affected customer to prevent their usage pattern from impacting other customers.
  2. We disabled the misconfigured webhooks and worked with the affected customer to fix their integration.
  3. We increased the limit on “in flight” messages on the backing queue service, in line with behaviour observed during this incident.
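
For readers unfamiliar with per-customer rate limiting, the sketch below shows one common approach, a token bucket kept separately for each customer. It is not a description of our implementation: the class names, the rate and the burst capacity are all hypothetical.

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerCustomerLimiter:
    """Keeps one bucket per customer so one customer cannot use another's budget."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, customer_id: str, now: float) -> bool:
        bucket = self.buckets.get(customer_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[customer_id] = bucket
        return bucket.allow(now)


# Hypothetical limits: 1 request per second with a burst of 2.
limiter = PerCustomerLimiter(rate=1.0, capacity=2.0)
burst = [limiter.allow("customer_a", now=0.0) for _ in range(3)]
other = limiter.allow("customer_b", now=0.0)
```

Here `now` is passed in explicitly to keep the example deterministic; a real service would read a monotonic clock instead.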

As well as the immediate steps we took above, we will:

  1. Add a circuit-breaking mechanism that stops retrying webhooks that are consistently failing and automatically notifies the affected customer.
  2. Limit the total number of webhooks each customer can configure.
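
A minimal sketch of what such a circuit breaker could look like is shown below. The design is not final, so every name here, the consecutive-failure threshold and the notification callback are assumptions made for illustration.

```python
from typing import Callable, Dict, Set


class WebhookCircuitBreaker:
    """Stops retries for a webhook after too many consecutive failures."""

    def __init__(self, failure_threshold: int, notify: Callable[[str], None]):
        self.failure_threshold = failure_threshold
        self.notify = notify                      # called once when a circuit opens
        self.consecutive_failures: Dict[str, int] = {}
        self.open_webhooks: Set[str] = set()

    def should_attempt(self, webhook_id: str) -> bool:
        # An open circuit means: do not send or retry for this webhook.
        return webhook_id not in self.open_webhooks

    def record_success(self, webhook_id: str) -> None:
        self.consecutive_failures[webhook_id] = 0

    def record_failure(self, webhook_id: str) -> None:
        count = self.consecutive_failures.get(webhook_id, 0) + 1
        self.consecutive_failures[webhook_id] = count
        if count >= self.failure_threshold and webhook_id not in self.open_webhooks:
            self.open_webhooks.add(webhook_id)
            self.notify(webhook_id)

    def reset(self, webhook_id: str) -> None:
        # For example, after the customer has fixed their configuration.
        self.open_webhooks.discard(webhook_id)
        self.consecutive_failures[webhook_id] = 0


notified = []
breaker = WebhookCircuitBreaker(failure_threshold=3, notify=notified.append)
attempts = 0
for _ in range(5):
    if breaker.should_attempt("wh_1"):
        attempts += 1
        breaker.record_failure("wh_1")   # simulate a delivery that always fails
```

In this example the third consecutive failure opens the circuit, so the last two iterations make no delivery attempt and the customer is notified exactly once.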
Posted Jul 01, 2019 - 13:07 UTC

Resolved
This issue is now resolved and webhook delivery has returned to normal.
Posted Apr 20, 2019 - 13:47 UTC
Monitoring
We have implemented a fix for this issue. We are monitoring closely to make sure this issue has been resolved and everything is working as expected.

Please bear with us while we get this feature back on its feet; we appreciate your patience during this incident.
Posted Apr 20, 2019 - 13:31 UTC
Identified
The issue has been identified and a fix is being implemented.

We will provide a further update at 14:30 UTC.
Posted Apr 20, 2019 - 13:16 UTC
Update
We are still investigating to determine the cause of instability.
We will update at 14:15 UTC.
Posted Apr 20, 2019 - 13:02 UTC
Investigating
We're currently experiencing issues that are negatively impacting the speed of webhook delivery. This is particularly impacting webhook message retries.

We will update at 14:00 UTC.
Posted Apr 20, 2019 - 12:45 UTC
This incident affected: Europe (onfido.com) (API).