Summary
A customer with misconfigured webhooks started sending an anomalously high number of requests, causing our webhook retry queue to back up. This resulted in high webhook latency across all customers for just over an hour. Webhook retries (webhooks whose first or a subsequent delivery attempt had failed) were the worst affected.
Timeline
- 20th April 13:28 BST: our on-call engineers were alerted that the webhook queue size was above normal levels and webhook notifications were delayed
- 20th April 13:45 BST: after triaging, we updated status.onfido.com to alert customers of the issue
- 20th April 14:06 BST: after investigation, the on-call engineers purged the webhook retry queue
- 20th April 14:07 BST: webhook latency returned to normal and our alerts resolved
- 20th April 14:40 BST: we identified that the growth of the webhook retry queue was caused by a single customer with misconfigured webhooks. We adjusted the rate limit for that customer and contacted them to help fix their configuration.
- 20th April 14:47 BST: we updated status.onfido.com to mark the incident as resolved
Root Cause
Our webhook service has an exponential back-off retry mechanism to ensure messages are delivered in the case of network errors or downtime in our customers’ servers.
This mechanism could not recognise that deliveries to misconfigured webhooks would never succeed, so it kept resending messages that had no chance of being delivered.
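To illustrate the gap, here is a minimal sketch of exponential back-off with an explicit give-up condition. It is an illustration only and does not describe our production service: the function names, the delay values, the attempt cap and the choice to treat most 4xx responses as permanent failures are all assumptions.

```python
from typing import Optional

# All values below are hypothetical and chosen only for illustration.
BASE_DELAY_SECONDS = 30.0
MAX_DELAY_SECONDS = 3600.0
MAX_ATTEMPTS = 5


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (1-based): doubles each time, up to a cap."""
    return min(BASE_DELAY_SECONDS * 2 ** (attempt - 1), MAX_DELAY_SECONDS)


def next_retry_delay(attempt: int, status_code: Optional[int]) -> Optional[float]:
    """Return the seconds to wait before the next retry, or None to stop retrying.

    `status_code` is None when the request failed with a network error.
    """
    if attempt >= MAX_ATTEMPTS:
        return None  # retry budget exhausted
    # Assumption: a 4xx response (other than 408 and 429) will not succeed on retry.
    if status_code is not None and 400 <= status_code < 500 and status_code not in (408, 429):
        return None
    return backoff_delay(attempt)
```

Without a stop condition of this kind, every failed message returns to the retry queue after each attempt, however long the delay between attempts grows.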
High load from the affected customer caused messages to build up on the retry queue. This had a two-fold impact:
- This load started to delay retries for other customers.
- The number of “in flight” messages reached the limit allowed by the backing queue service, which caused webhook message processors to fail and restart.
Both issues increased latency between issuing and delivering webhook events.
Remedies
We took a number of immediate steps:
- We adjusted the rate limit for the affected customer to prevent their usage pattern from impacting other customers.
- We disabled the misconfigured webhooks and worked with the affected customer to fix their integration.
- We increased the limit on “in flight” messages on the backing queue service, in line with behaviour observed during this incident.
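A per-customer rate limit such as the one adjusted in the first step can be pictured as one token bucket per customer. The sketch below is a hedged illustration, not our actual limiter: `TokenBucket`, `PerCustomerLimiter` and their parameters are hypothetical names.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._now = now
        self._last = now()

    def allow(self) -> bool:
        current = self._now()
        # Refill according to elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (current - self._last) * self.rate)
        self._last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerCustomerLimiter:
    """Keeps a separate bucket per customer so one customer cannot use another's budget."""

    def __init__(self, rate: float, capacity: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._now = now
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, customer_id: str) -> bool:
        bucket = self._buckets.get(customer_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._now)
            self._buckets[customer_id] = bucket
        return bucket.allow()
```

Because each customer draws from their own bucket, a burst from one customer exhausts only that customer's tokens.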
As well as the immediate steps we took above, we will:
- Add a circuit-breaking mechanism that stops retrying webhooks that are constantly failing and automatically notifies the affected customer.
- Limit the total number of webhooks each customer can configure.
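The circuit-breaking mechanism described above could take roughly the following shape. This is a sketch under stated assumptions rather than the planned design: the class name, the consecutive-failure threshold and the `notify` callback are hypothetical.

```python
from typing import Callable, Dict, Set


class WebhookCircuitBreaker:
    """Stops deliveries to a webhook after a run of consecutive failures."""

    def __init__(self, failure_threshold: int, notify: Callable[[str], None]) -> None:
        self.failure_threshold = failure_threshold  # hypothetical tuning parameter
        self.notify = notify                        # hypothetical customer-notification hook
        self._failures: Dict[str, int] = {}
        self._open: Set[str] = set()

    def allows(self, webhook_id: str) -> bool:
        """True if deliveries and retries to this webhook should still be attempted."""
        return webhook_id not in self._open

    def record_success(self, webhook_id: str) -> None:
        # Any success resets the consecutive-failure count.
        self._failures.pop(webhook_id, None)

    def record_failure(self, webhook_id: str) -> None:
        if webhook_id in self._open:
            return  # already tripped; do not notify again
        count = self._failures.get(webhook_id, 0) + 1
        self._failures[webhook_id] = count
        if count >= self.failure_threshold:
            self._open.add(webhook_id)
            self.notify(webhook_id)  # tell the customer once, when the breaker trips

    def reset(self, webhook_id: str) -> None:
        """Re-enable a webhook, for example after the customer fixes their configuration."""
        self._open.discard(webhook_id)
        self._failures.pop(webhook_id, None)
```

A delivery worker would check `allows()` before each attempt and skip enqueueing retries for a tripped webhook, so a permanently failing endpoint stops adding to the retry queue.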