Summary
Between 03:00 UTC and 05:20 UTC on 28th March 2024, a large increase in duplicate webhook notifications caused the webhook queue backlog to grow.
Webhook events created between 05:20 UTC and 09:37 UTC stalled because the message queue was completely congested, so they were not delivered when created. Delivery of these queued webhooks resumed at 12:10 UTC, and the backlog was fully cleared by 13:30 UTC.
All new events from 09:37 UTC onwards were processed as normal.
Root Causes
We reached a quota imposed by our cloud provider (AWS) on the number of in-flight messages in a queue. This was caused by a combination of:
- An unusually high volume of duplicate webhook messages; and
- The configuration of our retry strategy.
All webhook messages were being processed via a single queue; to overcome the quota limit, a secondary queue was required to allow new messages to be processed.
The retry process was preventing previously queued messages from being cleared; hence, retries were temporarily suspended to resume processing.
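As an illustration of the quota pressure described above, the following is a minimal sketch of a headroom check, not our production monitoring. The function name, the `QuotaStatus` type and the 80% alert ratio are hypothetical. It assumes the in-flight count is read from the queue service separately (on Amazon SQS, for example, the `ApproximateNumberOfMessagesNotVisible` queue attribute reports in-flight messages) and that the applicable quota is supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaStatus:
    in_flight: int
    quota: int
    utilisation: float
    should_alert: bool


def check_in_flight_quota(in_flight: int, quota: int, alert_ratio: float = 0.8) -> QuotaStatus:
    """Compare the in-flight message count with the provider quota.

    alert_ratio is an assumed threshold (80% by default), not a value
    taken from the incident.
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    utilisation = in_flight / quota
    return QuotaStatus(in_flight, quota, utilisation, utilisation >= alert_ratio)
```

A check like this alerts while there is still headroom, rather than when deliveries have already stopped.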
Timeline
- 28/03/2024 03:00 UTC: Webhook events began to get congested, leading to a gradual degradation in webhook processing.
- 28/03/2024 04:08 UTC: Monitoring alerted on-call to a build-up in queued webhook events, and an engineer began investigating.
- 28/03/2024 04:20-04:55 UTC: The webhook service was scaled up, but this did not resolve the problem.
- 28/03/2024 05:20 UTC: The in-flight message quota was reached, and newly created events stopped being delivered.
- 28/03/2024 08:25 UTC: AWS were contacted to request an increase to the in-flight message quota.
- 28/03/2024 08:30 UTC: It was determined that a secondary queue, inheriting an increased quota, would be required to unblock new webhook deliveries.
- 28/03/2024 09:37 UTC: Secondary queue deployed, unblocking new webhook deliveries; meanwhile, engineers continued to work on a solution for events generated between 05:20 and 09:37.
- 28/03/2024 12:10 UTC: Retries were temporarily suspended so that the queued webhook events generated between 05:20 and 09:37 could be delivered.
- 28/03/2024 13:30 UTC: All queued webhooks cleared.
Remedies
- Additional monitoring to alert on-call when approaching queue quota limits (DONE).
- Move webhook retries to a dedicated secondary queue to avoid blocking new events from being processed (ETA: April 2024).
- Introduce filtering to discard duplicate messages, to avoid unnecessary queue growth (ETA: April 2024).
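The last two remedies above can be sketched roughly as follows. This is a simplified, in-memory illustration under stated assumptions, not the planned implementation: the queue names `webhooks-primary` and `webhooks-retry`, the `DuplicateFilter` class and the idea of keying duplicates on an event ID are all hypothetical, and a real deployment would likely need a shared store rather than per-process memory.

```python
from collections import OrderedDict
from typing import Optional


class DuplicateFilter:
    """Remembers recently seen event IDs and rejects repeats."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_entries = max_entries

    def is_new(self, event_id: str) -> bool:
        if event_id in self._seen:
            return False
        self._seen[event_id] = None
        if len(self._seen) > self._max_entries:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True


def choose_queue(attempt: int) -> str:
    """First attempts go to the primary queue, retries to a separate one."""
    return "webhooks-primary" if attempt == 0 else "webhooks-retry"


def route(event_id: str, attempt: int, dedupe: DuplicateFilter) -> Optional[str]:
    # Retries are re-deliveries of an already accepted event, so only
    # first attempts are checked against the duplicate filter.
    if attempt == 0 and not dedupe.is_new(event_id):
        return None  # duplicate: discard rather than enqueue
    return choose_queue(attempt)
```

Keeping retries on their own queue means a backlog of failing deliveries consumes the retry queue's quota rather than the one new events depend on.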