On December 9, 2025, between 09:50 UTC and 11:36 UTC, our services in the EU region experienced a degradation that affected customers’ ability to list checks in the dashboard. During this period, webhook logs in the dashboard were incomplete or delayed, and the webhook resend feature was also degraded.
Furthermore, as a consequence of the ongoing incident, between 11:15 UTC and 11:38 UTC (23 minutes in total), an issue in the check creation service prevented our Classic clients from creating new checks in the EU region. In addition, both Classic and Studio clients experienced increased turnaround times (TaT) for ongoing checks and tasks.
The incident occurred because a shared indexing component became unavailable after reaching its capacity limits due to a workload spike from another internal system. Previous spikes with more than double the load, and of longer duration, had caused no issues.
We suspect that a timing issue with background maintenance (e.g., garbage collection) explains why this spike caused a failure. However, we lack sufficient historical telemetry to confirm this.
This overload caused the shared infrastructure to fail, which cascaded into disruptions across multiple dependent services and led to errors in client-facing operations.
09:50 UTC: Error rate and round-trip time increase for an indexing component
09:51 UTC: Our on-call team is notified about an increase in error rate
10:10 UTC: Investigation and actions begin to reduce non-essential traffic and speed recovery
11:08 UTC: Actions to reduce non-essential traffic completed
11:15 UTC: Shared caching infrastructure dependency becomes unavailable
11:15 UTC: Incident impacts check creation; no new checks succeed
11:18 UTC: We upscale the indexer because the actions to reduce load were insufficient
11:32 UTC: We upscale the shared caching infrastructure to resolve the check creation incident
11:36 UTC: Checks and webhook logs display correctly again in the dashboard; the webhook resend function still shows outdated information
11:39 UTC: Our on-call team confirms all traffic returned to normal
12:45 UTC: Upscale completed
13:05 UTC: We continue monitoring to ensure backlog items are processed correctly
13:20 UTC: Remaining backlog processed