Unable to create reports in EU

Incident Report for Onfido

Postmortem

Summary

On December 9, 2025, between 09:50 AM UTC and 11:36 AM UTC, our services in the EU region experienced a degradation that affected customers’ ability to list checks in the dashboard. During this period, the webhook logs displayed in the dashboard were also impacted, leading to incomplete or delayed visibility, and the webhook resend feature was similarly degraded.

Furthermore, as a consequence of the ongoing incident, between 11:15 AM UTC and 11:38 AM UTC (23 minutes in total), an issue in the check creation service prevented any new checks from being created for our Classic clients in the EU region. In addition, both Classic and Studio clients experienced increased turnaround times (TaT) for ongoing checks and tasks.

Root Causes

The incident occurred because a shared indexing component became unavailable after reaching its capacity limits due to a workload spike from another internal system. Earlier spikes, with more than double the load and longer durations, had caused no issues.

We suspect a timing issue with background maintenance (e.g., garbage collection) caused this. However, we lack sufficient historical telemetry to confirm.

This overload caused the shared infrastructure to fail, which cascaded into disruptions across multiple dependent services and led to errors in client-facing operations.

Timeline

09:50 UTC: Error rate and round trip time increased for an indexing component

09:51 UTC: Our on-call team is notified about an increase in error rate

10:10 UTC: Investigation and actions begin to reduce non-essential traffic and speed recovery

11:08 UTC: Actions to reduce non-essential traffic completed

11:15 UTC: Shared caching infrastructure dependency becomes unavailable

11:15 UTC: Incident impacts check creation; no new checks succeed

11:18 UTC: Indexer upscale begins, as the actions to reduce load were insufficient

11:32 UTC: Shared caching infrastructure upscaled to resolve the check creation incident

11:36 UTC: Checks and webhook logs are displayed correctly again in the dashboard; the webhook resend function still shows outdated information

11:39 UTC: Our on-call team confirms all traffic returned to normal

12:45 UTC: Upscale completed

13:05 UTC: Monitoring continues to ensure backlog items are processed correctly

13:20 UTC: Remaining backlog processed

Remedies

  • Introduce rate limiting, review timeouts, and change the query access patterns sent to the indexing service to reduce the risk of excessive load events.
  • Extend our monitoring of the shared caching infrastructure to anticipate resource capacity exhaustion.
  • Review and reduce the check creation service's dependency on the shared caching infrastructure.
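
As an illustration of the first remedy, the following is a minimal sketch of a token-bucket rate limiter that a caller could place in front of requests to a shared service. It is not the implementation used in our systems; the class name TokenBucket, its parameters, and the numbers in the usage example are assumptions chosen for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical limiter: allows `rate` requests per second on average,
    with bursts of up to `capacity` requests. Not thread-safe."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if available, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Assumed limits for illustration only: 2 requests/s, burst of 3.
    bucket = TokenBucket(rate=2.0, capacity=3.0)
    for i in range(5):
        if bucket.try_acquire():
            print(f"request {i}: sent")
        else:
            print(f"request {i}: rejected, retry later")
```

A caller that receives False would typically drop or delay non-essential requests, so that a workload spike from one internal system is less likely to exhaust a component shared with others.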
Posted Dec 19, 2025 - 16:49 UTC

Resolved

The incident has been resolved.

We'll share a public postmortem later.
Posted Dec 09, 2025 - 12:12 UTC

Monitoring

The API is now fully available. We are monitoring the situation.
Posted Dec 09, 2025 - 11:39 UTC

Identified

We have identified an issue in the infrastructure that handles report creation.
We are working on restoring it.
Posted Dec 09, 2025 - 11:15 UTC
This incident affected: Europe (onfido.com) (API).