Severe service disruption across the platform in EU

Incident Report for Onfido

Postmortem

Summary

On January 7th, 2026, between 13:05 UTC and 15:50 UTC, our EU region experienced severe instability impacting multiple products.

Impact Overview:

  • Autofill → Errors for new document requests (13:05 – 13:39)
  • Facial Similarity → Delays in report processing (13:05 – 14:16)
  • Documents API → Elevated error rates (13:05 – 15:50)
  • Platform Stability → Severe degradation during migrations (13:49 – 14:07, 14:34 – 15:15)

Root Causes

In September 2024 we suffered an incident, also in the EU region, caused by the overflow of an integer primary key in a table supporting our Applicants (Post Mortem). As remediation for that incident, we:

  • immediately put alerting in place for other tables experiencing similar growth across our systems;
  • scheduled proactive table migrations to avoid other similar disruptions of service.
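
The alerting described above can be sketched as a simple headroom check. The 20% threshold, the signed 32-bit limit, and the table values below are illustrative assumptions, not Onfido's actual monitoring configuration:

```python
# Hypothetical sketch: alert when an integer primary key nears its maximum.
# Assumes a signed 32-bit column (typical for a Postgres "integer" PK);
# the threshold and the observed values are made up for illustration.

INT32_MAX = 2**31 - 1
ALERT_THRESHOLD = 0.20  # alert when less than 20% of the key space remains

def remaining_headroom(current_max_id: int, limit: int = INT32_MAX) -> float:
    """Fraction of the key space still unused."""
    return (limit - current_max_id) / limit

def tables_to_alert(table_max_ids: dict[str, int]) -> list[str]:
    """Return tables whose primary key has crossed the alert threshold."""
    return [
        table
        for table, max_id in table_max_ids.items()
        if remaining_headroom(max_id) < ALERT_THRESHOLD
    ]

# Example with made-up values: only "documents" is within 20% of the limit.
observed = {"documents": 1_900_000_000, "applicants": 400_000_000}
print(tables_to_alert(observed))
```

In practice such a check would read `MAX(id)` (or the sequence's current value) from each monitored table on a schedule, rather than from a hard-coded dictionary.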

As part of those scheduled migrations, during 2025 we migrated multiple tables that store information about Documents. However, this work was inadvertently incomplete: we missed some tables that held references to Document identifiers. Those tables were managed by a different subsystem without explicit foreign key constraints, which made the dependency invisible during our audit.

On the 7th, as predicted by our monitoring, another primary key, whose table had already been migrated, exceeded the former maximum value. Thanks to that proactive work, this core area of our system experienced no disruption. However, as the larger identifiers propagated internally through our systems, we began to see cascading failures in downstream areas that had not been migrated.
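
This failure mode can be illustrated with a toy consumer that still assumes 32-bit identifiers: it works for every ID below the old limit and breaks on the first ID above it. The serialization format here is hypothetical, not Onfido's actual wire format:

```python
# Illustration only: a downstream component that packs IDs into a fixed
# 4-byte signed field fails as soon as upstream IDs cross 2**31 - 1.
import struct

def encode_document_ref(document_id: int) -> bytes:
    """Pack a document ID into a 4-byte signed field (legacy assumption)."""
    return struct.pack(">i", document_id)  # ">i" = big-endian signed 32-bit

old_id = 2**31 - 1   # last ID that fits in a signed 32-bit field
new_id = 2**31       # first ID past the old limit

encode_document_ref(old_id)  # succeeds
try:
    encode_document_ref(new_id)
except struct.error as exc:
    print(f"downstream failure: {exc}")
```

The same shape of failure occurs wherever a 32-bit assumption survives: a column type, a serialized field, or a client-side integer cast.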

While we were addressing the ongoing issues, one of our mitigations worsened the impact and caused a full outage (14:34 – 15:15). A column migration was missing an index, which made its queries inefficient. Those queries ran against the writer database, driving CPU usage high enough to make the writer instance unresponsive. To mitigate, we upscaled a read replica and triggered a database failover to restore stability.
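
The cost of the missing index can be shown with a self-contained demo: the same lookup is a full table scan without an index and an index search with one. This uses SQLite for portability; the production database, schema, and column names are different:

```python
# Demo: EXPLAIN QUERY PLAN for the same lookup, before and after indexing.
# Table and column names are illustrative, not the actual schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE document_refs (id INTEGER PRIMARY KEY, new_document_id INTEGER)"
)
db.executemany(
    "INSERT INTO document_refs (new_document_id) VALUES (?)",
    [(i,) for i in range(10_000)],
)

def plan(sql: str) -> str:
    """Return SQLite's query-plan description for a statement."""
    return " ".join(row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM document_refs WHERE new_document_id = 9999"

scan_plan = plan(query)        # no index: every row is scanned
print(scan_plan)

db.execute("CREATE INDEX idx_refs_new_doc ON document_refs (new_document_id)")
indexed_plan = plan(query)     # with index: a direct search
print(indexed_plan)
```

At small scale both plans are fast; at production scale, repeated scans against the writer instance translate directly into the CPU saturation described above.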

Timeline (UTC)

13:05 - Multiple monitors triggered on-call due to errors in Autofill requests, Facial Similarity reports, and Document uploads (small percentage of applicants affected).

13:10 - Root cause identified and the on-call team starts looking for the affected tables.

13:23 - Main tables impacted by the incident identified.

13:29 - Migration initiated for the table affecting the Autofill product.

13:39 - Migration completed; Autofill requests restored.

13:42 - Team applies same migration approach to table affecting Facial Similarity reports.

13:49 - Migration started for table affecting Document uploads.

14:07 - Widespread incident detected, related to the Document upload migration; migration canceled.

14:16 - Migration for Facial Similarity reports completed; report execution resumes.

14:34 - Team attempts alternative migration approach for Document upload table.

14:38 - New migration attempt causes severe service disruption.

14:50 - Failover to new upscaled replica.

15:15 - Migration completed successfully; processing resumes.

15:20 - High error rate persists on Documents API despite successful migrations.

15:40 - Surge of retry requests from multiple customers identified as the cause of the residual errors; infrastructure scaled up.

15:50 - Documents API fully recovered.

Remedies

We expected our mitigation measures from the last incident of this type (Sept 2024) to have prevented further similar disruptions. Unfortunately, those efforts proved insufficient. As a result, we have triggered an internal Code Yellow, with a dedicated squad fully auditing all of our systems to exhaustively identify any outstanding risks.

We will combine a metadata-driven approach (reviewing schemas, code, and dependencies) with a data-driven approach (validating live data and query patterns) to identify any hidden references or risks across our systems. Once identified, we will proactively migrate these areas and accelerate the migration of tables or components that may pose future risk to ensure long-term stability.
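
A minimal sketch of that combined audit, using SQLite so the example is self-contained: the metadata pass finds integer columns whose names suggest they reference other tables (a naming heuristic we assume here for illustration), and the data pass measures how much of the 32-bit key space each one has consumed. The schema, threshold, and naming convention are all assumptions, not the production setup:

```python
# Hypothetical audit sketch: metadata pass (find integer reference columns)
# plus data pass (check how close each is to the signed 32-bit limit).
import sqlite3

INT32_MAX = 2**31 - 1

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checks (id INTEGER PRIMARY KEY, document_id INTEGER)")
db.execute("CREATE TABLE audits (id INTEGER PRIMARY KEY, note TEXT)")
db.execute("INSERT INTO checks (document_id) VALUES (2100000000)")

def audit_integer_references(conn: sqlite3.Connection, suffix: str = "_id"):
    """Flag integer reference columns that have used >90% of the key space."""
    findings = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        for _, name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
            # Metadata pass: integer columns that look like foreign references.
            if col_type.upper() == "INTEGER" and name.endswith(suffix):
                # Data pass: how much of the 32-bit key space is consumed?
                (max_val,) = conn.execute(
                    f"SELECT MAX({name}) FROM {table}").fetchone()
                if max_val is not None and max_val / INT32_MAX > 0.9:
                    findings.append((table, name, max_val))
    return findings

print(audit_integer_references(db))
```

The data pass matters precisely because, as this incident showed, not every reference is declared as a foreign key: schema metadata alone can miss a dependency that live data reveals.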

Additionally, we will proactively simulate integer overflow scenarios across multiple critical areas of our system that are approaching growth thresholds. These simulations will help us test system behavior under transition conditions, uncover hidden dependencies, and ensure that our mitigation strategies (such as migrations, indexing, and failover procedures) are effective before real limits are reached.
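
One way to run such a simulation, sketched here under stated assumptions: in a disposable test database, pre-seed the key sequence just below the old limit, then drive inserts across the boundary and observe how consumers behave. SQLite rowids are 64-bit, so the inserts themselves succeed; the value of the exercise is sending post-boundary IDs through the rest of the pipeline before production gets there:

```python
# Hypothetical overflow drill: seed IDs just below 2**31 - 1 in a test
# database, then insert across the boundary. This is a sketch of the
# technique, not the actual test harness.
import sqlite3

INT32_MAX = 2**31 - 1

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, ref TEXT)")

# Seed so the next auto-assigned ID sits right at the 32-bit boundary.
db.execute("INSERT INTO documents (id, ref) VALUES (?, ?)",
           (INT32_MAX - 1, "seed"))

ids = []
for n in range(3):  # inserts that cross the old limit
    cur = db.execute("INSERT INTO documents (ref) VALUES (?)", (f"doc-{n}",))
    ids.append(cur.lastrowid)

print(ids)  # IDs continue past the 32-bit boundary
```

Feeding these boundary-crossing IDs into downstream consumers in a staging environment is what exposes the hidden 32-bit assumptions before real traffic does.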

We have updated our runbooks to include safeguards and best practices learned from this incident, ensuring that future emergency migrations are executed in a way that minimizes risk and avoids worsening the impact.

Posted Jan 14, 2026 - 09:41 UTC

Resolved

This incident has been resolved.

We apologise for the service disruption.

We'll share a public post-mortem later.
Posted Jan 07, 2026 - 16:21 UTC

Update

We continue to monitor the recovery and work on processing reports delayed during the incident.
Posted Jan 07, 2026 - 16:00 UTC

Update

Processing has recovered across the platform.
Posted Jan 07, 2026 - 15:57 UTC

Update

Multiple products are stable, but some still see degraded performance.
Posted Jan 07, 2026 - 15:45 UTC

Update

Further recovery steps have been taken to stabilize the platform.
Posted Jan 07, 2026 - 15:43 UTC

Update

We are recovering, but still have high error rates. We continue to investigate residual issues.
Posted Jan 07, 2026 - 15:29 UTC

Monitoring

A fix has been implemented and we are monitoring.
Posted Jan 07, 2026 - 14:28 UTC

Identified

We have identified the issue and we are working on a fix.
Posted Jan 07, 2026 - 13:39 UTC

Investigating

We are currently investigating an issue that is causing service disruption across multiple products.
Posted Jan 07, 2026 - 13:13 UTC
This incident affected: Europe (onfido.com) (API, Dashboard, Applicant Form, Document Verification, Facial Similarity, Watchlist, Identity Enhanced, Webhooks, Known faces, Autofill, QES, Device Intelligence).