On January 7th, 2026, between 13:05 UTC and 15:50 UTC, our EU region experienced severe instability impacting multiple products.
Impact Overview:
In September 2024 we suffered an incident, also in the EU region, caused by the overflow of an integer primary key in a table supporting our Applicants (Post Mortem). As remedies to that incident we:
As part of those scheduled migrations, during 2025, we migrated multiple tables that store information about Documents. However, this work was inadvertently incomplete, since we missed some tables that held references to Document identifiers. These tables were managed by a different subsystem without explicit foreign key constraints, making the dependency invisible during our audit.
On the 7th, as predicted by our monitoring, another primary key, which had already been migrated, exceeded the former maximum limit. Thanks to that proactive work, this core area of our system didn’t experience any disruption. However, as those larger integers propagated internally through our systems, we started seeing a few cascading failures in downstream areas.
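To illustrate why identifiers above the former limit can break components that were not migrated, here is a minimal sketch. It assumes the former limit was the signed 32-bit maximum (2,147,483,647) and that migrated columns are 64-bit; the report does not state either explicitly, and the function names are hypothetical.

```python
import struct

# Assumption: the "former maximum limit" is the signed 32-bit maximum.
INT32_MAX = 2**31 - 1


def fits_in_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit integer."""
    try:
        struct.pack("<i", value)  # "<i" is a little-endian signed 32-bit int
        return True
    except struct.error:
        return False


def fits_in_int64(value: int) -> bool:
    """Return True if value can be stored in a signed 64-bit integer."""
    try:
        struct.pack("<q", value)  # "<q" is a little-endian signed 64-bit int
        return True
    except struct.error:
        return False


first_id_past_limit = INT32_MAX + 1
print(fits_in_int32(INT32_MAX))            # still representable
print(fits_in_int32(first_id_past_limit))  # rejected by a 32-bit consumer
print(fits_in_int64(first_id_past_limit))  # accepted by a 64-bit consumer
```

Under these assumptions, the migrated table keeps accepting new identifiers, while any downstream column or field still typed as 32-bit rejects the first value past the boundary.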
While we were trying to address the ongoing issues, our mitigations worsened the impact and caused a full outage (14:34 to 15:15). A column migration was missing an index, which caused inefficient queries. These queries ran against the writer database, driving excessive CPU usage and making the writer instance unresponsive. To mitigate, we scaled up a read replica and triggered a database failover to restore stability.
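The effect of a missing index on lookups can be sketched with an in-memory SQLite database. This is only a stand-in: the report does not name the database engine, and the table, column, and index names below are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_refs (id INTEGER PRIMARY KEY, document_id INTEGER)"
)
conn.executemany(
    "INSERT INTO document_refs (document_id) VALUES (?)",
    [(i,) for i in range(1000)],
)

lookup = "SELECT id FROM document_refs WHERE document_id = ?"

# Without an index, the lookup has to scan the whole table.
plan_without_index = query_plan(conn, lookup, (42,))

conn.execute(
    "CREATE INDEX idx_document_refs_document_id ON document_refs (document_id)"
)

# With the index in place, the planner can search it directly.
plan_with_index = query_plan(conn, lookup, (42,))

print(plan_without_index)
print(plan_with_index)
```

The exact plan text varies between SQLite versions, but the first plan reports a table scan and the second reports a search using the index. Comparing plans like this before running a migration against a production writer is one way to catch a missing index early.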
13:05 - Multiple monitors triggered on-call due to errors in Autofill requests, Facial Similarity reports, and Document uploads (small percentage of applicants affected).
13:10 - Root cause identified and the on-call team starts looking for the affected tables.
13:23 - Main tables impacted by the incident identified.
13:29 - Migration initiated for the table affecting Autofill product.
13:39 - Migration completed; Autofill requests restored.
13:42 - Team applies same migration approach to table affecting Facial Similarity reports.
13:49 - Migration started for table affecting Document uploads.
14:07 - Widespread incident detected, related to the Document upload migration; migration canceled.
14:16 - Migration for Facial Similarity reports completed; report execution resumes.
14:34 - Team attempts alternative migration approach for Document upload table.
14:38 - New migration attempt causes severe service disruption.
14:50 - Failover to new upscaled replica.
15:15 - Migration completed successfully; processing resumes.
15:20 - High error rate persists on Documents API despite successful migrations.
15:40 - Surge of retry requests from multiple customers identified as the cause of the persisting errors; infrastructure scaled up.
15:50 - Documents API fully recovered.
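The retry surge noted at 15:40 is a common pattern after an outage. As a hedged illustration only, and not a description of how any customer integration or our own clients behave, the sketch below shows capped exponential backoff with full jitter, a widely used way to spread retries out over time. The function name and default values are hypothetical.

```python
import random


def backoff_delays(max_attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so retries from many clients do not all arrive at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Example: the delays a single client would wait before each of 6 retries.
example_delays = backoff_delays(6, rng=random.Random(0))
print([round(d, 2) for d in example_delays])
```

A client would sleep for each delay in turn before retrying, and give up once the list is exhausted.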
We expected the mitigation measures from the last incident of this type (Sept 2024) to prevent further similar disruptions. Unfortunately, those efforts proved insufficient. As a result, we have triggered an internal Code Yellow, with a dedicated squad tasked with fully auditing all our systems to exhaustively identify any outstanding unknown risks.
We will combine a metadata-driven approach (reviewing schemas, code, and dependencies) with a data-driven approach (validating live data and query patterns) to identify any hidden references or risks across our systems. Once identified, we will proactively migrate these areas and accelerate the migration of tables or components that may pose future risk to ensure long-term stability.
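The metadata-driven part of such an audit could look roughly like the sketch below. It assumes column metadata has already been exported into simple records (for example from a catalogue view such as `information_schema.columns`) and that hidden references follow an `<entity>_id` naming convention. The table names, type names, and the naming heuristic are all hypothetical, and this is not a description of our actual tooling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    table: str
    name: str
    data_type: str


# Assumption: these type names denote integers narrower than 64 bits.
NARROW_TYPES = {"smallint", "int", "int4", "integer"}


def find_suspect_references(columns, migrated_entities):
    """Flag narrow integer columns whose name suggests they reference
    an entity whose primary key has already been widened."""
    if not migrated_entities:
        return []
    names = "|".join(re.escape(e.lower()) for e in migrated_entities)
    pattern = re.compile(r"(^|_)(%s)_id$" % names)
    return [
        c for c in columns
        if c.data_type.lower() in NARROW_TYPES and pattern.search(c.name.lower())
    ]


columns = [
    Column("documents", "id", "bigint"),
    Column("autofill_results", "document_id", "integer"),
    Column("reports", "source_document_id", "int4"),
    Column("uploads", "document_id", "bigint"),
    Column("applicants", "age", "integer"),
]

suspects = find_suspect_references(columns, ["document"])
for c in suspects:
    print(f"{c.table}.{c.name} is {c.data_type}")
```

A name-based scan like this misses columns that do not follow the convention, which is why it would need to be paired with the data-driven checks on live values and query patterns.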
Additionally, we will proactively simulate integer overflow scenarios across multiple critical areas of our system that are approaching growth thresholds. These simulations will help us test system behavior under transition conditions, uncover hidden dependencies, and ensure that our mitigation strategies (such as migrations, indexing, and failover procedures) are effective before real limits are reached.
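A simulation of this kind can be sketched as a small harness that feeds identifiers around the boundary into each component and records which ones fail. The sketch assumes a signed 32-bit limit, and the two example handlers are hypothetical stand-ins for real downstream code paths.

```python
import struct

# Assumption: the threshold being simulated is the signed 32-bit maximum.
INT32_MAX = 2**31 - 1


def boundary_ids(limit=INT32_MAX):
    """Identifiers just below, at, and above the limit."""
    return [limit - 1, limit, limit + 1, limit * 2]


def simulate_overflow(handlers, limit=INT32_MAX):
    """Call every handler with each boundary id and collect failures.

    Returns a dict mapping handler name to a list of
    (value, exception type name) pairs for the calls that raised.
    """
    failures = {}
    for name, handler in handlers.items():
        for value in boundary_ids(limit):
            try:
                handler(value)
            except Exception as exc:
                failures.setdefault(name, []).append((value, type(exc).__name__))
    return failures


def legacy_handler(document_id):
    # Hypothetical consumer that still stores ids as 32-bit integers.
    return struct.pack("<i", document_id)


def migrated_handler(document_id):
    # Hypothetical consumer that already stores ids as 64-bit integers.
    return struct.pack("<q", document_id)


failures = simulate_overflow(
    {"legacy": legacy_handler, "migrated": migrated_handler}
)
for name, items in failures.items():
    print(name, [value for value, _ in items])
```

In this example only the legacy handler fails, and only for the two values above the limit, which is the kind of signal that would point at a component needing migration before real identifiers reach the threshold.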
We have updated our runbooks to include safeguards and best practices learned from this incident, ensuring that future emergency migrations are executed in a way that minimizes risk and avoids worsening the impact.