Between 11:47 and 12:10 UTC, 55% of document reports could not be processed due to a partial failure of the fraud detection service. From 12:10 onwards, traffic was processed as usual and we began rerunning the failed reports. Reports that required manual processing saw additional delays of up to 2 hours while the backlog was cleared.
A sudden spike in CPU usage on the impacted fraud detection service lasted a few minutes and caused errors that triggered retry policies. The service did not scale quickly enough to handle both the ongoing traffic and the retries, so a portion of report processing was halted and the affected reports were stored in dead-letter queues for processing once the system stabilized.
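For illustration, the sketch below shows the general retry-then-park pattern described above: a report whose fraud check fails is retried a few times and, if it still cannot be processed, is stored in a dead-letter queue to be replayed once the service has stabilized. The queue names, retry limit, and helper functions are hypothetical placeholders, not the actual implementation or queueing technology behind this system.

```python
# Minimal sketch of the retry + dead-letter-queue flow described above.
# All names and limits here are illustrative assumptions.
import queue
import random
import time

MAX_RETRIES = 3

work_queue = queue.Queue()         # incoming report-processing jobs
dead_letter_queue = queue.Queue()  # jobs parked after exhausting retries

def fraud_check(report):
    """Stand-in for the fraud detection call that errored under high CPU."""
    if random.random() < 0.5:  # simulate the partial-failure window
        raise RuntimeError("fraud detection service unavailable")
    return {"report": report, "fraud_score": 0.01}

def process_report(report, attempt=1):
    try:
        return fraud_check(report)
    except RuntimeError:
        if attempt < MAX_RETRIES:
            # Retry policy: back off briefly, then try again. During the
            # incident, these retries added load the service could not absorb.
            time.sleep(0.1 * attempt)
            return process_report(report, attempt + 1)
        # After exhausting retries, park the job instead of dropping it.
        dead_letter_queue.put(report)
        return None

def replay_dead_letters():
    """Rerun parked jobs once the service has stabilized (the 12:12 UTC step)."""
    parked = []
    while not dead_letter_queue.empty():
        parked.append(dead_letter_queue.get())
    for report in parked:
        process_report(report)

if __name__ == "__main__":
    for i in range(10):
        work_queue.put(f"report-{i}")
    while not work_queue.empty():
        process_report(work_queue.get())
    print(f"parked in DLQ: {dead_letter_queue.qsize()}")
    replay_dead_letters()
```

Parking exhausted jobs rather than dropping them is what allowed every failed report to be rerun from 12:12 UTC onwards.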
11:47 UTC: High CPU usage on the fraud detection service leads to errors
11:50 UTC: CPU usage is back to normal; reports that errored out are being retried
11:51 UTC: The fraud detection service fails to scale enough to handle both the normal traffic and the retries
11:51 UTC: The on-call team is alerted to a high error rate on the fraud detection service and starts investigating
12:09 UTC: The on-call team identifies the root cause and manually scales up the service
12:10 UTC: Errors have stopped and reports are processed normally
12:12 UTC: We start rerunning reports that failed during the incident
13:20 UTC: All reports that did not require manual review are now completed
13:50 UTC: All reports that required manual review have been processed
Review the autoscaling capabilities of the impacted service as well as other services that share a similar architecture.
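As a starting point for that review, a rough headroom check like the sketch below can flag services whose autoscaling ceiling cannot absorb normal traffic plus a retry burst of the kind seen between 11:50 and 11:51 UTC. All service names, throughput figures, and scaling limits in the example are placeholder assumptions, not measured values from this incident.

```python
# Back-of-the-envelope autoscaling headroom check. Every figure below is a
# placeholder assumption used only to illustrate the kind of review proposed.
from dataclasses import dataclass
from math import ceil

@dataclass
class ServiceScalingProfile:
    name: str
    normal_rps: float          # steady-state requests per second
    per_replica_rps: float     # throughput one replica can sustain
    max_replicas: int          # autoscaler ceiling
    retry_burst_factor: float  # extra load when failed requests are retried

def required_replicas(profile: ServiceScalingProfile) -> int:
    # Peak load = normal traffic plus the retry burst on top of it.
    peak_rps = profile.normal_rps * (1 + profile.retry_burst_factor)
    return ceil(peak_rps / profile.per_replica_rps)

def check_headroom(profile: ServiceScalingProfile) -> None:
    needed = required_replicas(profile)
    status = "OK" if needed <= profile.max_replicas else "UNDER-PROVISIONED"
    print(f"{profile.name}: needs {needed} replicas at peak, "
          f"max is {profile.max_replicas} -> {status}")

if __name__ == "__main__":
    # Hypothetical profiles for the impacted service and a similarly
    # architected one.
    for profile in [
        ServiceScalingProfile("fraud-detection", 400, 50, 10, 0.6),
        ServiceScalingProfile("document-classification", 250, 40, 6, 0.6),
    ]:
        check_headroom(profile)
```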