Delays in document check processing in the EU region
Incident Report for Onfido
Postmortem

Summary

At 6.06am UTC on 29 Aug 2023, one of the databases critical to our manual processing saw increased load. We continued accepting manual tasks, but their processing was significantly delayed.
The incident was mitigated at 7.32am UTC by rolling back non-critical changes that were putting load on the database. Manual processing then resumed with a backlog, which was cleared by 2.29pm UTC, closing the incident. Reports created between the start of the incident and the clearing of the backlog saw increased turnaround times, reaching up to an hour at the 95th percentile for checks created around 6.30am UTC.
Checks processed automatically were not impacted.
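
For readers less familiar with percentile figures, the short sketch below shows one common way a 95th-percentile turnaround time can be computed. It uses the nearest-rank method, which is an assumption on our part and not necessarily the method behind the figure above. The function name and the sample values are hypothetical and are not data from this incident.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value that is >= pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical turnaround times in minutes (illustrative only).
turnaround_minutes = [4, 5, 5, 6, 7, 8, 9, 10, 12, 14,
                      15, 18, 20, 25, 30, 35, 40, 45, 58, 90]

# 95% of these sample checks completed within this many minutes.
p95 = percentile_nearest_rank(turnaround_minutes, 95)
```

With these sample values, 19 of the 20 checks finish within the returned figure, so a single very slow check (90 minutes here) does not move the 95th percentile.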

Root Causes

We have not identified why the load unexpectedly increased, as it does not correspond to any increase in activity or data size. However, two things could have helped us resolve the incident faster.

First, the incident wasn't immediately routed to the responsible team, which delayed the investigation.
Second, the system responsible for feature activations did not let us deactivate features immediately, as we expected it to.

Timeline

All times UTC

06:04 UTC: The load on one of our databases increases significantly and unexpectedly.
06:11 UTC: The monitoring system raises an alert and on-call support is activated. The degradation in manual processing is identified and more engineers are called in to help. The team starts manually scaling up services to mitigate the issue.
06:28 UTC: The database load recovers without explanation.
06:52 UTC: The database load degrades again.
07:08 UTC: A non-critical query, introduced the day before and putting load on the database, is identified.
07:28 UTC: The change is rolled back and manual processing recovers to normal throughput. The backlog of tasks is cleared over the course of the day.
14:29 UTC: The backlog is fully cleared.

Remedies

First, we will continue investigating what caused the increased load on our database.

Second, we will review alerts to make sure the correct team is alerted first next time.

Third, we will conduct tests on our feature-flag framework to understand why it failed to deactivate features.

Posted Sep 05, 2023 - 14:34 UTC

Resolved
This issue has been resolved and the backlog has been processed. We apologize for any inconvenience this has caused.
Posted Aug 29, 2023 - 14:29 UTC
Monitoring
The fix for the Document Checks delays has worked and overall delays are decreasing. The backlog for tasks that use manual processing is estimated at 18 hours. We will update you once delays have completely subsided.
Posted Aug 29, 2023 - 07:37 UTC
Identified
Our team has identified the source of the issue with the Document Checks delays and is working to address the problem.
Posted Aug 29, 2023 - 07:30 UTC
Update
We are continuing to investigate this issue.
Posted Aug 29, 2023 - 07:16 UTC
Investigating
We're investigating reports of delays in completing Document Checks. We are working hard to make sure that completion times return to normal.
Posted Aug 29, 2023 - 07:12 UTC
This incident affected: Europe (onfido.com) (Document Verification).