After a change to an internal service used for document processing, we began seeing increased latency. The impact on automated reports was minimal, but the added latency put a client service responsible for creating manual tasks under stress, ultimately leading to slower manual processing turnaround times and a build-up of queued reports. After the problematic service was stabilized, it took some time to clear the report backlog.
A bug was introduced into an internal service used for document processing, and the source of the problem proved difficult to troubleshoot. Specifically, a routine feature release picked up a new API contract from a service it depends on, and that contract conflicted with existing validation logic. Legacy error-handling code suppressed the validation failures, which masked the root cause. Although the errors were never reported, the sheer volume of exceptions being raised and handled degraded the service's performance.
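To illustrate the failure mode, the sketch below is a minimal, hypothetical reconstruction (the service name, the renamed field, and the handler are assumptions, not our actual code): a broad legacy exception handler silently drops validation failures caused by a changed API contract, so nothing surfaces in error reporting, yet the cost of raising and handling an exception for nearly every document still accumulates as latency in the processing path.

```python
import logging
import time

logger = logging.getLogger("document_processor")  # hypothetical service name


class ValidationError(Exception):
    """Raised when a document payload does not match the expected contract."""


def validate_payload(payload: dict) -> None:
    # Hypothetical contract change: the dependency renamed 'doc_id' to
    # 'documentId', so every payload produced under the new contract fails here.
    if "doc_id" not in payload:
        raise ValidationError("missing required field 'doc_id'")


def process_documents(payloads: list[dict]) -> int:
    processed = 0
    for payload in payloads:
        try:
            validate_payload(payload)
            processed += 1
        except ValidationError:
            # Legacy behaviour: swallow the failure and move on. Nothing is
            # reported, so dashboards stay green, but raising and handling an
            # exception for (nearly) every document adds overhead that
            # compounds under load.
            logger.debug("validation failed; skipping document")
            continue
    return processed


if __name__ == "__main__":
    # Every payload uses the new field name, so every one raises and is suppressed.
    batch = [{"documentId": str(i)} for i in range(100_000)]
    start = time.perf_counter()
    done = process_documents(batch)
    print(f"processed {done} documents in {time.perf_counter() - start:.2f}s")
```

Under the old contract the exception path is rare, so the handler's cost is invisible; once nearly every payload trips validation, the service spends a meaningful share of its time constructing and unwinding exceptions, which is consistent with the latency we observed arriving without any corresponding error signal.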
12:10 UTC: A service deployment introduces increased latency, impacting downstream services.
12:24 UTC: An internal monitor alerts on an unusual number of manual task creation failures.
12:40 UTC: The service observed to be under stress is manually scaled up, alleviating the immediate pressure and resolving all issues related to automated processing.
13:00 UTC: A modest manual processing backlog has accumulated; latency continues to impact average turnaround times.
13:57 UTC: Sustained degradation of manual report turnaround times triggers a public incident report. Investigation into the root cause continues.
14:20 UTC: Most of the report backlog has been cleared. Turnaround times are back to normal levels.
14:35 UTC: The root cause is determined and linked to the earlier change. The service deployment is rolled back to a known stable state.