A component in the automatic processing pipeline for Document Reports experienced a spike of timeout errors from 9:05pm to 9:20pm UTC in the EU cluster.
All Document Reports created between 9:20pm and 9:40pm UTC were processed by manual analysts, resulting in a higher turnaround time (TaT).
Two faulty nodes in our production cluster temporarily slowed down the execution of a CPU-intensive component.
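For illustration, a per-node tail-latency comparison is one way such faulty nodes can be pinpointed. The sketch below uses hypothetical latency samples, node names, and thresholds; it is not the exact tooling we used.

```python
from statistics import median

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

# (node, request latency in ms) samples; values and node names are illustrative.
samples = [
    ("node-a", 120), ("node-a", 140), ("node-a", 135),
    ("node-b", 2400), ("node-b", 2100), ("node-b", 2600),
    ("node-c", 110), ("node-c", 150), ("node-c", 130),
]

by_node = {}
for node, ms in samples:
    by_node.setdefault(node, []).append(ms)

node_p99 = {node: p99(vals) for node, vals in by_node.items()}
baseline = median(node_p99.values())

# Flag nodes whose tail latency is far above the fleet median; 5x is illustrative.
suspects = [node for node, v in node_p99.items() if v > 5 * baseline]
print("suspect nodes:", suspects)  # -> ['node-b']
```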
9:21pm UTC: Elevated error rates for the affected component triggered an on-call alert.
9:28pm UTC: We identified two nodes in our cluster as the cause of the slow CPU-intensive executions.
9:33pm UTC: We restarted the two nodes (one way to take such nodes out of rotation is sketched after the timeline).
9:40pm UTC: The affected component recovered fully.
9:41pm UTC: We observed a backlog of reports and raised a public incident to inform customers of the expected time to clear it.
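For context, here is a minimal sketch of taking suspect nodes out of rotation before restarting them. It assumes a Kubernetes cluster and the official `kubernetes` Python client; the orchestrator and node names are assumptions, not details from this report.

```python
# Minimal sketch, not our exact remediation: cordon suspect nodes so no new
# work is scheduled on them, after which they can be drained and restarted.
# Assumes Kubernetes and the official `kubernetes` Python client; node names
# are hypothetical.
from kubernetes import client, config

SUSPECT_NODES = ["node-b", "node-e"]

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for name in SUSPECT_NODES:
    # Equivalent to `kubectl cordon <node>`: mark the node unschedulable.
    v1.patch_node(name, {"spec": {"unschedulable": True}})
    print(f"cordoned {name}; ready to drain and restart")
```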