Elevated error rate affecting report creation in the EU cluster
Incident Report for Onfido
Postmortem

Summary

On September 24th, 2024, from 10:33 UTC to 11:14 UTC, customers faced a very high error rate when creating new document reports.

Reports created before 10:33 UTC and non-document reports continued to be processed. For the duration of the incident, only 10% of new document report requests were accepted and processed.

Root Causes

At 10:33 UTC we started the release of an infrastructure change: a migration from a legacy deployment system to a newer, standard one.

The new configuration contained a typographical error that removed some permissions for an internal service. As a result, the service wasn’t able to interact with other infrastructure components. That service is in charge of handling documents, so every document media upload or download failed for our customers.
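
As an illustration of how this class of error can be caught before a release, the sketch below compares the permissions granted to each service in the old and new configurations and reports any that disappeared. It is a minimal sketch under stated assumptions, not the tooling involved in this incident: the function name, the service name, and the permission strings are all hypothetical, and configurations are assumed to be reducible to a mapping of service name to a set of permission strings.

```python
def removed_permissions(old, new):
    """Return {service: permissions present in `old` but missing from `new`}."""
    removed = {}
    for service, old_perms in old.items():
        missing = set(old_perms) - set(new.get(service, ()))
        if missing:
            removed[service] = missing
    return removed


# Hypothetical example data; none of these names come from the incident.
legacy = {"document-service": {"storage:read", "storage:write", "queue:publish"}}
migrated = {"document-service": {"storage:read", "queue:publish"}}

lost = removed_permissions(legacy, migrated)
for service, perms in sorted(lost.items()):
    print(f"{service} would lose: {', '.join(sorted(perms))}")
```

A check like this could run as a pre-release step and block the rollout whenever the result is non-empty, unless the removal is explicitly approved.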

Timeline

  • 10:33:03 - An infrastructure change is released
  • 10:37:00 - The error rate for our internal API used to process uploaded documents reaches 80%
  • 10:38:37 - On-Call monitoring is triggered and an incident is created
  • 10:47:00 - Log analysis identifies the error and the services impacted
  • 11:03:00 - The root cause is identified and a fix is prepared
  • 11:10:34 - The release of the fix is triggered
  • 11:14:00 - The service is fully operational
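
The intervals implied by this timeline (time to detect, time to root cause, time to recover) can be computed directly from the timestamps above. The snippet below is a minimal sketch: the dictionary keys are labels chosen for this example, and all times are assumed to be UTC on the same day.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
# The keys are illustrative labels, not identifiers from any real system.
TIMELINE = {
    "change_released": "10:33:03",
    "incident_created": "10:38:37",
    "root_cause_identified": "11:03:00",
    "fix_released": "11:10:34",
    "fully_operational": "11:14:00",
}


def elapsed(start_key, end_key):
    """Return the timedelta between two timeline events on the same day."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(TIMELINE[start_key], fmt)
    end = datetime.strptime(TIMELINE[end_key], fmt)
    return end - start


time_to_detect = elapsed("change_released", "incident_created")
time_to_root_cause = elapsed("change_released", "root_cause_identified")
fix_rollout = elapsed("fix_released", "fully_operational")
time_to_recover = elapsed("change_released", "fully_operational")

print("Time to detect:    ", time_to_detect)      # 0:05:34
print("Time to root cause:", time_to_root_cause)  # 0:29:57
print("Fix rollout:       ", fix_rollout)         # 0:03:26
print("Time to recover:   ", time_to_recover)     # 0:40:57
```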

Remedies

  • The release process for this class of infrastructure change will be hardened to shorten the feedback loop and speed up resolution. This will be completed before any further related changes are applied.
Posted Oct 01, 2024 - 15:39 UTC

Resolved
This issue is now resolved.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Sep 24, 2024 - 11:43 UTC
Monitoring
We have implemented a fix for this issue.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.
Posted Sep 24, 2024 - 11:30 UTC
Identified
The issue has been identified and a fix is being implemented.

We will provide a further update at 11:30 UTC.
Posted Sep 24, 2024 - 11:17 UTC
Investigating
We're currently experiencing elevated error rates impacting report creation, affecting all clients.

We will provide an update at 11:15 UTC.
Posted Sep 24, 2024 - 11:00 UTC
This incident affected: Europe (onfido.com) (Document Verification, Facial Similarity).