Elevated error rate affecting check creation in EU region
Incident Report for Onfido
Postmortem

Summary

On 18 January 2022, the public API deployed in the EU cluster experienced degraded performance between 09:50 UTC and 09:56 UTC, and was unavailable between 09:56 UTC and 10:13 UTC.

While the API was unavailable, new requests for any type of check were not accepted, and ongoing checks may have experienced increased turnaround time (TaT).

The incident was caused by a database outage. No data was lost during the incident.

The table below shows the TaT percentiles from 09:00 UTC to 11:00 UTC on 18 January 2022.

Report type              | TaT p99 (minutes) | TaT p95 (minutes) | TaT p50 (minutes)
Right To Work Report     | 68.5              | 53.8              | 1.3
Proof Of Address Report  | 34.2              | 33.7              | 9.8
Document Report          | 32.4              | 18.6              | 1.8
Facial Similarity Report | 7.1               | 2.0               | 0.1
Watchlist Report         | 0.2               | 0.1               | 0.0
Known Faces Report       | 0.1               | 0.1               | 0.1
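
For readers who want to reproduce this kind of figure from their own data, the sketch below shows one common way to compute percentiles (the nearest-rank method). This report does not state which method produced the table, and the function name and sample values are hypothetical, not data from the incident.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]

# Hypothetical turnaround times in minutes (illustrative only).
sample_tat_minutes = [0.5, 0.8, 1.1, 1.3, 1.6, 2.0, 4.5, 12.0, 30.0, 60.0]

p50 = percentile(sample_tat_minutes, 50)
p95 = percentile(sample_tat_minutes, 95)
p99 = percentile(sample_tat_minutes, 99)
```

With a small sample, p95 and p99 both land on the slowest check, which is why a few delayed checks can dominate the upper percentiles while p50 stays low.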

Root Causes

The database infrastructure was unavailable because a scheduled DB schema change conflicted with a DB replication job. Under the locking scheme in use, the database queued the schema update indefinitely, along with all subsequent read/write transactions. As a result, the system correctly completed operations submitted before the schema update but did not process those submitted after it.
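
The queuing behaviour described above can be illustrated with a simplified model. This is only a sketch under stated assumptions: the report does not name the database engine or its lock modes, so the two modes, the strict first-in-first-out queue, and all names below are hypothetical.

```python
from collections import deque

SHARED = "shared"        # assumed mode for ordinary reads/writes and the replication job
EXCLUSIVE = "exclusive"  # assumed mode required by a schema change

def grant_in_order(held, waiting):
    """Grant queued lock requests strictly in arrival order, stopping at the first conflict.

    held:    lock modes currently held on the table
    waiting: (name, mode) requests in arrival order
    Returns (granted_names, blocked_names).
    """
    held = list(held)
    queue = deque(waiting)
    granted = []
    while queue:
        name, mode = queue[0]
        if mode == EXCLUSIVE:
            conflict = len(held) > 0      # exclusive conflicts with any held lock
        else:
            conflict = EXCLUSIVE in held  # shared conflicts only with exclusive
        if conflict:
            break  # every request behind this one has to wait as well
        held.append(mode)
        granted.append(name)
        queue.popleft()
    blocked = [name for name, _ in queue]
    return granted, blocked

# The replication job holds a lock; the schema update arrives before normal traffic.
held_by_replication = [SHARED]
requests = [("schema_update", EXCLUSIVE), ("read_1", SHARED), ("write_1", SHARED)]

granted, blocked = grant_in_order(held_by_replication, requests)

# Without the schema update in the queue, the same traffic is served normally.
granted_without_migration, blocked_without_migration = grant_in_order(
    held_by_replication, requests[1:]
)
```

In this model the schema update cannot start while the replication job holds its lock, and because requests are served in order, the reads and writes behind it are blocked even though they would not conflict with the replication job on their own.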

Timeline

  • 09:50 UTC: The migration job was triggered.
  • 09:54 UTC: An alert was raised for the EU dashboard being down.
  • 09:55 UTC: We found that one of the main tables in our database was locked.
  • 09:58 UTC: We started to investigate where the lock came from.
  • 10:05 UTC: We found the query locking the table.
  • 10:10 UTC: We stopped the job running the query.
  • 10:13 UTC: Traffic recovered to normal levels.

Remedies

We implemented and deployed additional controls on the release of DB schema updates to enforce a maximum time budget for such operations. If an operation cannot complete successfully within a short period of time, the change is not applied to the production environment, which gives the development team the opportunity to identify and resolve the conflicting simultaneous operations.
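
The idea of a time budget can be sketched as follows. This is not a description of the controls that were actually deployed, which this report does not detail. It is a minimal in-process illustration in which a Python lock stands in for the table lock, and every name is hypothetical. In a real database, the equivalent would typically be a server-side lock or statement timeout.

```python
import threading

class MigrationAborted(Exception):
    """Raised when the schema change cannot start within its time budget."""

def run_with_lock_budget(table_lock, apply_change, budget_seconds):
    """Apply a schema change only if the table lock is obtained within the budget."""
    if not table_lock.acquire(timeout=budget_seconds):
        # Give up instead of queuing indefinitely, so other traffic is not held up.
        raise MigrationAborted(
            f"lock not acquired within {budget_seconds}s; change not applied"
        )
    try:
        return apply_change()
    finally:
        table_lock.release()

table_lock = threading.Lock()

# Simulate a conflicting job (for example, replication) holding the lock.
table_lock.acquire()
try:
    first_attempt = run_with_lock_budget(table_lock, lambda: "applied", budget_seconds=0.05)
except MigrationAborted:
    first_attempt = "aborted"

# Once the conflicting job finishes, the same change goes through.
table_lock.release()
second_attempt = run_with_lock_budget(table_lock, lambda: "applied", budget_seconds=0.05)
```

The first attempt is abandoned after the budget expires and the second succeeds, which mirrors the intent of failing fast and retrying once the conflict has been resolved.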

Posted Jan 28, 2022 - 14:54 UTC

Resolved
This issue is now resolved:

Elevated error rate affecting check creation in EU region.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Jan 18, 2022 - 10:43 UTC
Update
We have implemented a fix for this issue, but we are currently experiencing higher turnaround times.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.

We will provide an update at 10:45 UTC.
Posted Jan 18, 2022 - 10:30 UTC
Monitoring
We have implemented a fix for this issue.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.
Posted Jan 18, 2022 - 10:15 UTC
Investigating
We're currently experiencing elevated error rates impacting check creation, affecting all clients in the EU region.

We will provide an update at 10:15 UTC.
Posted Jan 18, 2022 - 10:05 UTC
This incident affected: Europe (onfido.com) (API, Dashboard, Document Verification, Facial Similarity, Watchlist, Identity Enhanced, Right To Work).