On 18 January 2022, the public API deployed in the EU cluster experienced degraded performance between 09:50 UTC and 09:56 UTC, and was unavailable between 09:56 UTC and 10:13 UTC.
While the API was unavailable, new requests for any type of check were not accepted, and ongoing checks may have experienced increased turnaround time (TaT).
The incident was caused by a database outage. No data was lost during the incident.
The table below shows the TaT percentiles from 09:00 UTC to 11:00 UTC on 18 January 2022.
Report type | TaT p99 (minutes) | TaT p95 (minutes) | TaT p50 (minutes) |
---|---|---|---|
Right To Work Report | 68.5 | 53.8 | 1.3 |
Proof Of Address Report | 34.2 | 33.7 | 9.8 |
Document Report | 32.4 | 18.6 | 1.8 |
Facial Similarity Report | 7.1 | 2.0 | 0.1 |
Watchlist Report | 0.2 | 0.1 | 0.0 |
Known Faces Report | 0.1 | 0.1 | 0.1 |
The database infrastructure was unavailable because a scheduled DB schema change conflicted with a DB replication job. Under the locking scheme in use, the database queued the DB schema update indefinitely, along with all subsequent read/write transactions. As a result, the system correctly completed operations submitted before the DB schema update but did not process those submitted after it.
We implemented and deployed additional controls on the release of DB schema updates to enforce a maximum time budget for such operations. If an update cannot complete successfully within a short period of time, the change is not applied to the production environment, so that the development team can identify and resolve the conflicting simultaneous operations.
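
As an illustration only, the idea of a time budget for schema updates can be sketched as follows. This is not the production tooling: it uses SQLite from the Python standard library as a stand-in database, and the names `apply_schema_change`, `column_names`, and `budget_seconds` are hypothetical. The sketch assumes that the schema change competes for a lock with another job, and that giving up after the budget expires is preferable to waiting indefinitely.

```python
import sqlite3


def apply_schema_change(db_path, ddl, budget_seconds):
    """Try to apply one DDL statement within a time budget (hypothetical helper).

    Returns True if the change was applied, or False if the lock could not
    be acquired before the budget expired, in which case nothing is changed.
    """
    # `timeout` is how long SQLite waits on a locked database before giving up.
    conn = sqlite3.connect(db_path, timeout=budget_seconds)
    try:
        conn.execute(ddl)
        conn.commit()
        return True
    except sqlite3.OperationalError as exc:
        # SQLite reports an expired wait as "database is locked".
        if "locked" in str(exc):
            return False
        raise  # any other failure (for example a syntax error) is not a timeout
    finally:
        conn.close()


def column_names(db_path, table):
    """Return the column names of `table`, used here to verify the outcome."""
    conn = sqlite3.connect(db_path)
    try:
        return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    finally:
        conn.close()
```

With a conflicting job holding a write lock, `apply_schema_change(path, "ALTER TABLE checks ADD COLUMN status TEXT", 0.2)` returns `False` after roughly 0.2 seconds and leaves the table unchanged. Once the conflicting job finishes, the same call succeeds. The caller fails fast and can retry later, so it does not block the transactions queued behind it.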