anomalous number of 5xx errors
Incident Report for Onfido
Postmortem

Summary

During a deployment, a database schema migration on one of Studio's tables caused a temporary mismatch between the database schema and the application-side ORM model. For a short time, the ORM in old instances of the application tried to access columns that no longer existed.
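The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not Studio's actual code: it uses an in-memory SQLite database, and the table name `workflows`, the column `legacy_flag`, and the helper functions are all hypothetical. It assumes an ORM that lists every mapped column explicitly in its queries.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
OLD_MODEL_COLUMNS = ["id", "name", "legacy_flag"]  # what old instances still expect
NEW_MODEL_COLUMNS = ["id", "name"]                 # what updated instances expect


def select_all(conn, columns):
    """Mimic an ORM that names every mapped column in its SELECT."""
    return conn.execute(f"SELECT {', '.join(columns)} FROM workflows").fetchall()


def drop_legacy_column(conn):
    """Rebuild the table without legacy_flag (works on any SQLite version)."""
    conn.execute("CREATE TABLE workflows_new (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO workflows_new SELECT id, name FROM workflows")
    conn.execute("DROP TABLE workflows")
    conn.execute("ALTER TABLE workflows_new RENAME TO workflows")


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflows (id INTEGER PRIMARY KEY, name TEXT, legacy_flag INTEGER)"
    )
    conn.execute("INSERT INTO workflows (name, legacy_flag) VALUES ('example', 1)")

    before = select_all(conn, OLD_MODEL_COLUMNS)    # old code, old schema: works
    drop_legacy_column(conn)                         # the migration lands first
    new_rows = select_all(conn, NEW_MODEL_COLUMNS)   # new code, new schema: works
    try:
        select_all(conn, OLD_MODEL_COLUMNS)          # old code, new schema: fails
        old_code_error = None
    except sqlite3.OperationalError as exc:
        old_code_error = str(exc)                    # missing-column error
    conn.close()
    return before, new_rows, old_code_error
```

Calling `run_demo()` shows that the same query succeeds before the migration and raises a missing-column error afterwards, which is the window in which old instances would return errors.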

As a consequence, the application saw a surge of errors, which led to 5xx HTTP errors on multiple Studio endpoints. These affected around 23% of the traffic during a 15-minute interval and contributed to an overall error rate of 0.52% for the whole day.

During the incident period, Workflow Runs also could not be completed and ended up in error status. The cancellation rate was 31.8% during the 15-minute interval, and the overall cancellation rate for the day was 0.28%.

Timeline

08:18 - Database migration deployed (the updated application-side code had not yet been rolled out to all instances).

08:20 - Alarms triggered and Engineering started investigating the errors.

08:31 - Issue identified.

08:33 - Issue resolved.

Remedies

The process for reviewing and releasing database migrations will be updated to automatically detect migrations and DB schema changes that need to be done in two separate steps, and to enforce that they are released that way.
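As an illustration of what such automated detection could look like, here is a minimal sketch. It is not the actual tooling. It assumes that the "two steps" mean first releasing code that no longer uses a column and only afterwards dropping it. The pattern list, the function names, and the `code_already_released` flag are all hypothetical.

```python
import re

# Statements that remove or rename something running code may still read.
# This list is an assumption for illustration, not an exhaustive rule set.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    re.compile(r"\bRENAME\s+COLUMN\b", re.IGNORECASE),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
]


def find_destructive_statements(migration_sql):
    """Return the statements in a migration that match a destructive pattern.

    Splitting on ';' is a simplification; a real check would use a SQL parser.
    """
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [s for s in statements if any(p.search(s) for p in DESTRUCTIVE_PATTERNS)]


def check_migration(migration_sql, code_already_released):
    """Return (ok, offending_statements).

    A destructive migration passes only when the code that stops using the
    affected columns has already been rolled out to all instances.
    """
    offending = find_destructive_statements(migration_sql)
    ok = code_already_released or not offending
    return ok, offending
```

With this sketch, a migration containing `ALTER TABLE workflows DROP COLUMN legacy_flag` would be rejected unless `code_already_released` is true, while a purely additive migration would pass in either case.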

Posted Nov 14, 2024 - 10:41 UTC

Resolved
Studio-related features were partially down due to a deployment that happened this morning. The impact period was 08:18 to 08:33.
Posted Oct 18, 2024 - 08:18 UTC