Studio Degradation
Incident Report for Onfido
Postmortem

Summary

On 19 November 2024 at 18:45 UTC, a change to a database index in the EU and US regions severely degraded the creation and management of workflow runs via the Studio API and the SDK.

Root Causes

A column was added to an existing index on a core database table to improve the performance of a specific combination of filters on the Dashboard results page. The change was performed by first dropping the existing index and then recreating it with the additional column. Dropping the index caused a spike in CPU load across all database operations involving that table, which deprioritized the recreation of the index. The system remained unstable until the new index was force-created.
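
A common way to avoid the gap between the drop and the rebuild is to create the replacement index first and drop the old one only afterwards. The sketch below illustrates that ordering. It assumes PostgreSQL syntax (the report does not name the database engine), and the table, index, and column names are hypothetical.

```python
def plan_index_swap(table, old_index, new_index, columns):
    """Return SQL statements that widen an index without a gap in coverage.

    Assumes PostgreSQL syntax; the report does not name the database engine.
    """
    cols = ", ".join(columns)
    return [
        # 1. Build the replacement while the old index still serves queries.
        f"CREATE INDEX CONCURRENTLY {new_index} ON {table} ({cols});",
        # 2. Only after the new index is valid, remove the old one.
        f"DROP INDEX CONCURRENTLY {old_index};",
    ]


# Hypothetical names, for illustration only.
for statement in plan_index_swap(
    table="workflow_runs",
    old_index="idx_runs_status",
    new_index="idx_runs_status_created_at",
    columns=["status", "created_at"],
):
    print(statement)
```

In PostgreSQL, statements using CONCURRENTLY cannot run inside a transaction block, so a migration tool would need to execute each statement on its own.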

Timeline

All timestamps below are for 19 November 2024 and are in UTC:

  • 18:44:25 - operation dropping the index was started
  • 18:45:37 - first request fails due to statement timeout
  • 18:49:00 - alarm triggers for high surge of 5xx HTTP errors for the Studio API
  • 19:18:02 - incident was reported
  • 19:47:00 - manual force-creation of the index started in the US
  • 19:57:00 - US region recovered
  • 19:59:00 - manual force-creation of the index started in the EU
  • 20:09:00 - EU region recovered
  • 20:47:09 - incident resolved
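
The intervals implied by these timestamps can be worked out with a short script. This is only a sketch: it assumes the first failing request marks the start of impact in both regions, which the timeline does not state per region.

```python
from datetime import datetime


def elapsed(start, end):
    """Return the time between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Assumption: impact in both regions starts at the first failing request.
first_failure = "18:45:37"

print("drop to first failure:", elapsed("18:44:25", first_failure))          # 0:01:12
print("first failure to alarm:", elapsed(first_failure, "18:49:00"))         # 0:03:23
print("first failure to US recovery:", elapsed(first_failure, "19:57:00"))   # 1:11:23
print("first failure to EU recovery:", elapsed(first_failure, "20:09:00"))   # 1:23:23
```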

Remedies

  • Integrate database migration acceptance rules and broaden list of reviewers;
  • Introduce a kill switch to enforce Studio API “maintenance mode” in order to be able to prioritize recovery actions and reduce overall Mean Time To Recovery;
  • Full split of data migration pipeline from code deployment pipeline;
  • Split Dashboard read-only post-execution traffic from the critical path.
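
As an illustration of what a migration acceptance rule might look like, the sketch below flags any migration that drops an index before a replacement has been created in the same migration. The rule and the function name `check_migration` are hypothetical and are not taken from the report; a real rule set would likely need an explicit override for intentional index removals.

```python
import re

DROP_INDEX = re.compile(r"^\s*DROP\s+INDEX\b", re.IGNORECASE)
CREATE_INDEX = re.compile(r"^\s*CREATE\s+(UNIQUE\s+)?INDEX\b", re.IGNORECASE)


def check_migration(statements):
    """Return a list of rule violations for an ordered list of SQL statements.

    Hypothetical rule: an index may only be dropped after a replacement
    index has been created earlier in the same migration.
    """
    violations = []
    replacement_created = False
    for position, statement in enumerate(statements, start=1):
        if CREATE_INDEX.match(statement):
            replacement_created = True
        elif DROP_INDEX.match(statement) and not replacement_created:
            violations.append(
                f"statement {position}: DROP INDEX before any CREATE INDEX"
            )
    return violations


# Hypothetical migrations: the first mirrors the drop-then-recreate order.
risky = ["DROP INDEX idx_old;", "CREATE INDEX idx_new ON t (a, b);"]
safe = list(reversed(risky))
print(check_migration(risky))  # ['statement 1: DROP INDEX before any CREATE INDEX']
print(check_migration(safe))   # []
```
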
Posted Nov 28, 2024 - 10:27 UTC

Resolved
This incident is resolved. A postmortem with more details will be provided soon.
Posted Nov 19, 2024 - 20:43 UTC
Monitoring
A fix was deployed 15 minutes ago; clients should see all systems back to normal. We're still monitoring.
Posted Nov 19, 2024 - 20:24 UTC
Identified
Clients using the Studio feature are seeing 5xx errors. We're restoring the service.
Posted Nov 19, 2024 - 19:37 UTC
This incident affected: Europe (onfido.com) (API, Dashboard) and USA (us.onfido.com) (API, Dashboard).