Workflow errors and latency

Incident Report for Onfido

Postmortem

Summary

On 11 February 2025, at around 14:12 UTC, a database change to a primary key index in the EU region impacted the creation and management of workflow runs via the Studio API and the SDK.

Root Causes

During a schema migration that changed a table's primary key, an index was inadvertently dropped, which degraded the performance of some critical operations that relied on it.

The incident lasted around 15 minutes, until the index for the previous primary key was re-introduced.

The rollout of this change failed to strictly follow our established internal change management procedures for database migrations.
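The kind of pre-rollout check that could catch an inadvertently dropped index can be sketched as a comparison of a table's indexes before and after a migration. This is a minimal illustration only: it uses SQLite from the Python standard library so that it is self-contained, the report does not state which database engine is involved, and the table and index names are hypothetical.

```python
import sqlite3


def index_names(conn: sqlite3.Connection, table: str) -> set[str]:
    """Return the names of all indexes currently defined on `table`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    ).fetchall()
    return {row[0] for row in rows}


def dropped_indexes(before: set[str], after: set[str]) -> set[str]:
    """Indexes that existed before the migration but not after it."""
    return before - after


# Hypothetical table and index names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workflow_runs (id INTEGER, uuid TEXT)")
conn.execute("CREATE INDEX idx_workflow_runs_id ON workflow_runs (id)")

before = index_names(conn, "workflow_runs")
conn.execute("DROP INDEX idx_workflow_runs_id")  # stands in for the migration
after = index_names(conn, "workflow_runs")

missing = dropped_indexes(before, after)
if missing:
    print(f"migration dropped indexes: {sorted(missing)}")
```

Run against a staging copy of the schema, a non-empty result would be a signal to stop and review the migration before any production rollout.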

Timeline

13:35:35: progressive rollout of the schema migration started in the US region

14:09:30: unsuccessful attempt to manually abort the rollout of the migration after an increase in P50 endpoint latency was observed in that region

14:12:00: progressive rollout of the schema migration started executing in the EU region

14:19:49: first related HTTP 500 API error was recorded, caused by database query timeouts

14:23:24: monitoring alarm triggered due to a surge of 5XX HTTP errors in the Workflow API

14:30:58: incident was reported

14:32:00: index on the previous primary key started being re-created

14:35:29: API recovered

14:38:45: alarm recovered

14:50:47: incident closed

Remedies

  • For the foreseeable future, all Studio database migrations will require explicit review by a senior engineering leader to ensure rollout strictly follows our established internal change management procedures for database migrations;
  • Additional measures to test the impact of schema changes, prior to any production rollout, will be performed via load tests, with the goal of measuring deviations in the P50/P75 API endpoint latency;
  • A longer progressive rollout interval across regions will be applied in order to provide more opportunity to spot issues before moving on to the higher-volume regions;
  • Monitoring of abnormal API endpoint latency surges will be improved in order to automatically detect deviations without requiring active human observation.
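
The latency checks described in the remedies above can be sketched as a comparison of P50/P75 values between a baseline sample and a sample taken after a schema change. This is a hedged illustration, not Onfido's tooling: the function names, the nearest-rank percentile method, the 10% tolerance, and the sample values are all assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of `samples` for 0 < p <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def latency_regressions(baseline, candidate, percentiles=(50, 75), tolerance=0.10):
    """Map each regressed percentile to (baseline_value, candidate_value).

    A percentile counts as regressed when the candidate value exceeds the
    baseline value by more than `tolerance` (an assumed threshold).
    """
    regressions = {}
    for p in percentiles:
        base = percentile(baseline, p)
        cand = percentile(candidate, p)
        if cand > base * (1 + tolerance):
            regressions[p] = (base, cand)
    return regressions


# Made-up latency samples in milliseconds, for illustration only.
baseline_ms = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
after_migration_ms = [2 * v for v in baseline_ms]

print(latency_regressions(baseline_ms, after_migration_ms))
# {50: (140, 280), 75: (170, 340)}
```

The same comparison could gate a progressive rollout: if any percentile regresses in the first region, the rollout would pause before reaching higher-volume regions.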
Posted Feb 14, 2025 - 10:36 UTC

Resolved

We saw elevated workflow errors on creation and completion for Studio customers. Higher latency in general was also observed in the EU region.

The incident started at 14:19 UTC and was resolved at 14:36 UTC.
Posted Feb 11, 2025 - 14:00 UTC