Timeouts invoking Studio API & Dashboard
Incident Report for Onfido
Postmortem

Summary

Between 2024-09-06 18:40:00 and 2024-09-07 02:25:00 (UTC), a single account issued a surge of requests for older pages of the workflow runs dataset, which is used to render the Workflow Results page of the Onfido Dashboard. These requests caused overhead in database resource management, which escalated and eventually impacted all Studio-related traffic requiring database access, approximately 0.2% of SDK and API traffic, and 6.3% of Dashboard-related traffic.

A second instance of this incident occurred between 2024-09-11 09:58:00 and 2024-09-11 11:18:00 (UTC), with an impact similar to that described above (0.39% of SDK and API traffic, 1.22% of Dashboard-related traffic).

Root Causes

The affected request paginates the dataset using a page query parameter, which the SQL query that retrieves the results from the database translates to LIMIT/OFFSET. The performance of these queries degrades as the OFFSET grows, because the database still has to produce and discard every skipped row before returning the requested page. The requests associated with the incident had unusually high page values (2000 or higher), which the database struggled to serve, degrading its performance in the process. Due to a misconfiguration, the database session statement timeout was ignored, which prevented the database from force-terminating the queries and freeing up resources earlier.
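
As an illustration of the pagination pattern described above, the following is a minimal sketch and not Onfido's actual code. It assumes a hypothetical workflow_runs table, a hypothetical page size of 50, and an in-memory SQLite database standing in for the real one (the report does not name the database engine). It contrasts page-number pagination, where the skipped rows still have to be scanned, with keyset pagination, a commonly used alternative that seeks straight to a cursor position through an index.

```python
import sqlite3

PAGE_SIZE = 50  # hypothetical page size, not taken from the report


def build_demo_db(rows=1000):
    """Create an in-memory table standing in for the workflow runs dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE workflow_runs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO workflow_runs (id, status) VALUES (?, ?)",
        ((i, "completed") for i in range(1, rows + 1)),
    )
    return conn


def offset_page(conn, page):
    """Page-number pagination: `page` becomes LIMIT/OFFSET.

    The database walks past (page - 1) * PAGE_SIZE rows before it can
    return anything, so the cost grows with the page number.
    """
    offset = (page - 1) * PAGE_SIZE
    return conn.execute(
        "SELECT id, status FROM workflow_runs ORDER BY id DESC LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()


def keyset_page(conn, before_id=None):
    """Keyset pagination: continue from the last id the client has seen.

    The WHERE clause lets the primary-key index seek to the start of the
    page, so the cost does not depend on how deep the page is.
    """
    if before_id is None:
        return conn.execute(
            "SELECT id, status FROM workflow_runs ORDER BY id DESC LIMIT ?",
            (PAGE_SIZE,),
        ).fetchall()
    return conn.execute(
        "SELECT id, status FROM workflow_runs WHERE id < ? ORDER BY id DESC LIMIT ?",
        (before_id, PAGE_SIZE),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    # Both calls return the same rows; only the access path differs.
    print(offset_page(conn, 3) == keyset_page(conn, before_id=901))
```

The trade-off is that keyset pagination gives up direct jumps to an arbitrary page number in exchange for a cost that stays flat on deep pages.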

Remedies

The statement timeout has been correctly configured for all affected database sessions. In addition, measures will be taken to prevent and minimize the impact of user actions.
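
The report does not say which preventive measures are planned. One common guard against deep-page requests, sketched here purely as an assumption, is to validate the page query parameter and reject values beyond a fixed cap before any query reaches the database. The names parse_page, PageOutOfRange, and the cap of 500 are hypothetical.

```python
MAX_PAGE = 500  # hypothetical cap; the incident involved page values of 2000 or higher


class PageOutOfRange(ValueError):
    """Raised when the requested page is missing, malformed, or too deep."""


def parse_page(raw, max_page=MAX_PAGE):
    """Validate a raw `page` query parameter before it is turned into an OFFSET."""
    try:
        page = int(raw)
    except (TypeError, ValueError):
        raise PageOutOfRange(f"page must be an integer, got {raw!r}")
    if page < 1 or page > max_page:
        raise PageOutOfRange(f"page must be between 1 and {max_page}, got {page}")
    return page
```

A request handler could map PageOutOfRange to a client error response, so that an oversized page value is rejected without running any query.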

Posted Sep 16, 2024 - 15:24 UTC

Resolved
Customers faced a partial performance degradation between 2024-09-06 18:40:00 and 2024-09-07 02:25:00 (UTC) when invoking the public API and Dashboard for the Studio component.
Posted Sep 06, 2024 - 18:30 UTC