During the deployment of our API component on the EU cluster, and for roughly 20 minutes thereafter, we observed increased latency and timeouts, resulting in a partial outage of api.onfido.com from 13:02 to 13:25 GMT on May 10. A configuration change in the release inadvertently reduced the capacity available to serve our traffic volume at the time.
The release contained a change to the rule that automatically scales the number of servers available. The change was intended to make scaling more reactive and better accommodate spikes in traffic. When rolled out, however, it had the opposite effect: it drastically reduced the number of serving instances to their minimum baseline, which is configured to support off-hours volumes rather than the traffic observed at the time of the deployment.
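To illustrate the failure mode, here is a minimal sketch of a target-tracking style scaling rule. The function, its parameters, and the numbers are all hypothetical, not our actual configuration; they only show how an overly high per-instance target makes the computed capacity collapse to the configured minimum.

```python
import math

def desired_capacity(metric_value: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Hypothetical target-tracking rule: provision enough instances
    so that each one carries roughly `target` units of the metric,
    clamped between the group's minimum and maximum size."""
    desired = math.ceil(metric_value / target)
    return max(min_size, min(desired, max_size))

# With a sensible per-instance target, daytime traffic keeps many servers up:
desired_capacity(9000, 100, min_size=4, max_size=200)   # -> 90

# A target set far too high makes the same traffic look negligible,
# so the group collapses to its minimum baseline:
desired_capacity(9000, 5000, min_size=4, max_size=200)  # -> 4
```

The clamp to `min_size` is exactly why the group did not scale to zero: the baseline kept a skeleton of off-hours capacity online, which was still far below what the live traffic required.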
Although we use a gradual approach with canaries to minimise the risk of changes to our production environment, there is no practical canary or partial rollout mechanism for auto-scaling changes, which are applied immediately.
The end-to-end tests that run in our pre-production environment with every new release did not detect the issue: the auto-scaling configuration was deemed valid and metric readings were measured correctly, so no test failed.
This particular change set the auto-scaling threshold to a value so high that its effect would only become visible under load testing, which was not performed due to the timing of the merge and deployment.
An immediate rollback of the change restored server capacity to normal levels.
As follow-up actions from this incident, we will harden our deployment pipeline and improve our release process to include: