During the deployment of our API component on the EU cluster, and for roughly 20 minutes thereafter, we observed increased latency and timeouts, resulting in a partial outage of api.onfido.com from 13:02 to 13:25 GMT on May 10. A configuration change in the release inadvertently reduced the capacity available to serve our traffic volume at the time.
The release contained a change to the rule that automatically scales the number of servers available. The change was intended to make scaling more reactive and better accommodate spikes in traffic. When rolled out, however, it had the opposite effect: it drastically reduced the number of serving instances to their minimum baseline, which is configured to support off-hours volumes rather than the traffic observed at the time of the deployment.
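To illustrate the failure mode, here is a minimal sketch of a target-tracking style scaling rule. The function, its parameters, and the numbers are all hypothetical, not our actual configuration; they only show how an overly high per-instance target makes the computed capacity collapse to the configured minimum.

```python
import math

def desired_capacity(metric_value: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Hypothetical target-tracking rule: provision enough instances
    so that each one carries roughly `target` units of the metric,
    clamped between the group's minimum and maximum size."""
    desired = math.ceil(metric_value / target)
    return max(min_size, min(desired, max_size))

# With a sensible per-instance target, daytime traffic keeps many servers up:
desired_capacity(9000, 100, min_size=4, max_size=200)   # -> 90

# A target set far too high makes the same traffic look negligible,
# so the group collapses to its minimum baseline:
desired_capacity(9000, 5000, min_size=4, max_size=200)  # -> 4
```

The clamp to `min_size` is exactly why the group did not scale to zero: the baseline kept a skeleton of off-hours capacity online, which was still far below what the live traffic required.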
Although we use a gradual approach with canaries to minimise the risk of changes to our production environment, there is no practical canary or partial rollout mechanism for auto-scaling changes, which are applied immediately.
The end-to-end tests that run in our pre-production environment with every new release did not detect the issue: the auto-scaling configuration was deemed valid and metric readings were measured correctly, so no test failed.
This particular change set the auto-scaling threshold to a value so high that its effect would only become visible under load testing, which was not performed due to the timing of the merge and deployment.
An immediate rollback of the change restored server capacity to normal levels.
As follow-up actions from this incident, we will harden our deployment pipeline and improve our release process to include: