Degraded performance creating checks
Incident Report for Onfido
Postmortem

Summary

On June 5th 2024, between 15:00 UTC and 16:41 UTC, our non-Studio customers in the EU experienced an ever-increasing error rate for Check creation for 52 minutes, followed by 49 minutes of full Check creation downtime. Regions other than the EU were not affected.

Subsequently, between 16:42 UTC and 18:02 UTC, customers using synchronous Check results suffered timeouts for 80 minutes (webhooks were not affected).

Root Causes

On May 2nd 2024, we performed a routine operation to upgrade the Ruby version used by two Lambda functions that process checks. Both functions were running on Ruby 2.7, which was deprecated by AWS on Dec 12th 2023.

This change was deployed successfully, with no impact on latency or error rate. However, the accompanying version change in the dependency management system (Bundler) also removed the pessimistic version constraint, which would have allowed a patch version change in the Lambda runtime. Without it, the functions accepted only the exact runtime version they were deployed with.
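The difference between an exact pin and a pessimistic constraint can be sketched with RubyGems' own version classes. This is a minimal illustration, not our actual configuration: the version numbers 3.3.0 and 3.3.1 and the helper name runtime_accepted? are assumptions chosen for the example.

```ruby
require "rubygems"

# Hypothetical constraints, for illustration only; the report does not
# state the exact versions involved.
EXACT_PIN   = Gem::Requirement.new("= 3.3.0")   # accepts 3.3.0 only
PESSIMISTIC = Gem::Requirement.new("~> 3.3.0")  # accepts >= 3.3.0 and < 3.4.0

# Returns true when the running version satisfies the declared constraint.
def runtime_accepted?(requirement, running_version)
  requirement.satisfied_by?(Gem::Version.new(running_version))
end

puts runtime_accepted?(EXACT_PIN, "3.3.0")    # true:  the version deployed with
puts runtime_accepted?(EXACT_PIN, "3.3.1")    # false: a patch upgrade is rejected
puts runtime_accepted?(PESSIMISTIC, "3.3.1")  # true:  a patch upgrade is tolerated
puts runtime_accepted?(PESSIMISTIC, "3.4.0")  # false: a minor upgrade is rejected
```

With the exact pin, any patch-level change in the running Ruby version fails the check, which matches the failure mode described in this report.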

Our cloud provider (AWS) ran automated upgrades to a new patch version of the Ruby 3.3 Lambda runtime. One of the affected Lambda functions is on the critical path of our Check creation process. After the AWS runtime upgrade, newly launched Lambda instances failed to start because they were running a new (higher) patch version of the runtime.

The same effect was seen on another Lambda function, which is responsible for handling the response to clients when a synchronous check is completed. This delayed check completion, because those checks would not complete within the request time of POST /vX/checks.

The runtime upgrades by AWS were rolled out progressively over an hour. Because failures increased gradually rather than appearing all at once, pinpointing the exact root cause was more complex.

Timeline (in UTC)

15:00 - Cloud provider AWS starts the rollout of the upgrade of Lambda runtime ruby3.3 from v4 to v6;

15:12 - Alerts fire for a high error rate in the Lambda that orchestrates Check creation;

15:15 - Investigation starts;

15:57 - Status page is updated with ongoing incident;

16:00 - The AWS automated upgrade of the runtime involved in our Check creation process is completed (100% of Check creation is now affected);

16:02 - The root cause is now confirmed and we start the implementation of a solution;

16:33 - Deployment of the solution (pinning the runtime to our specified version) is triggered;

16:41 - The solution to fix our Check Creation process is fully applied;

16:42 - We receive alerts for a high error rate in the Lambdas involved in synchronous Check creation (for customers using synchronous checks, but not using webhooks);

17:01 - We start listing all impacted Lambdas to implement the same corrections;

17:41 - Lambdas are fixed and progressively rolled out as soon as they are ready;

18:02 - All Lambdas involved in the entire Checks creation workflow are fixed.

Remedies

The following actions have resulted from our root-cause analysis:

  • Review the provisioning configuration of all our Lambda functions to ensure that no unsupervised, automated updates happen, so that we control even patch version upgrades of AWS Lambda runtimes;
  • Review on-call profile permissions for AWS Lambda resources;
  • Expand our runbooks with additional instructions on how to handle similar failures with Lambdas;
  • Add to our roadmap the implementation of short-circuit CI/CD pipelines for on-call engineers, allowing us to skip certain steps and restore service faster (reduce MTTR) in full-system-down situations such as this one.
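
The first action above could be supported by a small audit. The sketch below is illustrative only: it assumes runtime pinning is expressed through AWS Lambda's runtime management settings (an update mode of "Auto", "FunctionUpdate" or "Manual", plus a runtime version ARN), which this report does not confirm. The RuntimeConfig struct, the function names and the ARN are hypothetical, and the sample data stands in for values that would be fetched from AWS.

```ruby
# Hypothetical record of one function's runtime management settings.
RuntimeConfig = Struct.new(:function_name, :update_runtime_on,
                           :runtime_version_arn, keyword_init: true)

# Returns the names of functions whose runtime can still change without
# supervision: anything not in "Manual" mode, or lacking a pinned ARN.
def unsupervised_updates(configs)
  configs
    .reject { |c| c.update_runtime_on == "Manual" && !c.runtime_version_arn.to_s.empty? }
    .map(&:function_name)
end

# Sample data; names and ARN are placeholders, not real resources.
SAMPLE_CONFIGS = [
  RuntimeConfig.new(function_name: "check-orchestrator",
                    update_runtime_on: "Auto", runtime_version_arn: nil),
  RuntimeConfig.new(function_name: "sync-responder",
                    update_runtime_on: "Manual",
                    runtime_version_arn: "arn:aws:lambda:eu-west-1::runtime:EXAMPLE"),
  RuntimeConfig.new(function_name: "report-builder",
                    update_runtime_on: "FunctionUpdate", runtime_version_arn: nil)
].freeze

puts unsupervised_updates(SAMPLE_CONFIGS).inspect
```

Run against the sample data, the audit flags check-orchestrator and report-builder and leaves the pinned sync-responder alone.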
Posted Jun 10, 2024 - 15:03 UTC

Resolved
This incident has been resolved.
Posted Jun 05, 2024 - 16:58 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 05, 2024 - 16:51 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 05, 2024 - 16:33 UTC
Investigating
We are currently investigating this issue.
Posted Jun 05, 2024 - 15:57 UTC
This incident affected: Europe (onfido.com) (API).