On July 19th, our customers experienced an increased error rate in our US instance from 13:25 to 15:42 UTC. From 13:25 to 14:50 the error rate rose from ~0% to ~0.5%. It then climbed, peaking at 6%, until the problem was resolved at 15:42.
A routine infrastructure change was introduced on July 19th at 13:10 UTC, progressively rolling out to all our systems over the next few hours. The change was incompatible with a legacy configuration of our logging system in the US instance, leading to sporadic application errors.
Aggregate error rates were initially too low to trigger an alert, which delayed our response. Because the change also disrupted error logging, identifying the source of the problem was significantly harder.
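For illustration, the sketch below shows how an aggregate error-rate alert can miss a regression of this size: the errors from a single impacted service are diluted by healthy traffic, so the instance-wide rate stays under the alert threshold. The thresholds, service names, and traffic numbers are hypothetical and do not reflect our actual monitoring configuration.

```python
# Hypothetical illustration of why the incident was slow to alert: one impacted
# service pushes the instance-wide error rate to ~0.5%, which stays below an
# assumed 1% aggregate alert threshold. All names and numbers are made up.

AGGREGATE_THRESHOLD = 0.01     # assumed instance-wide alert threshold (1%)
PER_SERVICE_THRESHOLD = 0.005  # assumed per-service alert threshold (0.5%)

# Hypothetical per-service request/error counts during the first phase.
services = {
    "svc-a": {"requests": 50_000, "errors": 2_600},  # the impacted service
    "svc-b": {"requests": 250_000, "errors": 50},
    "svc-c": {"requests": 200_000, "errors": 40},
}

total_requests = sum(s["requests"] for s in services.values())
total_errors = sum(s["errors"] for s in services.values())
aggregate_rate = total_errors / total_requests

# Aggregate view: the regression is diluted across healthy services and no alert fires.
print(f"instance-wide: {aggregate_rate:.2%} -> alert: {aggregate_rate > AGGREGATE_THRESHOLD}")

# Per-service view: the impacted service clearly crosses its own threshold.
for name, stats in services.items():
    rate = stats["errors"] / stats["requests"]
    print(f"{name}: {rate:.2%} -> alert: {rate > PER_SERVICE_THRESHOLD}")
```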
13:10 - Infrastructure change is published and progressive rollout starts.
13:25 - The error rate starts rising toward ~0.5%.
13:50 - Monitoring shows that a single service is impacted and that a new version of it had just been released. This new version is initially suspected as the culprit.
14:20 - The service is rolled back to its previous version, but the error rate does not improve.
14:50 - Additional services become impacted and the error rate climbs toward its 6% peak.
15:15 - The root cause is identified and a fix (rollback) is in preparation.
15:25 - The rollback is ready and the release is being prepared.
15:42 - The fix is fully rolled out and the error rate is back to ~0%.
Amend the legacy observability configuration in the US instance to restore consistency with our standard regional setup.
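As an illustration of that action item, the sketch below compares a regional observability configuration against a standard baseline and reports any drift. The keys, values, and the find_drift helper are hypothetical; they show the shape of such a consistency check, not our actual configuration.

```python
# Hypothetical sketch: flag observability settings in a regional config that
# drift from the standard baseline. Keys and values are illustrative only.

from typing import Any

def find_drift(baseline: dict[str, Any], regional: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (baseline_value, regional_value)} for every mismatched setting."""
    keys = baseline.keys() | regional.keys()
    return {
        key: (baseline.get(key), regional.get(key))
        for key in keys
        if baseline.get(key) != regional.get(key)
    }

# Assumed standard regional logging setup.
baseline = {
    "log_format": "json",
    "log_shipper": "standard-agent",
    "structured_errors": True,
}

# Assumed legacy US configuration (the kind of drift behind this incident).
us_config = {
    "log_format": "plaintext",
    "log_shipper": "legacy-agent",
    "structured_errors": False,
}

for key, (expected, actual) in sorted(find_drift(baseline, us_config).items()):
    print(f"{key}: expected {expected!r}, found {actual!r}")
```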