Degraded upload video experience
Incident Report for Onfido
Postmortem

Summary

From 16:17 UTC until 17:01 UTC on 8 November 2023, a small subset (~0.4%) of our customers' end-users using our Facial Similarity: Video product experienced a degradation of the Onfido service, particularly at video upload time. The majority of applicants were able to retry and complete the upload on their second attempt. This problem was exclusive to our EU instance.

Root Causes

The issue was related to a recent feature for checking video integrity, which we had tested on increasing subsets of traffic before rolling it out fully (EU, US, CA). We observed CPU usage spikes when this feature was used to check video integrity, which impacted the performance of upload-related functionality. After we turned off the feature, CPU usage dropped and regular service performance was re-established.
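
For illustration only, a staged rollout with a kill switch can be sketched as below. This is a minimal sketch under stated assumptions, not a description of Onfido's actual implementation: the function name `in_rollout`, the feature name `video-integrity-check`, and the hashing scheme are all hypothetical. The idea is that each applicant is hashed into a stable bucket, so raising the percentage only ever adds traffic, and setting it to 0 disables the feature for everyone.

```python
import hashlib


def in_rollout(applicant_id: str, feature: str, percent: float) -> bool:
    """Return True if this applicant falls inside the rollout percentage.

    Hypothetical helper: the bucket is derived from a hash of the feature
    name and applicant ID, so the same applicant always lands in the same
    bucket for a given feature.
    """
    digest = hashlib.sha256(f"{feature}:{applicant_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% steps).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Hypothetical usage: the feature name and percentages are illustrative.
FEATURE = "video-integrity-check"
enabled_at_5 = in_rollout("applicant-123", FEATURE, 5.0)
enabled_at_0 = in_rollout("applicant-123", FEATURE, 0.0)  # kill switch: always False
```

Because the bucket is stable, an applicant included at a lower percentage stays included as the percentage grows, which keeps behaviour consistent across retries during a gradual rollout.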

Timeline

  • 16:17 UTC: Automated monitoring triggers an incident for API issues on the video upload endpoint
  • 16:20 UTC: Issue clears and is dismissed as likely related to client data (e.g., malformed video, burst of requests)
  • 16:39 UTC: Automated monitoring triggers again; an internal incident is opened and investigation starts
  • 17:01 UTC: Engineering team disables the video integrity check feature, the suspected culprit
  • 17:21 UTC: Problem is confirmed to have been mitigated
  • 17:23 UTC: Incident moves to Monitoring status
  • 18:26 UTC: Incident is deemed Resolved

Remedies

We are going to revise the logic for checking video integrity and evaluate more efficient ways of performing the check that don’t cause this sort of issue. (ETA: Q4 2023)

We will additionally revise the error containment mechanism we have in place (circuit breaking) and evaluate the need for it. In this particular case, some ripple-effect early failures were reported because the upstream service for checking video integrity was deemed “down” when in fact it was suffering from transient failures. (ETA: Q4 2023)
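
To make the circuit-breaking behaviour concrete, here is a minimal sketch of a generic circuit breaker. It is not Onfido's implementation: the class name, the thresholds, and the `clock` parameter are assumptions chosen for illustration. It shows how a run of consecutive failures marks a dependency as “down”, after which calls fail early without reaching it until a timeout allows a trial call through.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Generic breaker (hypothetical names and defaults)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            # Early failure: the dependency is not contacted at all.
            raise CircuitOpenError("dependency marked as down; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._consecutive_failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if (
                self.state == "half-open"
                or self._consecutive_failures >= self.failure_threshold
            ):
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

With a low `failure_threshold`, a short burst of transient errors is enough to open the breaker, and every request arriving during `reset_timeout` then fails early even if the dependency has already recovered. Tuning these two values, or counting only certain error types, is the kind of trade-off such a revision would weigh.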

Posted Nov 09, 2023 - 18:22 UTC

Resolved
This incident has been resolved.
Posted Nov 08, 2023 - 18:26 UTC
Monitoring
We've taken measures to try and stabilise the system. Our metrics indicate the problem has now been mitigated. We are continuing to investigate the root cause of this incident.
Posted Nov 08, 2023 - 17:23 UTC
Investigating
We are currently investigating this issue.
Posted Nov 08, 2023 - 17:01 UTC
This incident affected: Europe (onfido.com) (API, Facial Similarity, Known faces).