Known Faces service degradation

Incident Report for Onfido

Postmortem

Summary

Facial Similarity and Known Faces reports experienced degraded turnaround times in the US region. Processing slowed down and approximately 1.2% of Known Faces reports were withdrawn.

Root Causes

An imbalance in how data was distributed across the cluster caused sustained CPU spikes on one of the supporting nodes, which held a large portion of our data. This degraded performance, most visibly in turnaround time.
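As an illustration only, the kind of imbalance described above can be expressed as the ratio between the most loaded node and the cluster average. The sketch below is not our monitoring code: the function names, the node names, the document counts, and the 1.5 threshold are all hypothetical.

```python
from statistics import mean


def imbalance_ratio(docs_per_node: dict[str, int]) -> float:
    """Largest node's data volume divided by the mean (1.0 means perfectly even)."""
    sizes = list(docs_per_node.values())
    if not sizes or sum(sizes) == 0:
        raise ValueError("need at least one node holding data")
    return max(sizes) / mean(sizes)


def hot_nodes(docs_per_node: dict[str, int], threshold: float = 1.5) -> list[str]:
    """Names of nodes holding more than `threshold` times the mean volume."""
    avg = mean(docs_per_node.values())
    return sorted(name for name, size in docs_per_node.items() if size > threshold * avg)


# Hypothetical cluster: one node holds most of the data.
example = {"node-1": 700, "node-2": 100, "node-3": 100, "node-4": 100}
ratio = imbalance_ratio(example)
overloaded = hot_nodes(example)
```

With these made-up numbers the mean is 250 per node, so "node-1" holds 2.8 times the average and is the only node flagged.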

Timeline

21:13 GMT: Our monitoring systems automatically alert us to increased turnaround times in Known Faces reports, but the service auto-recovers quickly

22:10 GMT: Our monitoring systems alert us again, this time to increased turnaround times across Known Faces and Facial Similarity reports

23:42 GMT: Cluster health deteriorates and then returns to a healthy state, but we keep seeing issues. We identify a possible way to alleviate them: increasing the number of data nodes in the cluster

23:44 GMT: We scale up the number of data nodes in the cluster

01:37 GMT: Cluster is still unstable but shows intermittent signs of improvement, related to the ongoing rebalancing operations

03:20 GMT: Service fully stabilised, normal functioning is re-established

Remedies

The following day, the team outlined a plan to restore stability and prevent a recurrence. The root cause was a suboptimal data sharding strategy, so the main remedy was a long migration operation to change it. The new strategy allows hot indices of faces (e.g., clients holding a large number of users relative to other clients) to be sharded more evenly across our cluster, avoiding imbalances in how data is spread across our running data nodes.
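The effect of sharding a hot index more finely can be sketched with a toy placement model. This is a simplified illustration, not the actual migration or the placement logic of our search cluster: the greedy "largest shard onto the least loaded node" rule, the client names, and all sizes are assumptions made for the example.

```python
def place_shards(
    index_sizes: dict[str, float],
    shards_per_index: dict[str, int],
    node_count: int,
) -> list[float]:
    """Split each index into equal shards, then greedily place the largest
    shard on the least loaded node. Returns the resulting load per node."""
    shards: list[float] = []
    for name, size in index_sizes.items():
        count = shards_per_index[name]
        shards.extend([size / count] * count)

    loads = [0.0] * node_count
    for shard in sorted(shards, reverse=True):
        target = loads.index(min(loads))  # first least-loaded node
        loads[target] += shard
    return loads


# Hypothetical data: "client-a" is a hot index, six times larger than the rest.
sizes = {"client-a": 600, "client-b": 100, "client-c": 100, "client-d": 100}

# One shard per index: the hot index cannot be spread, so one node carries it all.
before = place_shards(sizes, {name: 1 for name in sizes}, node_count=3)

# Hot index split into six shards: every shard is the same size and spreads evenly.
after = place_shards(
    sizes,
    {"client-a": 6, "client-b": 1, "client-c": 1, "client-d": 1},
    node_count=3,
)
```

In this toy model `before` is `[600.0, 200.0, 100.0]`, with one node holding twice the average load, while `after` is `[300.0, 300.0, 300.0]`. Adding nodes alone would not help in the first case, because a single unsplit shard still has to live on one node.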

Posted Jan 22, 2025 - 11:29 UTC

Resolved

Known Faces reports in the US have been working normally since 03:20 UTC.
Posted Dec 27, 2024 - 10:21 UTC

Monitoring

The cluster is now more stable. We're continuing to monitor the situation.
Posted Dec 27, 2024 - 01:47 UTC

Update

We've identified unusual CPU activity in two of our search cluster nodes. In the meantime, we've provisioned additional nodes to mitigate its impact. We are continuing to investigate the root cause of this issue.
Posted Dec 27, 2024 - 00:52 UTC

Update

We are continuing to investigate the issue.
Posted Dec 27, 2024 - 00:32 UTC

Investigating

We are currently investigating higher processing times for Known Faces reports in the US region.
Posted Dec 26, 2024 - 23:45 UTC
This incident affected: USA (us.onfido.com) (Known faces).