Facial Similarity and Known Faces reports experienced degraded turnaround times in the US region, slowing the service and leading to ~1.2% of Known Faces reports being withdrawn.
A data imbalance in the cluster caused sustained CPU spikes on one of the supporting data nodes, which held a disproportionately large share of our data, degrading performance and, most visibly, turnaround times.
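The report does not name the underlying data store, but the terminology used here (data nodes, shards, rebalancing) suggests an Elasticsearch-like cluster. As a minimal sketch under that assumption, the snippet below shows how this kind of imbalance could be surfaced from per-node allocation stats; the endpoint, host and threshold are illustrative, not taken from our systems.

```python
# Illustrative sketch only: assumes an Elasticsearch-like cluster exposing the
# _cat/allocation API. Host, credentials and thresholds are hypothetical.
import requests

CLUSTER_URL = "http://localhost:9200"  # hypothetical cluster endpoint


def node_allocation():
    """Return per-node shard counts and disk usage as a list of dicts."""
    resp = requests.get(
        f"{CLUSTER_URL}/_cat/allocation",
        params={"format": "json", "bytes": "gb"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def report_imbalance(rows, ratio_threshold=2.0):
    """Flag nodes holding far more index data than the per-node average."""
    sized = [r for r in rows if r.get("disk.indices") not in (None, "")]
    if not sized:
        return
    avg = sum(float(r["disk.indices"]) for r in sized) / len(sized)
    for r in sized:
        used = float(r["disk.indices"])
        if avg and used / avg >= ratio_threshold:
            print(f"node {r['node']}: {used} GB vs cluster average {avg:.1f} GB")


if __name__ == "__main__":
    report_imbalance(node_allocation())
```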
21:13 GMT: We’re automatically alerted by our monitoring systems to increased turnaround times in Known Faces reports, but the issue auto-recovers quickly
22:10 GMT: We’re once again automatically alerted by our monitoring systems to increased turnaround times, now across both Known Faces and Facial Similarity reports
23:42 GMT: Cluster health deteriorates and then returns to a healthy state. However, issues persist, and we identify a possible mitigation: increasing the number of data nodes in the cluster
23:44 GMT: We scale up the number of data nodes in the cluster
01:37 GMT: Cluster is still unstable but shows intermittent signs of improvement, tied to the ongoing rebalancing operations (see the sketch after this timeline)
03:20 GMT: Service is fully stabilised and normal functioning is re-established
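As context for the mitigation above (adding data nodes and waiting for shards to rebalance), here is a minimal sketch of how that rebalancing could be tracked on an Elasticsearch-like cluster; the endpoint and polling interval are assumptions for illustration.

```python
# Illustrative sketch only: polls cluster health on an Elasticsearch-like
# cluster to watch shard rebalancing after new data nodes join.
# Endpoint and polling interval are hypothetical.
import time

import requests

CLUSTER_URL = "http://localhost:9200"  # hypothetical cluster endpoint


def wait_for_rebalance(poll_seconds=60):
    """Poll until no shards are relocating and the cluster reports green."""
    while True:
        health = requests.get(f"{CLUSTER_URL}/_cluster/health", timeout=10).json()
        relocating = health["relocating_shards"]
        status = health["status"]
        print(f"status={status} relocating_shards={relocating}")
        if relocating == 0 and status == "green":
            return
        time.sleep(poll_seconds)


if __name__ == "__main__":
    wait_for_rebalance()
```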
The following day, the team outlined a plan for restoring long-term stability and ensuring this wouldn’t happen again. The root cause was a suboptimal data sharding strategy. The main remediation is a long migration operation to change that strategy, so that hot indices of faces (e.g., those of clients holding a large number of users relative to other clients) are sharded more evenly across the cluster, rather than creating imbalances in how data is spread across our running data nodes.
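As an illustration of what such a migration could look like on an Elasticsearch-like cluster, the sketch below creates a more widely sharded replacement index with a per-node shard cap and copies data into it. The index names, shard counts and settings are hypothetical and are not taken from the actual migration plan.

```python
# Illustrative sketch only: one way to re-shard a "hot" index on an
# Elasticsearch-like cluster so its data spreads over more nodes.
# Index names, shard counts and endpoint are hypothetical.
import requests

CLUSTER_URL = "http://localhost:9200"  # hypothetical cluster endpoint


def reshard_hot_index(old_index, new_index, primary_shards=12):
    """Create a wider-sharded index, cap shards per node, and copy data over."""
    # 1. Create the new index with more primary shards and a per-node cap so
    #    no single data node can accumulate most of the hot index.
    settings = {
        "settings": {
            "index.number_of_shards": primary_shards,
            "index.routing.allocation.total_shards_per_node": 2,
        }
    }
    requests.put(f"{CLUSTER_URL}/{new_index}", json=settings, timeout=30).raise_for_status()

    # 2. Copy documents from the old index into the new one (long-running in
    #    practice; a real migration would run this asynchronously and in batches).
    body = {"source": {"index": old_index}, "dest": {"index": new_index}}
    resp = requests.post(
        f"{CLUSTER_URL}/_reindex",
        params={"wait_for_completion": "false"},
        json=body,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains the task id of the asynchronous reindex


if __name__ == "__main__":
    print(reshard_hot_index("faces_hot_client", "faces_hot_client_v2"))
```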