At around 10:29 UTC, roughly 10 minutes after restoring service in the EU and CA regions following the 10:05-10:17 UTC outage, we were alerted to a high error rate in the US region and noticed that the new ML model version was back online there. This impacted Turnaround Time (TaT) for all Document Reports and Studio Autofill tasks until 12:11 UTC.
Autofill Classic had a 100% error rate during the same time frame.
All impacted Document Reports and Studio Autofill tasks were completed successfully with an average TaT of ~40 minutes.
A previously rolled-back version of an ML model automatically went live again in the US region. The automated canary reversal did not succeed in that region, leaving the deployment in an inconsistent state. As a result, the model routing labels were not updated as expected during the next forced deployment, and traffic continued to be served by pods running the incorrect model version until we manually rolled back the model in the cluster. This was the result of a bug in our version of Helm.
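The failure mode above (pods still receiving traffic while running the wrong model version) is the kind of state a post-deployment check could detect. The following is a minimal sketch only, not our actual tooling: the `Pod` type, the `serving` and `model-version` label names, and the `find_misrouted_pods` helper are all hypothetical, and a real check would read pod metadata from the cluster rather than from an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod: a name plus its labels."""
    name: str
    labels: dict = field(default_factory=dict)


def find_misrouted_pods(pods, expected_version,
                        routing_label="serving",
                        version_label="model-version"):
    """Return names of pods that receive traffic but run an unexpected version.

    The label names are assumptions made for illustration; they are not
    the labels used in the real deployment.
    """
    return [
        pod.name
        for pod in pods
        if pod.labels.get(routing_label) == "true"
        and pod.labels.get(version_label) != expected_version
    ]


# Example: one pod is still routed to while running the rolled-back release.
pods = [
    Pod("model-a", {"serving": "true", "model-version": "v1"}),
    Pod("model-b", {"serving": "true", "model-version": "v2"}),
    Pod("model-c", {"serving": "false", "model-version": "v2"}),
]
misrouted = find_misrouted_pods(pods, expected_version="v1")
```

In this example only `model-b` is flagged, because `model-c` runs the unexpected version but is not labeled as serving traffic. A check along these lines, run after each deployment or rollback, would surface a mismatch between routing labels and the intended model version without relying on error-rate alerts alone.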
10:29 UTC: The new model release went live again in the US region only, after the previous automated canary rollback.
11:44 UTC: We manually redeployed a previous version via the CI/CD pipeline. This was not successful and errors continued.
12:00 UTC: We manually rolled back the version of the model directly in the cluster.
12:11 UTC: All services returned to normal.