On July 19th, our customers experienced an increased error rate in our US instance from 13:25 to 15:42 UTC. From 13:25 to 14:50 the error rate rose from ~0% to ~0.5%. It then climbed, peaking at 6%, until the problem was resolved at 15:42.
A routine infrastructure change was introduced on July 19th at 13:10 UTC, progressively rolling out to all our systems over the next few hours. The change was incompatible with a legacy configuration of our logging system in the US instance, leading to sporadic application errors.
Aggregate error rates were initially too low to trigger an alert, which delayed our response. Because the change also disrupted error logging, identifying the source of the problem was significantly harder.
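For illustration, the sketch below shows how an aggregate error-rate alert can miss a regression of this size: the errors from a single impacted service are diluted by healthy traffic, so the instance-wide rate stays under the alert threshold. The thresholds, service names, and traffic numbers are hypothetical and do not reflect our actual monitoring configuration.

```python
# Hypothetical illustration of why the incident was slow to alert: one impacted
# service pushes the instance-wide error rate to ~0.5%, which stays below an
# assumed 1% aggregate alert threshold. All names and numbers are made up.

AGGREGATE_THRESHOLD = 0.01     # assumed instance-wide alert threshold (1%)
PER_SERVICE_THRESHOLD = 0.005  # assumed per-service alert threshold (0.5%)

# Hypothetical per-service request/error counts during the first phase.
services = {
    "svc-a": {"requests": 50_000, "errors": 2_600},  # the impacted service
    "svc-b": {"requests": 250_000, "errors": 50},
    "svc-c": {"requests": 200_000, "errors": 40},
}

total_requests = sum(s["requests"] for s in services.values())
total_errors = sum(s["errors"] for s in services.values())
aggregate_rate = total_errors / total_requests

# Aggregate view: the regression is diluted across healthy services and no alert fires.
print(f"instance-wide: {aggregate_rate:.2%} -> alert: {aggregate_rate > AGGREGATE_THRESHOLD}")

# Per-service view: the impacted service clearly crosses its own threshold.
for name, stats in services.items():
    rate = stats["errors"] / stats["requests"]
    print(f"{name}: {rate:.2%} -> alert: {rate > PER_SERVICE_THRESHOLD}")
```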
13:10 - Infrastructure change is published and progressive rollout starts.
13:25 - The error rate starts rising toward ~0.5%.
13:50 - Monitoring shows that a single service is impacted and that a new version of it had just been released. This new version is initially suspected as the culprit.
14:20 - The service is rolled back to its previous version, but the error rate does not improve.
14:50 - Additional services become impacted and the error rate climbs toward its 6% peak.
15:15 - The root cause is identified and a fix (rollback) is in preparation.
15:25 - The rollback is ready and the release is being prepared.
15:42 - The fix is fully rolled out and the error rate is back to ~0%.
Amend the legacy observability configuration in the US instance to restore consistency with our standard regional setup.
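As an illustration of that action item, the sketch below compares a regional observability configuration against a standard baseline and reports any drift. The keys, values, and the find_drift helper are hypothetical; they show the shape of such a consistency check, not our actual configuration.

```python
# Hypothetical sketch: flag observability settings in a regional config that
# drift from the standard baseline. Keys and values are illustrative only.

from typing import Any

def find_drift(baseline: dict[str, Any], regional: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (baseline_value, regional_value)} for every mismatched setting."""
    keys = baseline.keys() | regional.keys()
    return {
        key: (baseline.get(key), regional.get(key))
        for key in keys
        if baseline.get(key) != regional.get(key)
    }

# Assumed standard regional logging setup.
baseline = {
    "log_format": "json",
    "log_shipper": "standard-agent",
    "structured_errors": True,
}

# Assumed legacy US configuration (the kind of drift behind this incident).
us_config = {
    "log_format": "plaintext",
    "log_shipper": "legacy-agent",
    "structured_errors": False,
}

for key, (expected, actual) in sorted(find_drift(baseline, us_config).items()):
    print(f"{key}: expected {expected!r}, found {actual!r}")
```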