On Monday 4 March 2024, we ran a routine infrastructure component upgrade on the compute layer of our EU region. We regularly perform this upgrade with a Blue/Green strategy: a secondary compute cluster is created, all services are deployed to it, and traffic is then progressively switched from the primary cluster to the secondary one. This process had been validated on multiple test environments and had already been applied successfully to our other production regions. However, an inconsistency in the networking configuration of the EU installation caused traffic to our API authorization service to remain on the primary cluster. As a result, when the primary cluster was scaled down, our API was unavailable for 15 minutes.
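The failure mode above, scaling down a cluster that one service still depends on, can be guarded against with a pre-scale-down check. The following is a minimal sketch of such a check, not a description of our actual tooling: the function names, the service names, and the idea of feeding it per-service request rates measured on the primary cluster are all assumptions made for illustration.

```python
from typing import Dict, List


def services_still_on_primary(
    primary_rps: Dict[str, float], threshold_rps: float = 0.0
) -> List[str]:
    """Return the services that still receive traffic on the primary cluster.

    primary_rps maps a (hypothetical) service name to the request rate
    observed for that service on the primary cluster.
    """
    return sorted(
        name for name, rps in primary_rps.items() if rps > threshold_rps
    )


def safe_to_scale_down(
    primary_rps: Dict[str, float], threshold_rps: float = 0.0
) -> bool:
    """Allow the scale-down only when no service exceeds the threshold."""
    return not services_still_on_primary(primary_rps, threshold_rps)


# Illustrative values only: one service was never switched over.
observed = {"api-authorization": 120.0, "document-processing": 0.0}
if not safe_to_scale_down(observed):
    print("Blocked; still on primary:", services_still_on_primary(observed))
```

A check of this kind measures where traffic actually goes rather than where the configuration says it should go, which is the distinction that matters when the configuration itself is inconsistent.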
During the outage, many customers queued requests on their side and retried them once we restored the service, so the restored API faced a particularly high surge of traffic. Because the new cluster had received no traffic for 15 minutes, automated scaling procedures had started to scale it down. As a result, a key document processing service could not handle the surge. Scaling it back up took longer than we aim for, and document reports were unavailable during that time.
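One way to avoid this kind of scale-down is to hold the new cluster at a capacity floor for the duration of a cutover, so that a temporary absence of traffic does not shrink it. The sketch below illustrates the idea under stated assumptions: the function, its parameters, and the numbers are hypothetical and do not reflect our real autoscaler configuration.

```python
import math
from typing import Optional


def desired_replicas(
    current_rps: float,
    rps_per_replica: float,
    min_replicas: int,
    hold_floor: Optional[int] = None,
) -> int:
    """Compute a replica count from demand, never dropping below a floor.

    hold_floor is a hypothetical temporary floor (for example the replica
    count of the old cluster) applied while a cutover is in progress.
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    demand = math.ceil(current_rps / rps_per_replica)
    floor = max(min_replicas, hold_floor or 0)
    return max(demand, floor)


# With no traffic and no floor, the service shrinks to its minimum.
print(desired_replicas(0, 50, min_replicas=2))
# With a cutover floor, capacity is retained for the retry surge.
print(desired_replicas(0, 50, min_replicas=2, hold_floor=20))
```

The trade-off is cost: the floor keeps idle capacity running during the cutover window, in exchange for not having to scale up from the minimum under a retry surge.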
Overall, this meant 15 minutes of downtime for report creation, followed by a further 50 minutes of disruption to document report creation.