At 6.06am UTC on 29 Aug 2023, a database critical to our manual processing came under increased load. We continued to accept manual tasks, but their processing was significantly delayed.
The incident was mitigated at 7.32am UTC by rolling back non-critical changes that were putting load on the database. Manual processing then resumed with a backlog, which was cleared by 2.29pm UTC, closing the incident. Reports created between the start of the incident and the clearing of the backlog saw increased turnaround times, reaching up to an hour at the 95th percentile for checks created around 6.30am UTC.
Checks processed automatically were not impacted.
We have not identified why the load increased unexpectedly, as it does not correlate with any increase in activity or data size. However, a few things could have helped us resolve the incident faster.
First, the incident was not routed to the responsible team immediately, which delayed the investigation.
Second, the system responsible for feature activations did not let us deactivate features immediately, as we expected it to.
All times UTC
06:04 UTC: The load on one of our databases increases significantly and unexpectedly.
06:11 UTC: The monitoring system raises an alert and on-call support is activated. The degradation in manual processing is identified and more engineers are called in to help. The team starts manually scaling up services to mitigate the issue.
06:28 UTC: The load on the database recovers without explanation.
06:52 UTC: The database degrades again.
07:08 UTC: A non-critical query, introduced the day before, is identified as putting load on the database.
07:28 UTC: The change is rolled back and manual processing recovers to normal throughput. The backlog of tasks is cleared over the course of the day.
14:29 UTC: The backlog is fully cleared.
First, we will continue investigating what caused the increased load on our database.
Second, we will review alerts to make sure the correct team is alerted first next time.
Third, we will conduct tests with our feature-flag framework to understand why it failed to deactivate features.