Between 11:47 and 12:10 UTC, 55% of document reports could not be processed due to a partial failure of the fraud detection service. From 12:10 onwards, traffic was processed as usual and we began rerunning the failed reports. Reports that required manual processing saw additional delays of up to 2 hours while the backlog was cleared.
A sudden spike in CPU usage on the impacted fraud detection service lasted a few minutes and caused errors that triggered retry policies. The service did not scale quickly enough to handle both the ongoing traffic and the retries, so a portion of report processing was halted and the affected reports were stored in dead-letter queues for processing once the system stabilized.
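For illustration, the sketch below shows the general retry-then-park pattern described above: a report whose fraud check fails is retried a few times and, if it still cannot be processed, is stored in a dead-letter queue to be replayed once the service has stabilized. The queue names, retry limit, and helper functions are hypothetical placeholders, not the actual implementation or queueing technology behind this system.

```python
# Minimal sketch of the retry + dead-letter-queue flow described above.
# All names and limits here are illustrative assumptions.
import queue
import random
import time

MAX_RETRIES = 3

work_queue = queue.Queue()         # incoming report-processing jobs
dead_letter_queue = queue.Queue()  # jobs parked after exhausting retries

def fraud_check(report):
    """Stand-in for the fraud detection call that errored under high CPU."""
    if random.random() < 0.5:  # simulate the partial-failure window
        raise RuntimeError("fraud detection service unavailable")
    return {"report": report, "fraud_score": 0.01}

def process_report(report, attempt=1):
    try:
        return fraud_check(report)
    except RuntimeError:
        if attempt < MAX_RETRIES:
            # Retry policy: back off briefly, then try again. During the
            # incident, these retries added load the service could not absorb.
            time.sleep(0.1 * attempt)
            return process_report(report, attempt + 1)
        # After exhausting retries, park the job instead of dropping it.
        dead_letter_queue.put(report)
        return None

def replay_dead_letters():
    """Rerun parked jobs once the service has stabilized (the 12:12 UTC step)."""
    parked = []
    while not dead_letter_queue.empty():
        parked.append(dead_letter_queue.get())
    for report in parked:
        process_report(report)

if __name__ == "__main__":
    for i in range(10):
        work_queue.put(f"report-{i}")
    while not work_queue.empty():
        process_report(work_queue.get())
    print(f"parked in DLQ: {dead_letter_queue.qsize()}")
    replay_dead_letters()
```

Parking exhausted jobs rather than dropping them is what allowed every failed report to be rerun from 12:12 UTC onwards.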
11:47 UTC: High CPU usage on the fraud detection service leads to errors
11:50 UTC: CPU usage is back to normal; reports that errored out are being retried
11:51 UTC: The fraud detection service fails to scale enough to handle both the normal traffic and the retries
11:51 UTC: The on-call team is alerted to a high error rate on the fraud detection service and starts investigating
12:09 UTC: The on-call team identifies the root cause and manually scales up the service
12:10 UTC: Errors have stopped and reports are processed normally
12:12 UTC: We start rerunning reports that failed during the incident
13:20 UTC: All reports that did not require manual review are now completed
13:50 UTC: All reports that required manual review have been processed
Review the autoscaling capabilities of the impacted service as well as other services that share a similar architecture.
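As a starting point for that review, a rough headroom check like the sketch below can flag services whose autoscaling ceiling cannot absorb normal traffic plus a retry burst of the kind seen between 11:50 and 11:51 UTC. All service names, throughput figures, and scaling limits in the example are placeholder assumptions, not measured values from this incident.

```python
# Back-of-the-envelope autoscaling headroom check. Every figure below is a
# placeholder assumption used only to illustrate the kind of review proposed.
from dataclasses import dataclass
from math import ceil

@dataclass
class ServiceScalingProfile:
    name: str
    normal_rps: float          # steady-state requests per second
    per_replica_rps: float     # throughput one replica can sustain
    max_replicas: int          # autoscaler ceiling
    retry_burst_factor: float  # extra load when failed requests are retried

def required_replicas(profile: ServiceScalingProfile) -> int:
    # Peak load = normal traffic plus the retry burst on top of it.
    peak_rps = profile.normal_rps * (1 + profile.retry_burst_factor)
    return ceil(peak_rps / profile.per_replica_rps)

def check_headroom(profile: ServiceScalingProfile) -> None:
    needed = required_replicas(profile)
    status = "OK" if needed <= profile.max_replicas else "UNDER-PROVISIONED"
    print(f"{profile.name}: needs {needed} replicas at peak, "
          f"max is {profile.max_replicas} -> {status}")

if __name__ == "__main__":
    # Hypothetical profiles for the impacted service and a similarly
    # architected one.
    for profile in [
        ServiceScalingProfile("fraud-detection", 400, 50, 10, 0.6),
        ServiceScalingProfile("document-classification", 250, 40, 6, 0.6),
    ]:
        check_headroom(profile)
```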