A component in the automatic processing pipeline for Document Reports experienced a spike of timeout errors from 9:05pm to 9:20pm UTC in the EU cluster.
All Document Reports created between 9:20pm and 9:40pm UTC were processed by manual analysts, resulting in a higher turnaround time (TaT).
Two faulty nodes in our production cluster temporarily slowed down the execution of a CPU-intensive component.
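For illustration, a per-node tail-latency comparison is one way such faulty nodes can be pinpointed. The sketch below uses hypothetical latency samples, node names, and thresholds; it is not the exact tooling we used.

```python
from statistics import median

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

# (node, request latency in ms) samples; values and node names are illustrative.
samples = [
    ("node-a", 120), ("node-a", 140), ("node-a", 135),
    ("node-b", 2400), ("node-b", 2100), ("node-b", 2600),
    ("node-c", 110), ("node-c", 150), ("node-c", 130),
]

by_node = {}
for node, ms in samples:
    by_node.setdefault(node, []).append(ms)

node_p99 = {node: p99(vals) for node, vals in by_node.items()}
baseline = median(node_p99.values())

# Flag nodes whose tail latency is far above the fleet median; 5x is illustrative.
suspects = [node for node, v in node_p99.items() if v > 5 * baseline]
print("suspect nodes:", suspects)  # -> ['node-b']
```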
9:21pm UTC: Elevated error rates for the affected component triggered an on-call alert.
9:28pm UTC: We identified two nodes in our cluster as the cause of the slow CPU-intensive executions.
9:33pm UTC: We restarted the two nodes (one way to take such nodes out of rotation is sketched after the timeline).
9:40pm UTC: The affected component recovered fully.
9:41pm UTC: We observed a backlog of reports and raised a public incident to inform customers of the expected time to clear it.
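For context, here is a minimal sketch of taking suspect nodes out of rotation before restarting them. It assumes a Kubernetes cluster and the official `kubernetes` Python client; the orchestrator and node names are assumptions, not details from this report.

```python
# Minimal sketch, not our exact remediation: cordon suspect nodes so no new
# work is scheduled on them, after which they can be drained and restarted.
# Assumes Kubernetes and the official `kubernetes` Python client; node names
# are hypothetical.
from kubernetes import client, config

SUSPECT_NODES = ["node-b", "node-e"]

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for name in SUSPECT_NODES:
    # Equivalent to `kubectl cordon <node>`: mark the node unschedulable.
    v1.patch_node(name, {"spec": {"unschedulable": True}})
    print(f"cordoned {name}; ready to drain and restart")
```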