After a change to an internal service used for document processing, we began seeing increased latency. The impact on automated reports was minimal, but the added latency put a client service responsible for creating manual tasks under stress, ultimately leading to slower manual processing turnaround times and a build-up of queued reports. After the problematic service was stabilized, it took some time to clear the report backlog.
A bug was introduced into an internal service used for document processing, and the source of the problem proved difficult to troubleshoot. Specifically, a routine feature release picked up a new API contract from a service it depends on, and that contract conflicted with existing validation logic. Legacy error-handling code suppressed the validation failures, which masked the root cause. Although the errors were never reported, the sheer volume of exceptions being raised and handled degraded the service's performance.
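To illustrate the failure mode, the sketch below is a minimal, hypothetical reconstruction (the service name, the renamed field, and the handler are assumptions, not our actual code): a broad legacy exception handler silently drops validation failures caused by a changed API contract, so nothing surfaces in error reporting, yet the cost of raising and handling an exception for nearly every document still accumulates as latency in the processing path.

```python
import logging
import time

logger = logging.getLogger("document_processor")  # hypothetical service name


class ValidationError(Exception):
    """Raised when a document payload does not match the expected contract."""


def validate_payload(payload: dict) -> None:
    # Hypothetical contract change: the dependency renamed 'doc_id' to
    # 'documentId', so every payload produced under the new contract fails here.
    if "doc_id" not in payload:
        raise ValidationError("missing required field 'doc_id'")


def process_documents(payloads: list[dict]) -> int:
    processed = 0
    for payload in payloads:
        try:
            validate_payload(payload)
            processed += 1
        except ValidationError:
            # Legacy behaviour: swallow the failure and move on. Nothing is
            # reported, so dashboards stay green, but raising and handling an
            # exception for (nearly) every document adds overhead that
            # compounds under load.
            logger.debug("validation failed; skipping document")
            continue
    return processed


if __name__ == "__main__":
    # Every payload uses the new field name, so every one raises and is suppressed.
    batch = [{"documentId": str(i)} for i in range(100_000)]
    start = time.perf_counter()
    done = process_documents(batch)
    print(f"processed {done} documents in {time.perf_counter() - start:.2f}s")
```

Under the old contract the exception path is rare, so the handler's cost is invisible; once nearly every payload trips validation, the service spends a meaningful share of its time constructing and unwinding exceptions, which is consistent with the latency we observed arriving without any corresponding error signal.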
12:10 UTC: A service deployment introduces increased latency, impacting downstream services.
12:24 UTC: An internal monitor alerts on an unusual number of manual task creation failures.
12:40 UTC: The service observed to be under stress is manually scaled up, alleviating the immediate pressure and resolving all issues related to automated processing.
13:00 UTC: A modest manual processing backlog has accumulated; latency continues to impact average turnaround times.
13:57 UTC: Sustained degradation of manual report turnaround times triggers a public incident report. Investigation into the root cause continues.
14:20 UTC: Most of the report backlog has been cleared. Turnaround times are back to normal levels.
14:35 UTC: The root cause is determined and linked to the earlier change. The service deployment is rolled back to a known stable state.