Summary
On June 9th, the public API experienced degraded performance between 09:45 UTC and 10:05 UTC. A deployment of an internal system introduced unexpected behaviour which prevented document reports from being processed.
During this timeframe, no data was lost; checks created in this window experienced increased turnaround time until the incident was resolved.
The deployment performed a database maintenance operation which, despite the security checks and evaluations, created an exclusive lock that had to be released manually.
Root Causes
- A planned deployment was made to an internal service
- The deployment included a database migration that led to a database lock
- The on-call incident team manually stopped the migration and fixed the underlying issue (see the sketch after this list)
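To make the manual intervention concrete, the sketch below shows one way an on-call engineer might locate the session holding the exclusive lock and terminate it. This is a minimal illustration that assumes the database is PostgreSQL and uses psycopg2; the connection string is a placeholder and none of these details are confirmed by the incident report.

```python
# Minimal sketch: find and terminate the session holding an exclusive lock.
# Assumes PostgreSQL and psycopg2; the DSN below is a placeholder.
import psycopg2

FIND_BLOCKERS = """
SELECT a.pid, a.query, a.query_start
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.mode = 'AccessExclusiveLock' AND l.granted;
"""

def terminate_exclusive_lock_holders(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(FIND_BLOCKERS)
            for pid, query, started in cur.fetchall():
                print(f"Terminating pid={pid}, running since {started}: {query[:80]}")
                # pg_terminate_backend() ends the backend and releases its locks.
                cur.execute("SELECT pg_terminate_backend(%s);", (pid,))

if __name__ == "__main__":
    terminate_exclusive_lock_holders("dbname=reports user=oncall")  # placeholder DSN
```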
Timeline
(Times are in UTC)
- 09:45: The migration is deployed in production;
- 09:53: The monitoring system alerts the team that the database is locked;
- 10:00: The incident team manually stops the migration and fixes the underlying issue;
- 10:05: Database load returns to normal.
Remedies
- To prevent similar incidents, programmatic health checks will be put in place ahead of similar migrations to verify the validity of the operations before they are performed. In addition, a timeout will be enforced on database operations, so that any operation running longer than the established limit is stopped (see the sketch below).
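As an illustration of the timeout remedy, the sketch below runs a migration statement with lock_timeout and statement_timeout set, so the operation aborts instead of holding an exclusive lock indefinitely. It is a minimal example assuming PostgreSQL and psycopg2 (2.8+ for the error classes); the timeout values, connection string, and DDL statement are illustrative placeholders, not the actual migration.

```python
# Minimal sketch: run a migration with enforced timeouts so it cannot
# hold an exclusive lock indefinitely. Assumes PostgreSQL + psycopg2 >= 2.8;
# the DSN, timeout values, and DDL are illustrative placeholders.
import psycopg2
from psycopg2 import errors

def run_migration_with_timeouts(dsn: str, ddl: str) -> bool:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Fail fast if the lock cannot be acquired within 5 seconds,
            # and cancel any statement that runs longer than 10 minutes.
            cur.execute("SET lock_timeout = '5s';")
            cur.execute("SET statement_timeout = '10min';")
            cur.execute(ddl)
        conn.commit()
        return True
    except (errors.LockNotAvailable, errors.QueryCanceled) as exc:
        # A timeout fired: roll back and report instead of leaving the
        # database locked.
        conn.rollback()
        print(f"Migration aborted by timeout: {exc}")
        return False
    finally:
        conn.close()

if __name__ == "__main__":
    ok = run_migration_with_timeouts(
        "dbname=reports user=deployer",  # placeholder DSN
        "ALTER TABLE document_reports ADD COLUMN archived boolean;",  # placeholder DDL
    )
    print("migration succeeded" if ok else "migration rolled back")
```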