EU Dashboard Unavailability

Incident Report for Onfido

Postmortem

Summary

On September 23, 2021, the Dashboard was intermittently unavailable in the EU region from 15:30 UTC to 18:30 UTC due to an outage in the underlying data store that backs its search index.

Root Causes

A partition affecting the data store's shards increased latency on 20% of read and write operations, which ultimately timed out. Whenever one of these operations timed out, the Dashboard failed to load.
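As a rough illustration of why a 20% per-operation failure rate shows up as intermittent page failures, the sketch below computes the chance that a page load fails when it depends on several data store operations. The number of operations per page load is not stated in this report, and the operations are assumed to fail independently, so the figures are illustrative only.

```python
def page_failure_probability(per_call_failure_rate: float, calls_per_page: int) -> float:
    """Chance that at least one of `calls_per_page` independent calls fails."""
    # The page loads only if every call succeeds, so take the complement.
    return 1.0 - (1.0 - per_call_failure_rate) ** calls_per_page


# 0.20 is the failure rate from this report; the call counts are hypothetical.
for calls in (1, 3, 5):
    probability = page_failure_probability(0.20, calls)
    print(f"{calls} call(s) per page: {probability:.1%} of loads fail")
```

Under these assumptions, a page that issues a single operation fails one time in five, and the failure rate climbs quickly as more operations are required per load.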

Timeline

15:30 UTC: Dashboard (EU) started having intermittent usage issues

16:30 UTC: We identified the root cause in the search index data store

17:45 UTC: Fix was applied to the search index data store

18:30 UTC: Incident was resolved

Remedies

As immediate remediation, the partition was manually restored, which returned read and write times to normal. Data that failed to be written during the incident was eventually made consistent automatically.

We are committed to preventing issues of this kind from having a similar impact in the future. Therefore, we will:

Prevent issues in the data store from blocking access to unrelated Dashboard features, as a means of graceful service degradation;
Include specific alarms for shard partitions to allow for faster action by the engineering team.
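
The first item above, graceful degradation, can be sketched as follows. This is a minimal illustration and not our actual implementation: the function names, the page structure, and the 0.2 second timeout are all hypothetical. The idea is that a slow or failing search call marks search as unavailable instead of blocking the whole page.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical time budget for the search call; not a value from this incident.
SEARCH_TIMEOUT_SECONDS = 0.2

_executor = ThreadPoolExecutor(max_workers=4)


def load_dashboard(search_fn, timeout=SEARCH_TIMEOUT_SECONDS):
    """Build the page while treating the search index as an optional dependency."""
    page = {"navigation": "ok", "search_results": None, "search_available": True}
    future = _executor.submit(search_fn)
    try:
        page["search_results"] = future.result(timeout=timeout)
    except FutureTimeout:
        # Search exceeded its budget: serve the rest of the page without it.
        page["search_available"] = False
    except Exception:
        # Search failed outright: degrade in the same way.
        page["search_available"] = False
    return page


def healthy_search():
    """Hypothetical search call that answers immediately."""
    return ["result-1", "result-2"]


def slow_search():
    """Hypothetical search call that exceeds the time budget."""
    time.sleep(0.5)
    return ["never shown"]


print(load_dashboard(healthy_search)["search_available"])
print(load_dashboard(slow_search)["search_available"])
```

With this shape, unrelated features still render while search is degraded, and the `search_available` flag lets the page show a notice in place of results.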

Posted Oct 08, 2021 - 16:28 UTC

Resolved

This issue is now resolved.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.

Posted Sep 23, 2021 - 20:05 UTC

Monitoring

We have implemented a fix for this issue.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.

We will provide additional updates at 20:00 UTC.

Posted Sep 23, 2021 - 18:57 UTC

Update

We are continuing to work on a fix for this issue. An update will be provided at 19:00 UTC.

Posted Sep 23, 2021 - 18:00 UTC

Update

We are continuing to work on a fix for this issue. An update will be provided at 18:00 UTC.

Posted Sep 23, 2021 - 17:08 UTC

Identified

We have identified the root cause and are currently working on fixing the issue.
Please bear with us while we get back on our feet. We will provide additional updates at 17:00 UTC.

Posted Sep 23, 2021 - 16:36 UTC

Investigating

We're currently experiencing general unavailability impacting the Dashboard, affecting all clients in the EU region.
We are currently investigating the root cause and will provide an update at 16:30 UTC.

Posted Sep 23, 2021 - 15:52 UTC

This incident affected: Europe (onfido.com) (Dashboard).