Outage in the US region

Incident Report for Onfido

Postmortem

Summary

On October 20th, our US regional instance experienced an outage over two distinct periods: from 06:48 UTC to 09:27 UTC, and again from 14:12 UTC to 18:43 UTC (fully normalizing by 19:06 UTC). This was caused by a series of major service disruptions at our infrastructure provider, AWS. During these periods, we were unable to accept and process reports, and access to supporting systems, such as Dashboard, was largely unavailable.

Root Causes

This incident was caused by a large-scale AWS outage in US-EAST-1, the region where we host our US instance. The problem began with a networking failure internal to AWS, which led to cascading failures across a wide range of AWS services in the region. Unfortunately, the issue impacted all Availability Zones (AZs) in the region, which rendered our multi-AZ redundancy ineffective.
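To illustrate the limitation, the sketch below is a minimal, hypothetical example using boto3 (the Auto Scaling group name is invented and this is not our production tooling). It counts healthy instances per Availability Zone for one group: a check like this confirms redundancy against the loss of a single AZ, but offers no protection when, as on October 20th, every AZ in the region degrades at once.

```python
# Minimal illustrative sketch (hypothetical group name, not our tooling):
# inspect which Availability Zones an Auto Scaling group spans and how many
# healthy instances it has in each. Multi-AZ redundancy of this kind protects
# against the loss of a single AZ, not against a region-wide failure.
from collections import Counter

import boto3

ASG_NAME = "example-report-processing-asg"  # hypothetical name


def healthy_instances_per_az(asg_name: str) -> Counter:
    """Count in-service, healthy instances per Availability Zone for one ASG."""
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"No Auto Scaling group named {asg_name!r}")

    counts: Counter = Counter()
    for instance in groups[0]["Instances"]:
        if (
            instance["LifecycleState"] == "InService"
            and instance["HealthStatus"] == "Healthy"
        ):
            counts[instance["AvailabilityZone"]] += 1
    return counts


if __name__ == "__main__":
    per_az = healthy_instances_per_az(ASG_NAME)
    print(per_az)
    # Redundancy only helps while at least one AZ remains healthy; on
    # October 20th every AZ in us-east-1 was affected at the same time.
    if sum(1 for count in per_az.values() if count > 0) <= 1:
        print("WARNING: capacity concentrated in a single (or zero) AZ")
```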

Timeline

First outage period

06:48 UTC: API traffic into the US instance stops. No checks or workflows are being created for customers in the affected region.

06:53 UTC: Multiple on-call teams begin receiving alerts of issues in the US region.

07:07 UTC: On-call responders note errors across various internal components, including applications failing to connect to database services.

07:16 UTC: We confirm an operational issue with our infrastructure provider, as reported on the AWS Service Health Dashboard at 07:11. The incident report refers to “increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region”, the region where we host our US instance. API Gateway is one of the many impacted services, which explains why we are not receiving any inbound traffic.

07:17 UTC: Attempts to create a public incident are unsuccessful because StatusPage itself is unavailable, impacted by the same AWS outage. On-call responders notify internal customer support to make them aware.

07:21 - 09:00 UTC: Attempts to address issues with specific services, such as DynamoDB, RDS and SQS, all fail, as we are unable to reliably access and administer the services. We continue to monitor progress with AWS and watch our internal monitors for any signs of recovery.

09:01 UTC: AWS reports that they have identified a potential root cause for the incident, relating to internal networking services. The failure initially impacted DynamoDB, which in turn triggered the cascade of failures across many other AWS services, since many of them themselves depend on DynamoDB.

09:22 UTC: AWS reports that they have applied initial mitigations. Shortly after, we observe signs of recovery, with inbound traffic resuming and backend services stabilizing.

09:27 UTC: Our services are once again operational and incident response transitions into post-resolution monitoring.

09:53 UTC: We publicly report the incident once StatusPage becomes available again.

10:07 UTC: Having closely monitored our systems for a further 40 minutes after they stabilized, our on-call team declares the incident closed.

Second outage period

13:39 UTC: Our on-call team begins to receive alerts of elevated errors in some internal services in the US region. On investigation, we find that our cluster is failing to provision new EC2 nodes. This adds strain to the system: we are entering morning hours in the US, and our internal services are prevented from auto-scaling up because capacity is unavailable. These issues are not yet causing significant customer impact, but processing latency is increasing (a short illustrative sketch of this scaling failure follows the timeline).

13:45 UTC: We publish an incident on our StatusPage alerting customers to the recurrence of issues in the US instance.

14:12 UTC: The issues escalate significantly, and our system stops processing reports. In the following couple of minutes, we also stop receiving new API requests.

14:14 UTC: AWS confirms significant API errors and connectivity issues across multiple services in the US-EAST-1 region.

15:03 UTC: AWS reports that they have identified the root cause as an issue originating within the EC2 internal network, which corresponds with the EC2 provisioning failures we continue to experience. We attempt to manually provision new nodes, but this also fails.

16:13 UTC: AWS shares that additional mitigation steps are being applied to aid recovery.

16:55 UTC: We begin receiving API traffic again, albeit with intermittent disruptions. However, backend systems remain unstable.

17:03 - 17:38 UTC: AWS continues to apply mitigations and begins rolling out a fix for the EC2 launch failures, one AZ at a time.

17:22 UTC: Our team continues to see progressive signs of recovery, but some instability remains. Reports are being processed, though only at a low rate.

18:08 UTC: API traffic normalizes and report processing rates begin to increase.

18:43 - 19:06 UTC: Report processing recovers rapidly over this period, climbing from approximately 50% of normal levels back to normal.

19:15 UTC: AWS provides an update confirming recovery across all AWS services, with instance launches having normalized across multiple AZs.

19:32 UTC: After seeing sustained stability over the past 30-40 minutes, our on-call team updates StatusPage to report that our service has recovered.

20:19 UTC: After a further period of monitoring, we declare the incident closed.
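As referenced at 13:39 UTC, scale-up was blocked during the second outage because EC2 could not launch new instances. The sketch below is again hypothetical (boto3 with an invented Auto Scaling group name, not our tooling); it shows one way such failures surface: as non-successful scaling activities on the group, whose status messages describe the launch or capacity problem.

```python
# Minimal illustrative sketch (hypothetical group name, not our tooling):
# list recent scaling activities on an Auto Scaling group that did not
# complete successfully, e.g. launches that failed for lack of EC2 capacity.
import boto3

ASG_NAME = "example-report-processing-asg"  # hypothetical name


def recent_failed_scaling_activities(asg_name: str, limit: int = 20) -> list[dict]:
    """Return recent scaling activities whose status is not Successful."""
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    response = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=asg_name, MaxRecords=limit
    )
    return [
        activity
        for activity in response["Activities"]
        if activity["StatusCode"] != "Successful"
    ]


if __name__ == "__main__":
    for activity in recent_failed_scaling_activities(ASG_NAME):
        # StatusMessage explains why the launch failed, e.g. insufficient capacity.
        print(
            activity["StartTime"],
            activity["StatusCode"],
            activity.get("StatusMessage", ""),
        )
```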

Posted Oct 24, 2025 - 09:34 UTC

Resolved

This incident is now resolved.

We apologize for the disruption caused. A detailed postmortem will follow once we've concluded our investigations.
Posted Oct 20, 2025 - 20:19 UTC

Update

Report processing has recovered over the past 40 minutes, and all services are stabilizing.

We are continuing to monitor.
Posted Oct 20, 2025 - 19:32 UTC

Update

AWS are reporting progressive improvements. We are intermittently processing a limited number of transactions, but our services remain unstable.
Posted Oct 20, 2025 - 18:48 UTC

Monitoring

Some AWS services are showing signs of intermittent recovery, but disruption remains widespread and our services remain critically impacted.

We continue to monitor the situation.
Posted Oct 20, 2025 - 17:22 UTC

Update

In the latest update from AWS, multiple services continue to be impacted. In particular, network connectivity issues are still preventing us from servicing requests.

We continue to monitor the situation.
Posted Oct 20, 2025 - 15:13 UTC

Identified

The incident has regressed, and we are now facing another outage.

Since 14:13 UTC, we have not been able to receive requests or process reports.

We continue to monitor the situation with AWS.
Posted Oct 20, 2025 - 14:29 UTC

Update

Continued issues with our hosting provider, AWS, have resulted in recurring problems, and our US service is experiencing degraded performance.

We continue to investigate this issue.
Posted Oct 20, 2025 - 13:54 UTC

Investigating

We are currently investigating this issue.
Posted Oct 20, 2025 - 13:45 UTC
This incident affected: USA (us.onfido.com) (API, Dashboard, Document Verification, Facial Similarity, Watchlist, Identity Enhanced, Webhooks, Known faces, Autofill, Device Intelligence).