Elevated error rate affecting document checks
Incident Report for Onfido
Postmortem

Summary

Between 00:00 and 03:10 UTC on July 31st, document checks for documents with PDF417 barcodes (primarily US driving licences) erroneously returned a result of “consider”, because the data consistency comparison (OCR-extracted data versus barcode-extracted data) was flagged. This affected customers using our default “missing barcode” configuration, which requires the barcode to be present for a check to process successfully and otherwise results in a “caution” response.

Root Cause

At 2021-07-31 00:00 UTC, a licence for a 3rd party library utilised by the underlying barcode-extraction service expired, causing the component to fail to correctly identify and extract PDF417 barcodes. This caused upstream document checks to fail when comparing OCR extracted data and barcode extracted data, as the latter was not present.
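As an illustration only, an expiry like the one described above can be surfaced ahead of time by a scheduled check that compares the licence expiry timestamp with the current time. The sketch below is a minimal example and not the service's actual code; the function names and the 30-day warning window are assumptions.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(expires_at: datetime, now: datetime) -> float:
    """Return the days remaining until expiry (zero or negative once expired)."""
    return (expires_at - now) / timedelta(days=1)


def expiry_status(expires_at: datetime, now: datetime, warn_days: int = 30) -> str:
    """Classify a licence as 'ok', 'expiring_soon' or 'expired'.

    warn_days is an assumed lead time, not a value from the incident report.
    """
    remaining = days_until_expiry(expires_at, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring_soon"
    return "ok"


# Expiry time taken from the report; the check times are illustrative.
EXPIRES_AT = datetime(2021, 7, 31, 0, 0, tzinfo=timezone.utc)
status_two_weeks_before = expiry_status(
    EXPIRES_AT, datetime(2021, 7, 17, 0, 0, tzinfo=timezone.utc)
)
```

A monitor could page or open a ticket whenever the status is anything other than "ok", which would give the owning team time to confirm that the renewed licence is actually loaded.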

A renewed licence had been updated within the service’s secret repository by the responsible team prior to July 31st, but the service had not been restarted to pick up the new licence values.

Contributing Factors

Four other factors contributed to this issue:

  • The barcode-extraction service swallowed the “license expired” error; while a warning message was output into service logs, the service APM tracing did not report a degradation in overall error rate. Due to this, our service-level error rate monitors did not trigger; it also lengthened the investigation as it was more difficult to pinpoint the source of discrepancy.
  • Due to lower volume out of hours (Saturday AM), our overall result monitoring took a lengthier than expected period of time to detect and notify anomalous levels of document flagging, compared to if this incident had occurred during a higher volume period.
  • The time taken to triage the issue and update the customer-facing status page was too long, and should have been prioritised in the response.
  • The on-call team did not have immediate access to the licence server for this component, nor earlier visibility of the potential expiry date.
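The first factor above, a swallowed error, can be sketched as follows. This is a hypothetical reconstruction for illustration, not the barcode-extraction service's real code: `decode_pdf417`, `LicenceExpiredError` and the `Counter` standing in for an error-rate metric are all assumed names.

```python
import logging
from collections import Counter
from typing import Optional

logger = logging.getLogger("barcode-extraction")


class LicenceExpiredError(RuntimeError):
    """Hypothetical error raised when the library licence has expired."""


def decode_pdf417(image: bytes, licence_valid: bool) -> dict:
    """Stand-in for the third-party decoder (hypothetical behaviour)."""
    if not licence_valid:
        raise LicenceExpiredError("license expired")
    return {"document_number": "X123"}


def extract_swallowing(image: bytes, licence_valid: bool) -> Optional[dict]:
    """Failure mode: log a warning and report 'no barcode found'.

    The request still looks successful, so error-rate monitors stay quiet.
    """
    try:
        return decode_pdf417(image, licence_valid)
    except LicenceExpiredError:
        logger.warning("license expired")
        return None


def extract_propagating(image: bytes, licence_valid: bool, errors: Counter) -> dict:
    """Alternative: count the failure and re-raise so the request is marked as failed."""
    try:
        return decode_pdf417(image, licence_valid)
    except LicenceExpiredError:
        errors["licence_expired"] += 1
        raise
```

In the first variant a caller cannot distinguish "this document has no barcode" from "the decoder is broken", which matches the missing-barcode flagging seen during the incident. In the second, the failure reaches whatever error-rate monitoring wraps the request.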

Timeline

  • 00:00 UTC: the 3rd party licence expires
  • 01:27 UTC: Onfido’s on-call team is paged, as a monitor has detected anomalous results for document checks. The responding team begins an investigation.
  • 02:30 UTC: the on-call team identifies the root cause
  • 02:40 UTC: the on-call team receives a further customer escalation from product support
  • 03:00 UTC: the team identifies and tests appropriate temporary licence credentials
  • 03:10 UTC: the team deploys the temporary licence credentials across all Onfido production environments. From this point, pass rates return to normal and the underlying log errors from barcode-extraction subside.
  • 03:20 UTC: the response team adds temporary log monitoring to escalate if any issues occur with the temporary licence credentials
  • 03:50 UTC: after confirming through monitoring that service has returned to normal, the publicly-facing incident is marked as resolved
  • 09:00 UTC: the responsible team for the barcode-extraction service clarify that the licence update was previously available, and rotate the service deployment to pick up the permanent new licence credentials

Remedies

During the response, we:

  • Applied temporary licence credentials to restore the service to normal functionality
  • Added monitoring to catch any potential future degradation in service

Following on from this incident, we will:

  • Correct the erroneous handling of licence errors within the barcode-extraction component, to ensure that monitoring triggers appropriately in this failure scenario.
  • Explore our commercial arrangement for this component and identify whether we can migrate to a perpetual licensing agreement, thus simplifying ongoing maintenance
  • Explore tooling for our secret management system to proactively alert on discrepancies between secret repository and live application state
  • Review if any other system components may be subject to the same failure scenario. As we make very limited use of 3rd party, commercially-licenced libraries, it’s unlikely this scenario is applicable to other current services within our backend.
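
One possible shape for the secret-discrepancy alerting mentioned above is to compare a fingerprint of the secret stored in the repository with the fingerprint each running instance reports for the value it loaded at start-up. This is a minimal sketch under that assumption; the function names and the way instances expose their fingerprint are hypothetical, and no specific secret management product is implied.

```python
import hashlib
from typing import Dict, List


def fingerprint(secret: str) -> str:
    """Return a short SHA-256 fingerprint so the secret itself is never exposed."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]


def find_stale_instances(repository_secret: str, live_fingerprints: Dict[str, str]) -> List[str]:
    """Return the names of instances whose loaded secret differs from the repository."""
    expected = fingerprint(repository_secret)
    return sorted(
        name for name, loaded in live_fingerprints.items() if loaded != expected
    )


# Illustrative data: one instance was restarted after the update, one was not.
live = {
    "instance-a": fingerprint("renewed-licence"),
    "instance-b": fingerprint("expired-licence"),
}
stale = find_stale_instances("renewed-licence", live)
```

Any non-empty result indicates an instance that needs a restart or redeploy before the old value expires.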
Posted Aug 19, 2021 - 14:23 UTC

Resolved
This issue is now resolved.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Jul 31, 2021 - 03:50 UTC
Monitoring
We have implemented a fix for this issue.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet; we appreciate your patience during this incident.

We will provide a further update in 30 minutes.
Posted Jul 31, 2021 - 03:31 UTC
Identified
The issue has been identified and a fix is being implemented.

We will provide a further update in 15 minutes.
Posted Jul 31, 2021 - 03:15 UTC
Investigating
We're currently experiencing elevated error rates impacting document checks.

We will provide an update in 30 minutes.
Posted Jul 31, 2021 - 02:48 UTC
This incident affected: USA (us.onfido.com) (Document Verification), Canada (ca.onfido.com) (Document Verification), and Europe (onfido.com) (Document Verification).