Device Intelligence Report Failure to run
Incident Report for Onfido
Postmortem

Summary

A product change released on Device Intelligence caused reports to be withdrawn from 14:48 to 15:26 UTC on January 24. Both Studio Workflows and Classic Checks were impacted. Device Intelligence reports running inside Studio workflows caused those workflows to transition to the Error state. Classic Checks that included these reports still completed, based on the outcome of the remaining reports.

During that period, 12% of running Studio workflows and 9% of Checks were affected.

Root Causes

The primary cause of the incident was a backwards-incompatible change in a timestamp field during a Device Intelligence report calculation update. Our alerting system detected this immediately, but due to a misconfiguration, it wasn’t marked as an urgent issue. This misclassification occurred because the change only impacted a small subset of the entire report set.
Despite the alert not being flagged as urgent, one team promptly began investigating. Once they escalated the issue to the team responsible for the change, a rollback was initiated, and the issue was resolved swiftly.
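
As an illustration of how a consumer can tolerate a change in a timestamp field, here is a minimal sketch. It assumes, purely hypothetically, that the field could arrive either as epoch seconds or as an ISO 8601 string; the report does not state which formats were involved, and the function name parse_timestamp is not taken from any Onfido system.

```python
from datetime import datetime, timezone
from typing import Union


def parse_timestamp(value: Union[int, float, str]) -> datetime:
    """Return a timezone-aware UTC datetime.

    Hypothetical helper: accepts epoch seconds or an ISO 8601 string,
    so that a producer switching between the two does not break readers.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("timestamp must be a number or a string, not bool")

    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)

    if isinstance(value, str):
        text = value.strip()
        # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        parsed = datetime.fromisoformat(text)
        # Assumption: timestamps without an offset are treated as UTC.
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)

    raise TypeError(f"unsupported timestamp type: {type(value).__name__}")
```

A reader written this way accepts both the old and the new representation during a gradual deployment, which is one way to keep such a change backwards-compatible.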

Timeline

  • 14:48 GMT: Gradual deployment begins
  • 14:58 GMT: Deployment completes
  • 14:59 GMT: Internal alert triggered with non-urgent priority
  • 15:03 GMT: Investigation team begins analysis
  • 15:19 GMT: Responsible team informed and starts immediate investigation
  • 15:25 GMT: Rollback initiated
  • 15:27 GMT: Incident resolved

Remedies

  • Completed Actions:

    • Alert Priority Adjustment: Fixed the priority of alarms so that similar issues are marked as urgent and activate on-call;
    • Alarm Sensitivity: Tuned other alarms to be more sensitive to errors, ensuring they trigger during deployment.
  • Ongoing Actions:

    • Deployment Monitoring: Implement more granular monitoring during deployments to catch backwards-incompatible changes earlier in the process;
    • Timestamp Standardization: Develop and enforce strict guidelines for timestamp handling across all systems to prevent future compatibility issues;
    • Workflow Recovery: Implement mechanisms to recover workflows during temporary failures instead of cancelling them;
    • Device Intelligence Report Handling: Ensure Device Intelligence reports are not withdrawn but moved to a Dead Letter Queue (DLQ) for potential recovery and investigation.
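
The Dead Letter Queue idea in the last item can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation: ReportProcessor, DeadLetter, and the handler signature are hypothetical names, and a production system would typically use a durable queue rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Payload = Dict[str, Any]


@dataclass
class DeadLetter:
    """A failed report payload kept for later inspection or replay."""
    payload: Payload
    error: str


@dataclass
class ReportProcessor:
    """Hypothetical processor that parks failures instead of withdrawing them."""
    handler: Callable[[Payload], Payload]
    dead_letters: List[DeadLetter] = field(default_factory=list)

    def process(self, payload: Payload) -> Optional[Payload]:
        try:
            return self.handler(payload)
        except Exception as exc:
            # Keep the payload and the error so the report can be recovered.
            self.dead_letters.append(DeadLetter(payload, repr(exc)))
            return None

    def replay(self) -> List[Payload]:
        """Retry parked payloads, for example after a rollback.

        Payloads that fail again are parked again by process().
        """
        pending, self.dead_letters = self.dead_letters, []
        results = []
        for letter in pending:
            result = self.process(letter.payload)
            if result is not None:
                results.append(result)
        return results
```

With this shape, reports that fail during a bad deployment stay available, and calling replay() once the rollback is complete gives them a second chance to finish.
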
Posted Jan 31, 2025 - 14:43 UTC

Resolved
All Device Intelligence reports run from 14:48 to 15:26 UTC on January 24 failed to complete. The Device Intelligence reports running on Studio workflows were withdrawn, along with any other reports being executed in parallel with them.
Posted Jan 24, 2025 - 14:48 UTC