COE - April 22nd - Large-Scale Outage due to Inefficient Database Query

TL;DR

On April 22nd we experienced a larger scale outage when we deployed an inefficient query to our explorer API. This caused spillover for other APIs depending on the explorer API database. The issue affected most customers for a short amount of time and was addressed within minutes.

Exhibit 1: Spikes in HTTP500 response

Root Cause

A new feature was added to the explorer API and it was implemented using a subquery. This query worked fine when testing against the sparse data in the staging API.

‌

However, when deployed to prod this causes a lot of pressure on the database causing an outage.

‌

This outage had spillover effects to other APIs depending on the same database. Most notably the “internal” API driving project id validation in most backend services.

What went well

Many alarms fired
On-call and other team members jumped at the issue quick
Issue was resolved quick
Customers were informed via Status page

What didn’t go well

Error attribution to Cloudflare API could have been faster
Issue wasn’t caught in staging environment
Services went down despite internal API not being critical for the services to be operational

Action Items

Monitoring: integrate Cloudflare monitoring into Grafana - done
Resilience: cerberus Rust crate and services depending on it should if possible be able to operate even when internal API is down
Testability: deploy the prod dataset to the staging environment to be able to capture inefficient queries easier

Discarded Action items

Backup/Failover: Cloud API should fall back to a cold backup of the database or a 2nd database - discarded as too early

Posted May 02, 2023 - 16:50 UTC

Resolved

Resolved but still root causing

Posted Apr 22, 2023 - 11:35 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Apr 22, 2023 - 10:10 UTC

This incident affected: Relay, Verify API (for WalletConnect Sign), Blockchain API, Push Server (prev. Echo Server), and Notify Server.