WalletConnect internal API causing widespread availability issues
Incident Report for WalletConnect
Postmortem

COE - April 22nd - Large-Scale Outage due to Inefficient Database Query

TL;DR

On April 22nd we experienced a large-scale outage after deploying an inefficient query to our explorer API. The load spilled over to other APIs that depend on the explorer API database. The issue affected most customers for a short time and was addressed within minutes.

Exhibit 1: Spikes in HTTP 500 responses

Root Cause

A new feature was added to the explorer API and it was implemented using a subquery. This query worked fine when testing against the sparse data in the staging API.

However, when deployed to prod, the query put heavy pressure on the database and caused an outage.

This outage spilled over to other APIs that depend on the same database, most notably the “internal” API that drives project ID validation in most backend services.

What went well

  • Many alarms fired
  • On-call and other team members jumped on the issue quickly
  • Issue was resolved quickly
  • Customers were informed via the status page

What didn’t go well

  • Error attribution to Cloudflare API could have been faster
  • Issue wasn’t caught in the staging environment
  • Services went down despite internal API not being critical for the services to be operational

Action Items

  • Monitoring: integrate Cloudflare monitoring into Grafana - done
  • Resilience: the cerberus Rust crate and the services depending on it should, where possible, keep operating even when the internal API is down
  • Testability: deploy the prod dataset to the staging environment so that inefficient queries are caught more easily
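
As an illustration of the resilience item, here is a minimal sketch of project ID validation that keeps answering while the internal API is unreachable. It is not the cerberus crate's API and is written in Python only for brevity. ProjectIdValidator, InternalApiUnavailable, the lookup callable, the TTL and the fail-open default are all hypothetical names and choices made for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class InternalApiUnavailable(Exception):
    """Hypothetical error raised when the internal API cannot be reached."""


@dataclass
class ProjectIdValidator:
    # Hypothetical callable that asks the internal API whether an ID is valid.
    lookup: Callable[[str], bool]
    # How long a cached answer counts as fresh (assumed value).
    ttl_seconds: float = 300.0
    # Answer for unknown IDs while the internal API is down (assumed policy).
    fail_open: bool = True
    # Injectable clock so the behaviour can be tested without sleeping.
    clock: Callable[[], float] = time.monotonic
    # project_id -> (last known answer, time it was stored)
    _cache: Dict[str, Tuple[bool, float]] = field(default_factory=dict)

    def is_valid(self, project_id: str) -> bool:
        now = self.clock()
        cached = self._cache.get(project_id)
        # Fresh cache entry: answer without touching the internal API.
        if cached is not None and now - cached[1] < self.ttl_seconds:
            return cached[0]
        try:
            result = self.lookup(project_id)
        except InternalApiUnavailable:
            # Internal API is down: prefer a stale answer over failing.
            if cached is not None:
                return cached[0]
            # Nothing cached: apply the configured degraded-mode policy.
            return self.fail_open
        self._cache[project_id] = (result, now)
        return result
```

With this shape, an outage of the internal API degrades validation to last-known answers instead of taking the calling service down. Whether unknown IDs should be accepted (fail open) or rejected (fail closed) during an outage is a policy decision this sketch leaves to the caller.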

Discarded Action items

  • Backup/Failover: Cloud API should fall back to a cold backup of the database or a 2nd database - discarded as too early
Posted May 02, 2023 - 16:50 UTC

Resolved
The issue is resolved; root cause analysis is still in progress.
Posted Apr 22, 2023 - 11:35 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 22, 2023 - 10:10 UTC
This incident affected: Relay, Push Server (Echo server), Notify Server, RPC, History API (for WalletConnect Push and Chat), and Verify API (for WalletConnect Sign/Auth).