Widespread Outage
Incident Report for WalletConnect
Postmortem

TL;DR: A DDoS attack exploited a bug in our caching policy and pushed the production Supabase Cloud DB to its CPU limits, taking down an authentication database and causing partial outages for all customers.

Events

  • The DDoS attack lasted ~45 minutes, starting at 04:45 CET and ending at 05:30 CET.
  • The Relay kept Supabase busy until 07:45 CET
  • Total downtime was 3 hours
  • Ivan found the issue; Ilja and Tom jumped in and asked Chad to increase our CPU limits
  • Impact

    • Unhealthy relay
    • Cloud app down
    • Web3Modal requests timing out

Root Cause

A DDoS attack distributed its load evenly across VPS providers, focused on a single route (/w3m/v1/getMobileListings), and circumvented our cache policy by appending new query params to each request (a cache-key normalization sketch follows the context list below). The Explorer API was hit ~21M times in 15 minutes.

More context below 👇

  • The same `projectId` was used across the IPs involved in the DDoS
  • The IPs were flagged as threat/proxy/anonymizer by Cloudflare
  • The attack was coordinated at the same time across different servers and regions
  • It focused on a single route

  1. The Supabase Cloud DB got overwhelmed and hit its CPU limits
  2. The Relay entered a retry loop because its queries were timing out, keeping the Supabase Cloud DB CPU pinned at its limit
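
A common mitigation for this kind of cache-busting is to build the cache key from an allowlist of known query parameters, so that appending arbitrary params can no longer force cache misses. Below is a minimal TypeScript sketch; the allowlisted parameter names are illustrative assumptions, not the Explorer API's actual schema.

```typescript
// Sketch: derive a normalized cache key for /w3m/v1/getMobileListings so that
// unknown or reordered query params cannot bypass the cache.
// ALLOWED_PARAMS is an assumption for illustration, not the real schema.
const ALLOWED_PARAMS = ["projectId", "page", "entries", "search"];

function cacheKeyFor(url: URL): string {
  const normalized = new URL(url.origin + url.pathname);
  for (const name of ALLOWED_PARAMS) {
    const value = url.searchParams.get(name);
    if (value !== null) {
      normalized.searchParams.set(name, value);
    }
  }
  // Sorting keeps parameter order from producing distinct keys.
  normalized.searchParams.sort();
  return normalized.toString();
}

// Both requests below map to the same cache key, so appended junk params
// no longer translate into extra backend hits.
const a = cacheKeyFor(new URL("https://example.com/w3m/v1/getMobileListings?projectId=abc&junk=1"));
const b = cacheKeyFor(new URL("https://example.com/w3m/v1/getMobileListings?junk=999&projectId=abc"));
console.log(a === b); // true
```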

What could we have done better?

  1. The escalation path wasn’t clear enough (e.g. Ilja didn’t know how to escalate or page other folks)

    • Override to a team member when the on-call person is unavailable (e.g. on a flight)
    • A non-rota individual should have enough OpsGenie access to trigger the “Escalate to All” button in case of a repeat scenario

  2. Rate-limiting on the Explorer API (a sketch follows this list)
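
For the rate-limiting item, a per-key token bucket (keyed by projectId or client IP, for example) is one straightforward shape. The sketch below is illustrative only; the capacity, refill rate, and key choice are assumptions rather than what is actually deployed.

```typescript
// Sketch of per-key token-bucket rate limiting for the Explorer API.
// Capacity and refill rate are placeholder values, not production settings.
interface Bucket {
  tokens: number;
  lastRefill: number; // ms timestamp of the last refill
}

const CAPACITY = 100;       // maximum burst size
const REFILL_PER_SEC = 50;  // sustained requests per second
const buckets = new Map<string, Bucket>();

function allowRequest(key: string, now: number = Date.now()): boolean {
  const bucket = buckets.get(key) ?? { tokens: CAPACITY, lastRefill: now };

  // Refill proportionally to elapsed time, capped at capacity.
  const elapsedSec = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsedSec * REFILL_PER_SEC);
  bucket.lastRefill = now;

  const allowed = bucket.tokens >= 1;
  if (allowed) {
    bucket.tokens -= 1;
  }
  buckets.set(key, bucket);
  return allowed; // false means the caller should respond with 429
}
```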

Action items

Short term

  • [x] [Cali] Publish COE (depends on the summary being more up to date)
  • [x] [Cali] Query param validation
  • [ ] [Derek] Write a guide on how to page on-call
  • [ ] [Xav to find owner] 2nd Layer Cache for Cerberus
  • [ ] [Cali] Configure a query timeout on the Supabase client if possible (see the sketch below)
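
For the query-timeout item, supabase-js accepts a per-query AbortSignal via .abortSignal(); combined with AbortSignal.timeout() (Node 18+) this bounds how long a single query can wait, so callers fail fast instead of piling retries onto an overloaded database. A minimal sketch follows; the table and column names are placeholders, not the real schema, and a server-side Postgres statement_timeout would be the complementary setting.

```typescript
import { createClient } from "@supabase/supabase-js";

// Sketch: abort a query client-side after 2 seconds.
// Table/column names are placeholders, not the actual schema.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

async function getProjectKeys(projectId: string) {
  const { data, error } = await supabase
    .from("api_keys")
    .select("*")
    .eq("project_id", projectId)
    .abortSignal(AbortSignal.timeout(2_000)); // fail fast under load

  if (error) {
    // Treat aborts/timeouts as failures and back off instead of retrying hot.
    throw error;
  }
  return data;
}
```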

Mid-/long-term

Posted Sep 20, 2023 - 16:16 UTC

Resolved
An API we leverage for API key management was down and response times surged. This affected many upstream services consuming this API. We will share a postmortem soon.
Posted Sep 16, 2023 - 04:00 UTC