Widespread Outage
Incident Report for Reown
Postmortem

TL;DR Due to a bug in our caching policy, we hit CPU limits on the production Supabase Cloud DB. This took down an authentication database and caused partial outages for all customers.

Events

  • The DDoS attack lasted ~45 minutes, starting at 04:45 AM CET and ending at 05:30 AM CET.
  • Relay kept Supabase busy until 07:45 AM CET.
  • Total downtime was 3 hours.
  • Ivan found the issue; Ilja and Tom jumped in and asked Chad to increase our CPU limits.
  • Impact:

    • Unhealthy relay
    • Cloud app down
    • Web3Modal requests timing out

Root Cause A DDoS attack in which the load was evenly distributed across VPS providers. It focused on a single route, /w3m/v1/getMobileListings, and circumvented our cache policy by appending new query params. The Explorer API was hit ~21M times in 15 minutes.

  1. More context below 👇
    * Same `projectId` used across IPs that were involved in the DDoS
    * IPs flagged as threat/proxy/anonymizer by Cloudflare
    * Coordinated at the same time across different servers and regions
    * Focused on a single route
  2. Supabase Cloud DB got overwhelmed and hit CPU limits.
  3. Relay was stuck in a retry loop because queries kept timing out, which kept Supabase Cloud DB CPU at its limit.
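
The retry loop described above is the kind of load amplification that capped exponential backoff with jitter is commonly used to damp. The following is a minimal sketch of that general technique, not the relay's actual code: `call_with_backoff`, `backoff_delays`, and the `query` callable are hypothetical names, and it assumes a timed-out query surfaces as Python's built-in `TimeoutError`.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float, attempts: int, rng: random.Random) -> List[float]:
    """Full-jitter delays: the n-th is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0.0, min(cap, base * (2 ** n))) for n in range(attempts)]


def call_with_backoff(
    query: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `query`, retrying on TimeoutError with capped, jittered delays.

    Gives up after `max_attempts` calls so that a struggling database is
    not retried indefinitely. `query` is a hypothetical stand-in for the
    real database call.
    """
    rng = rng or random.Random()
    delays = backoff_delays(base, cap, max_attempts - 1, rng)
    for attempt in range(max_attempts):
        try:
            return query()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Bounding the number of attempts and spreading retries out with jitter means that many clients timing out at once do not all retry at the same instant. The specific values for `base`, `cap`, and `max_attempts` are placeholders and would need tuning against real traffic.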

What could we have done better?

  1. Escalation path wasn’t clear enough (e.g., Ilja didn’t know how to escalate or page other folks)
    * Override to a team member when the on-call person is unavailable (e.g., on a flight)
    * Non-rota individuals should have enough OpsGenie access to trigger the “Escalate to All” button in case of a repeat scenario
  2. Rate-limiting on Explorer API
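
One common way to implement rate-limiting of this kind is a token bucket per client key. The sketch below is illustrative only and is not the Explorer API's implementation: `TokenBucketLimiter` is a hypothetical name, and keying by `projectId` is an assumption suggested by the observation that the same `projectId` was used across the attacking IPs.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-key token bucket: `rate` tokens per second, bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        # key -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, key: str) -> bool:
        """Return True and spend one token if `key` is within its budget."""
        now = self.clock()
        tokens, last = self._buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return True
        self._buckets[key] = (tokens, now)
        return False
```

A request handler would call `limiter.allow(project_id)` and reject the request (for example with HTTP 429) when it returns `False`. This in-memory version keeps one bucket per key forever and is local to a single process; a real deployment would likely need eviction and a shared store, or enforcement at the edge.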

Action items

Short term

  • [x] [Cali] Publish COE (depends on the summary being more up to date)
  • [x] [Cali] Query param validation
  • [ ] [Derek] Write a guide on how to page on-call
  • [ ] [Xav to find owner] 2nd Layer Cache for Cerberus
  • [ ] [Cali] Configure query timeout on Supabase client if possible
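
To illustrate the query param validation item: since the attack bypassed the cache by appending new query params, one possible approach is to build the cache key only from an allow-list of known parameters. This is a sketch under assumptions, not the fix that was shipped. The names `canonical_cache_key`, `has_unknown_params`, and the contents of `ALLOWED_PARAMS` are hypothetical; the route's real parameters are not listed in this report.

```python
from typing import FrozenSet
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allow-list: the real parameters of the route are an assumption.
ALLOWED_PARAMS: FrozenSet[str] = frozenset({"projectId", "page", "entries"})


def canonical_cache_key(url: str, allowed: FrozenSet[str] = ALLOWED_PARAMS) -> str:
    """Build a cache key from the path plus allow-listed, sorted query params."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Unknown params are dropped and the rest sorted, so extra or reordered
    # params map to the same key.
    kept = sorted((k, v) for k, v in pairs if k in allowed)
    query = urlencode(kept)
    return f"{parts.path}?{query}" if query else parts.path


def has_unknown_params(url: str, allowed: FrozenSet[str] = ALLOWED_PARAMS) -> bool:
    """True when the request carries any parameter outside the allow-list."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return any(k not in allowed for k, _ in pairs)
```

With this normalization, `/w3m/v1/getMobileListings?projectId=abc&page=1` and the same request with an extra random parameter appended produce the same cache key. Whether to silently ignore unknown params or reject such requests outright (using something like `has_unknown_params`) is a design choice this report does not specify.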

Mid-/long-term

Posted Sep 20, 2023 - 16:16 UTC

Resolved
An API we leverage for API key management was down and response times surged. This affected many downstream services that consume this API. We will share a postmortem soon.
Posted Sep 16, 2023 - 04:00 UTC