TL;DR
Due to a bug in our caching policy, we hit CPU limits on the production Supabase Cloud DB, which took down an authentication database and caused partial outages for all customers.
Events
Impact
Root Cause
A DDoS attack in which the load was evenly distributed across VPS providers, focused on a single route (/w3m/v1/getMobileListings), and circumvented our cache policy by appending new query params (see the cache-key sketch after the list below). The Explorer API was hit ~21M times in 15 minutes. Indicators that the traffic was a coordinated attack:
* Same `projectId` used across IPs that were involved in the DDoS
* IPs flagged as threat/proxy/anonymizer by Cloudflare
* Traffic coordinated at the same time across different servers and regions
* Focused on a single route
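For illustration, the minimal TypeScript sketch below shows the gap being exploited: if the cache key includes the raw query string, every request carrying a novel junk parameter is a cache miss and falls through to the origin, whereas normalizing the key to a whitelist of known parameters closes that hole. The parameter names, URL, and function name here are assumptions for the example, not our actual implementation.

```typescript
// Assumed list of query parameters the route actually supports (illustrative only).
const KNOWN_PARAMS = ["page", "entries", "search", "chains"];

// Build a cache key from the path plus only the recognized query parameters,
// so appending arbitrary extra params can no longer force a cache bypass.
function cacheKeyFor(url: URL): string {
  const normalized = new URLSearchParams();
  for (const name of KNOWN_PARAMS) {
    const value = url.searchParams.get(name);
    if (value !== null) normalized.set(name, value);
  }
  return `${url.pathname}?${normalized.toString()}`;
}

// Both requests collapse to the same cache key; a naive full-URL key
// would treat the second one as a brand-new, uncached entry.
cacheKeyFor(new URL("https://api.example.com/w3m/v1/getMobileListings?page=1"));
cacheKeyFor(new URL("https://api.example.com/w3m/v1/getMobileListings?page=1&x=123"));
```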
What could we have done better?
* Allow an override/handover to another team member when the on-call person is unavailable (e.g. on a flight)
* Non-rota individuals should have enough OpsGenie access to trigger the “Escalate to All” button in case of a repeat scenario
* Rate-limiting on the Explorer API (see the sketch below)
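As a hedged sketch of what rate-limiting on the Explorer API could look like, the fixed-window counter below caps requests per `projectId`. The window size, request ceiling, and function names are illustrative assumptions, not a description of what we shipped.

```typescript
// Assumed limits for the example: 1-minute window, 1,000 requests per project.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 1_000;

// Per-projectId counters for the current window (in-memory for simplicity;
// a shared store would be needed across multiple API instances).
const counters = new Map<string, { windowStart: number; count: number }>();

// Returns true if the request should be allowed, false once the project
// has exceeded its quota for the current window.
function allowRequest(projectId: string, now = Date.now()): boolean {
  const entry = counters.get(projectId);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    counters.set(projectId, { windowStart: now, count: 1 });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}
```

Because the attack reused the same `projectId` across many IPs, keying the limiter on `projectId` (rather than IP alone) would have throttled this particular pattern.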
Short term
Mid-/long-term