On April 22nd we experienced a larger-scale outage after deploying an inefficient query to our explorer API. The resulting database load spilled over to other APIs that depend on the explorer API database. The issue affected most customers for a short time and was addressed within minutes.
Exhibit 1: Spikes in HTTP 500 responses
A new feature was added to the explorer API, implemented using a subquery. This query performed fine when tested against the sparse data in the staging API.
However, once deployed to production, it put heavy pressure on the database, causing an outage.
This outage had spillover effects on other APIs depending on the same database, most notably the “internal” API, which drives project id validation in most backend services.
The cerberus Rust crate, and the services depending on it, should if possible be able to operate even when the internal API is down.
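One way a client could keep operating while the internal API is down is to fall back on recently cached validation results. The following is a minimal sketch of that idea and not the actual cerberus implementation: `Upstream`, `CachedValidator`, `validate`, and `max_stale` are hypothetical names, and the upstream call is abstracted as a closure so the example stays self-contained.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical outcome of asking the internal API about a project id.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Upstream {
    Valid,
    Invalid,
    /// The API could not answer (for example an HTTP 500 or a timeout).
    Unavailable,
}

/// Hypothetical validator that remembers past answers and serves them
/// when the upstream API is unavailable.
struct CachedValidator<F: Fn(&str) -> Upstream> {
    fetch: F,
    cache: HashMap<String, (bool, Instant)>,
    /// Assumed policy: how old a cached answer may be and still be used.
    max_stale: Duration,
}

impl<F: Fn(&str) -> Upstream> CachedValidator<F> {
    fn new(fetch: F, max_stale: Duration) -> Self {
        Self { fetch, cache: HashMap::new(), max_stale }
    }

    /// Returns `Some(valid)` when an answer is known, or `None` when the
    /// upstream is unavailable and no sufficiently fresh answer is cached.
    fn validate(&mut self, project_id: &str) -> Option<bool> {
        match (self.fetch)(project_id) {
            Upstream::Valid => {
                self.cache.insert(project_id.to_string(), (true, Instant::now()));
                Some(true)
            }
            Upstream::Invalid => {
                self.cache.insert(project_id.to_string(), (false, Instant::now()));
                Some(false)
            }
            // Upstream is down: serve the cached answer if it is fresh enough.
            Upstream::Unavailable => self
                .cache
                .get(project_id)
                .filter(|(_, stored_at)| stored_at.elapsed() <= self.max_stale)
                .map(|(valid, _)| *valid),
        }
    }
}
```

With this shape, project ids seen before the outage keep validating for up to `max_stale`, while ids never seen before return `None`, leaving it to the calling service to decide whether to reject or allow such requests.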