TL;DR On Monday, Nov 20 2023, from 7:30 pm to 10:30 pm CET, the Blockchain API was partially down because a downstream provider was rate limiting us and a bug in our failover mechanism prevented failover from triggering.
Summary
The issue was found by a WalletConnect operator and ~50% of EVM RPC calls were affected.
The issue was mitigated by fixing the failover logic and, at the same time, by the RPC provider lifting the rate limit (which was an issue on their end).
Root Cause
We have failover logic that is based on HTTP status codes. It tracks the number of non-200 HTTP responses from a provider and uses that as an input for deciding whether the RPC provider is serving requests successfully.
The RPC provider in question reports errors at the JSON-RPC level rather than the HTTP level, so the failover logic never registered them as failures.
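The gap described above can be illustrated with a minimal Python sketch of a health check that looks at both layers. This is not our actual implementation: the function name `is_provider_failure` and the set `PROVIDER_FAILURE_CODES` are hypothetical, and the error code `-32005` ("limit exceeded" in EIP-1474) is only an assumption about how a provider might signal rate limiting.

```python
import json

# Assumed for illustration: JSON-RPC error codes that point to a provider-side
# problem (such as rate limiting) rather than a bad client request.
# -32005 is "limit exceeded" in EIP-1474; real providers may use other codes.
PROVIDER_FAILURE_CODES = {-32005}


def is_provider_failure(http_status: int, body: str) -> bool:
    """Return True if this response should count against the provider's health."""
    # HTTP-level check: this is the only signal the original logic looked at.
    if http_status != 200:
        return True

    # The HTTP layer looks healthy, so inspect the JSON-RPC payload as well.
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True  # an unparsable body is treated as a failure

    if not isinstance(payload, dict):
        return False  # batch responses are out of scope for this sketch

    error = payload.get("error")
    if not isinstance(error, dict):
        return False  # a normal result, or no structured error object

    # Only provider-side codes count; client errors (e.g. invalid params) do not.
    return error.get("code") in PROVIDER_FAILURE_CODES
```

With a check like this, an HTTP 200 response carrying a rate-limit error in its JSON-RPC body would count toward the failover threshold, while an ordinary client error such as invalid params would not.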
5 Whys
Why was the service down?
Because we were rate limited by a downstream RPC provider.
Why were you not handling this issue properly and failing over to another provider?
Our failover logic was based on the provider responding with a non-200 HTTP status in such cases.
Why did you not know that the provider handles this differently?
We did.
Why did you not handle this accordingly then?
We missed this during a refactor.
Why did you miss this during a refactor?
We didn’t create an assigned issue to track it, so it was forgotten.
What could we have done better?
Action items
Log any other error messages temporarily to discover which other RPC providers do this @Chris Smith ✅
Wrap all calls into a Request ID and create a log span so it’s easier to discover issues for a specific request ID @Max Kalashnikov #372 🏗️