Partial RPC Outage
Incident Report for WalletConnect
Postmortem

TL;DR On Monday, Nov 20 2023, from 7:30 pm CET to 10:30 pm CET, the Blockchain API was partially down because a downstream provider was rate limiting us and a bug in our failover mechanism prevented the failover from triggering.

Summary

The issue was found by a WalletConnect operator. About 50% of EVM RPC calls were affected.

The issue was mitigated by fixing the failover logic and, at the same time, by the RPC provider no longer rate limiting us (the rate limiting was an issue on their end).

Root Cause

We have failover logic that is based on HTTP status codes. It tracks the number of non-200 HTTP status codes returned by a provider and uses that as the input for deciding whether the RPC provider is serving requests successfully.

The RPC provider in question doesn’t report errors at the HTTP level but at the JSON-RPC level, so its rate limit errors were never counted as failures and the failover did not trigger.
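The gap described above can be illustrated with a small sketch. It is a hypothetical reconstruction, not our actual failover code: the `ProviderHealth` class, the error code `-32000`, and the error message are all assumptions made for illustration. It shows that a tracker which only looks at HTTP status codes records zero failures when every rate-limited reply arrives as HTTP 200 with a JSON-RPC error object in the body.

```python
import json
from dataclasses import dataclass


@dataclass
class ProviderHealth:
    """Hypothetical status-code-only health tracker (illustration only)."""

    total: int = 0
    failures: int = 0

    def record(self, http_status: int) -> None:
        # Only the HTTP status is inspected; the response body is ignored.
        self.total += 1
        if http_status != 200:
            self.failures += 1

    def failure_ratio(self) -> float:
        return self.failures / self.total if self.total else 0.0


# Assumed shape of a rate-limited reply: HTTP 200 plus a JSON-RPC error object.
# The code and message are illustrative, not the provider's real payload.
RATE_LIMITED_BODY = json.dumps(
    {
        "jsonrpc": "2.0",
        "id": 1,
        "error": {"code": -32000, "message": "rate limit exceeded"},
    }
)

health = ProviderHealth()
for _ in range(10):
    # Each rate-limited reply still carries HTTP 200, so no failure is counted.
    health.record(200)

# The error is only visible once the body is parsed.
has_rpc_error = "error" in json.loads(RATE_LIMITED_BODY)
```

With these assumptions the provider looks fully healthy to the tracker even though every response in the loop was an error, which is why no failover happened.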

5 Whys

Why was the service down?

Because we were rate limited by a downstream RPC provider.

Why were you not handling this issue properly and failing over to another provider?

Our failover logic assumed that the provider would respond with a non-200 HTTP status in such cases.

Why did you not know that the provider handles this differently?

We did.

Why did you not handle this accordingly then?

We missed this during a refactor.

Why did you miss this during a refactor?

We didn’t create an assigned issue to track it, so it was forgotten.

What could we have done better?

Action items

  1. Rewrite the Pokt RPC rate limit response into HTTP 429 @Chris Smith ✅
  2. Temporarily log any other error messages to discover which other RPC providers behave this way @Chris Smith ✅
    1. Do an audit of these in a few weeks
  3. Attach a request ID to every call and create a log span so it’s easier to trace issues for a specific request ID @Max Kalashnikov #372 🏗️
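The action items could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the deployed fix: `effective_status`, `handle`, and the `RATE_LIMIT_MARKER` text are hypothetical names, and matching on the error message is an assumed detection strategy since this report does not give the provider's actual error payload. The sketch rewrites a JSON-RPC rate limit error into HTTP 429, logs any other JSON-RPC error for a later audit, and attaches a request ID to every log line.

```python
import json
import logging
import uuid

# Hypothetical marker; the provider's real error text is not given in this report.
RATE_LIMIT_MARKER = "rate limit"


def extract_error(body: str):
    """Return the JSON-RPC error object from a response body, or None."""
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    error = payload.get("error") if isinstance(payload, dict) else None
    return error if isinstance(error, dict) else None


def effective_status(http_status: int, body: str) -> int:
    """Map a JSON-RPC-level rate limit error onto HTTP 429; otherwise keep the status."""
    if http_status != 200:
        return http_status
    error = extract_error(body)
    if error and RATE_LIMIT_MARKER in str(error.get("message", "")).lower():
        return 429
    return http_status


def handle(http_status: int, body: str, logger: logging.Logger) -> int:
    """Normalize the provider status and log with a per-request ID."""
    # The adapter adds request_id to every record; a formatter can print it
    # with %(request_id)s.
    log = logging.LoggerAdapter(logger, {"request_id": uuid.uuid4().hex})
    status = effective_status(http_status, body)
    error = extract_error(body)
    if status != http_status:
        log.warning("rewrote provider status %s to %s", http_status, status)
    elif error is not None:
        # Unclassified JSON-RPC errors are logged so they can be audited later.
        log.info("unclassified JSON-RPC error: %s", error.get("message"))
    return status
```

The status returned by `handle` can then be fed into the existing status-code-based failover tracking, so the failover logic itself does not need to know about JSON-RPC error bodies.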

Posted Nov 21, 2023 - 11:10 UTC

Resolved
The downstream RPC provider fixed the rate limit issue, and our code was fixed to be resilient against this in the future.
Posted Nov 20, 2023 - 21:33 UTC
Monitoring
A fix has been implemented and we're proceeding to deploy it.
Posted Nov 20, 2023 - 21:19 UTC
Identified
One of our downstream RPC providers is wrongly rate limiting us.
Our system is designed to be resilient against this, but unfortunately our exponential failover algorithm is not picking up the correct error code from the provider and hence doesn't detect that the provider is rate limiting us.

We are currently fixing this and deploying.

If that does not work, we will disable the downstream provider.
Posted Nov 20, 2023 - 20:58 UTC
This incident affected: RPC.