Partial outage of Blockchain API (RPC)
Incident Report for WalletConnect
Postmortem

TL;DR: From Nov 21, 12pm CET to Nov 22, 10am CET, the Blockchain API was partially down. Later that day, while we were remediating the incident, it was down for another hour.

Summary

Customers found the issue in both cases and we were internally alerted.

Root Cause

On Nov 20 we had a partial outage in which an RPC provider reported errors at the JSON-RPC level rather than at the HTTP level. One postmortem follow-up was to add logging to determine which other RPC providers do this, so that we can prevent this issue in the future.

The logic for this turned out to be flawed: it produced an internal WARN, but requests failed without us even responding over HTTP.

The root cause of the issue was in the response parsing. After rolling back, we rolled out the fix again, this time missing that the wrong content-type was set on the response, which prevented clients from reading it properly.
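
As an illustration of the kind of logging described above, here is a minimal sketch of inspecting a provider response for a JSON-RPC-level error without letting the inspection fail the request. It is not the code that runs in the Blockchain API; the function names, the logger name, and the log message are hypothetical. The key assumption is that a body which cannot be parsed is passed through untouched rather than treated as a failure.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("rpc_proxy")  # hypothetical logger name


def jsonrpc_error_in_body(body: bytes) -> Optional[dict]:
    """Return the first JSON-RPC error object found in `body`, or None.

    Malformed or non-JSON bodies yield None instead of raising, so the
    inspection cannot turn a working response into a failed request.
    """
    try:
        payload = json.loads(body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return None
    # A batch response is a list; a single response is an object.
    items = payload if isinstance(payload, list) else [payload]
    for item in items:
        if isinstance(item, dict) and isinstance(item.get("error"), dict):
            return item["error"]
    return None


def inspect_provider_response(provider: str, status: int, body: bytes) -> bytes:
    """Log a WARN if a provider sends an error with a 2xx status.

    The body is always returned unchanged so it can be forwarded as-is.
    """
    if 200 <= status < 300:
        error = jsonrpc_error_in_body(body)
        if error is not None:
            logger.warning(
                "provider %s returned HTTP %d with JSON-RPC error code %s",
                provider, status, error.get("code"),
            )
    return body
```

The design choice here is that logging is strictly an observer: whatever the parser does, the original bytes are what the client receives.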

What could we have done better?

  • This change shouldn’t have made it to production
  • We should have discovered this issue faster (our alerting is at the HTTP level, but we weren’t responding over HTTP here)
  • We should have been alerted to both issues before customers found out

Action items

  1. Make sure other issues of this kind still result in an HTTP response @Chris Smith
  2. ~~Extend integration tests to cover this type of request~~
  3. Fix the parsing bug and re-enable the logging ✅
  4. Use RPC in Canary so we are alerted to such issues before customers find out, e.g. an e2e UI canary for web3modal
  5. Ensure integration tests check the content-type of the response
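
For the last action item, the check an integration test could apply to a response header might look like the sketch below. It assumes the expected media type is `application/json`, which this report does not state explicitly; the helper name is hypothetical.

```python
from typing import Optional


def is_json_content_type(header_value: Optional[str]) -> bool:
    """Return True if a Content-Type header value denotes application/json.

    Parameters such as "; charset=utf-8" are ignored and the comparison is
    case-insensitive. A missing header counts as a failure.
    """
    if not header_value:
        return False
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/json"
```

An integration test would then assert `is_json_content_type(response_headers.get("Content-Type"))` for each RPC response it receives, alongside its existing checks on the body.
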
Posted Nov 23, 2023 - 04:49 UTC

Resolved
A fix was deployed. We will publish the postmortem shortly.
Posted Nov 22, 2023 - 10:34 UTC
Identified
We _think_ we found the culprit and are rolling back.
Posted Nov 22, 2023 - 10:22 UTC
Investigating
We are currently investigating this issue.
Posted Nov 22, 2023 - 10:04 UTC
This incident affected: RPC.