TL;DR

On December 15th our Blockchain API ran at only 90% availability for ~10 hours. A series of process failures during an infrastructure migration caused this partial outage.
The partial outage lasted ~10 hours. An internal operator discovered the issue while checking metrics; three operators root-caused and then mitigated it. Most Blockchain API users were affected.
Root Cause

Blockchain API relies on Prometheus to calculate which RPC providers to use. During the migration the app was updated with an incorrect Prometheus URL, breaking this functionality. As a result, more requests were sent to providers that could not handle them or that rate limited us.
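To make the failure mode concrete, here is a minimal sketch of Prometheus-driven provider selection. This is not the actual service code; the provider names and weight values are invented for illustration:

```python
import random

def pick_provider(weights: dict[str, float]) -> str:
    """Pick an RPC provider at random, proportionally to its weight."""
    providers = list(weights)
    return random.choices(providers, weights=[weights[p] for p in providers])[0]

# In the real service the weights would come from a Prometheus query
# (e.g. recent per-provider success rate). With a broken Prometheus URL
# that query fails, so traffic is no longer steered away from providers
# that are unhealthy or rate limiting us.
weights = {"provider-a": 0.7, "provider-b": 0.2, "provider-c": 0.1}
print(pick_provider(weights))
```

Once the weights stop updating, load distribution degenerates and overloaded providers start rejecting requests, which is what drove availability down to 90%.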
Why was the app updated with an incorrect Prometheus URL?
We had been migrating the app to a new AWS account and changed many variables, including this one.
Why was this not tested in staging?
We had attempted a zero-downtime migration in staging, which did not succeed, so we reverted to a migration with downtime, which we did not properly test in staging.
Why were we OK with the migration not being tested properly?
It was not clear to the team that basic functionality had not been tested in staging.
Why was this not caught earlier?
We did not check the metrics immediately after completing the migration.
Why was this not acted upon earlier?
We had a prior partial outage a few months ago caused by the same URL being updated incorrectly, so we knew the potential impact. However, when the operator acknowledged the issue it was unclear exactly what was broken.
What could we have done better?

- Fix the weights metric, which was broken in this case and has been broken before.
- Fix the availability metric, which is also broken.
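One concrete improvement along these lines would be a post-deploy smoke check that fails fast when the configured Prometheus URL does not answer. A hedged sketch follows; the URL is a placeholder and this is not our actual tooling, though `/-/ready` is Prometheus's standard readiness endpoint:

```python
import urllib.error
import urllib.request

def prometheus_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if Prometheus's readiness endpoint responds with HTTP 200."""
    try:
        # /-/ready answers 200 once Prometheus is ready to serve queries.
        with urllib.request.urlopen(f"{base_url}/-/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Example: gate the final migration step on the check (URL is a placeholder).
# if not prometheus_reachable("http://prometheus.internal:9090"):
#     raise SystemExit("Prometheus unreachable: aborting cutover")
```

Running a check like this immediately after cutover would have surfaced the bad URL in seconds instead of hours.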