TL;DR

On December 15th our Blockchain API ran at only 90% availability for ~10 hours. A series of process failures during an infrastructure migration caused this partial outage.
The partial outage lasted ~10 hours. An internal operator discovered the issue while checking metrics; three operators root-caused and then mitigated it. Most Blockchain API users were affected.
Root Cause

Blockchain API relies on Prometheus to calculate which RPC providers to use. During the migration the app was updated with an incorrect Prometheus URL, breaking this functionality. As a result, more requests were sent to providers that could not handle them or that rate limited us.
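To make the failure mode concrete, here is a minimal sketch of Prometheus-driven provider selection. This is not the actual service code; the provider names and weight values are invented for illustration:

```python
import random

def pick_provider(weights: dict[str, float]) -> str:
    """Pick an RPC provider at random, proportionally to its weight."""
    providers = list(weights)
    return random.choices(providers, weights=[weights[p] for p in providers])[0]

# In the real service the weights would come from a Prometheus query
# (e.g. recent per-provider success rate). With a broken Prometheus URL
# that query fails, so traffic is no longer steered away from providers
# that are unhealthy or rate limiting us.
weights = {"provider-a": 0.7, "provider-b": 0.2, "provider-c": 0.1}
print(pick_provider(weights))
```

Once the weights stop updating, load distribution degenerates and overloaded providers start rejecting requests, which is what drove availability down to 90%.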
Why was the app updated with an incorrect Prometheus URL?
We had been migrating the app to a new AWS account and changed many variables, including this one.
Why was this not tested in staging?
We had attempted a zero-downtime migration in staging, which did not succeed, so we reverted to a migration with downtime, which we did not properly test in staging.
Why were we OK with the migration not being tested properly?
It was not clear to the team that basic functionality had not been tested in staging.
Why was this not caught earlier?
We did not check the metrics immediately after completing the migration.
Why was this not acted upon earlier?
We had a prior partial outage a few months ago caused by the same URL being updated incorrectly, so we knew the potential impact. However, when the operator acknowledged the issue it was unclear exactly what was broken.
What could we have done better?

- Fix the weights metric, which was broken in this case and has been broken before.
- Fix the availability metric, which is also broken.
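One concrete improvement along these lines would be a post-deploy smoke check that fails fast when the configured Prometheus URL does not answer. A hedged sketch follows; the URL is a placeholder and this is not our actual tooling, though `/-/ready` is Prometheus's standard readiness endpoint:

```python
import urllib.error
import urllib.request

def prometheus_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if Prometheus's readiness endpoint responds with HTTP 200."""
    try:
        # /-/ready answers 200 once Prometheus is ready to serve queries.
        with urllib.request.urlopen(f"{base_url}/-/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Example: gate the final migration step on the check (URL is a placeholder).
# if not prometheus_reachable("http://prometheus.internal:9090"):
#     raise SystemExit("Prometheus unreachable: aborting cutover")
```

Running a check like this immediately after cutover would have surfaced the bad URL in seconds instead of hours.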