On June 21st, 2023, from 8am to noon CET, the AP region was partially down. Customers routed to AP by the latency-based routing policy were unable to establish pairings. An estimated few hundred customers were affected.
Resolution took ~20 minutes once the operator acted. The 3.5h delay in acting was due to the operator assuming it was a non-issue, as the AP region had raised false alarms before.
Alarms fired at 8am CET. The operator assumed this was a non-issue in AP. Another operator connected to an AP websocket; since this worked, they also concluded it was a non-issue. Initial customer reports came from customers who had previously reported poor-quality issues, which we attributed to them being blocked behind the Chinese firewall.
At 11:30am CET, when the issue escalated further, the team jumped on a call, investigated the issue while keeping a quick resolution at hand, and ended up executing the quick resolution.
Resolution was to unplug AP from DNS.
Restarting the AP instances resolved the issue fully. AP was then added back to DNS.
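As an illustration of the mitigation, the sketch below builds the kind of change request that removes one region's latency record from DNS. The record name, hosted-zone ID, and region identifier are hypothetical; the dictionary shape follows the Route 53 ChangeResourceRecordSets API, assuming the pairing endpoint is served by per-region latency-based records.

```python
# Hypothetical sketch of "unplugging" a region from latency-based DNS.
# Record names and values below are made up for illustration; only the
# ChangeBatch structure follows the Route 53 ChangeResourceRecordSets API.

def remove_region_change_batch(region: str, record_value: str) -> dict:
    """Build a change batch that deletes one region's latency record."""
    return {
        "Comment": f"Incident mitigation: remove {region} from rotation",
        "Changes": [{
            "Action": "DELETE",
            "ResourceRecordSet": {
                "Name": "relay.example.com.",  # hypothetical record name
                "Type": "CNAME",
                "SetIdentifier": region,       # one record set per region
                "Region": region,              # latency-based routing key
                "TTL": 60,
                "ResourceRecords": [{"Value": record_value}],
            },
        }],
    }

batch = remove_region_change_batch("ap-southeast-1", "ap.relay.example.com")
# With boto3 this batch would be submitted roughly as:
# route53.change_resource_record_sets(HostedZoneId="Z...", ChangeBatch=batch)
```

Until clients' cached records expire (the TTL above), some traffic would still reach AP, which is consistent with the ~20 minute window between acting and full resolution.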
A root-cause hypothesis was built and a fix was rolled out on June 22nd at 10am CET.
The system is deployed in US/EU/AP regions. One of the databases we use is deployed in an active/passive manner with global replication. The active instances are deployed in the EU.
An internal process that ensures our message delivery guarantee caused a spike in writes from AP to the EU-based active database instances. The high latency from AP to EU, combined with a relatively small number of database connections, caused the writes to accumulate.
As a result, regular traffic was unable to retrieve responses within client-side timeouts/retries, degrading the user experience.
This was fixed by adding more connections and adjusting timeouts. We will also improve the internal message-delivery-guarantee process so it does not cause such traffic spikes.
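The accumulation mechanism above can be sketched with a simple back-of-the-envelope model: with a fixed pool of synchronous connections, sustained throughput is capped by pool size divided by round-trip latency, and any excess write rate becomes backlog. All numbers below are illustrative assumptions, not measured values from the incident.

```python
# Little's-law style sketch of why a small pool plus high cross-region
# latency backs up: throughput is capped at pool_size / rtt_seconds, and
# any write rate above that cap accumulates as a growing queue.

def max_write_throughput(pool_size: int, rtt_seconds: float) -> float:
    """Writes/second a synchronous connection pool can sustain at the given RTT."""
    return pool_size / rtt_seconds

def backlog_after(write_rate: float, pool_size: int, rtt_seconds: float,
                  duration_seconds: float) -> float:
    """Writes queued up after sustaining `write_rate` for `duration_seconds`."""
    excess = max(0.0, write_rate - max_write_throughput(pool_size, rtt_seconds))
    return excess * duration_seconds

# Assumed numbers: 10 connections, 250 ms AP->EU round trip.
print(max_write_throughput(10, 0.25))    # 40.0 writes/s sustainable
print(backlog_after(100, 10, 0.25, 60))  # 3600.0 writes queued after one minute
```

This also shows why both fixes help: adding connections raises the throughput cap, while smoothing the spike keeps the write rate below it.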
- Wallet teams to provide error messages when connections fail such that we can identify/fix issues faster
- Sample dapps and wallets to be consistently outfitted with fast client_id discovery such that investigations can proceed faster
- Done: Better logging around the Mailbox Gossip
- Done: Alarm when the remove metric stops
- Propose changes to the Mailbox Gossip approach to not cause such traffic spikes