On June 21st, 2023, from 8am to noon CET, the AP region was partially down. Customers routed to AP by the latency-based routing policy were unable to establish pairings. An estimated few hundred customers were affected.
Resolution took ~20 minutes once the operator acted. The 3.5h delay in acting was due to the operator assuming it was a non-issue, as the AP region had raised false alarms before.
Alarms fired at 8am CET. The operator assumed this was a non-issue in AP. Another operator connected to an AP websocket; since this worked, they also concluded it was a non-issue. Initial customer reports came from customers who had previously reported poor-quality issues, which we attributed to them being blocked behind the Chinese firewall.
At 11:30am CET, when the issue escalated further, the team jumped on a call, investigated the issue while keeping a quick resolution at hand, and ended up executing the quick resolution.
Resolution was to unplug AP from DNS.
Restarting the AP instances resolved the issue fully. AP was then added back to DNS.
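As an illustration of the mitigation, the sketch below builds the kind of change request that removes one region's latency record from DNS. The record name, hosted-zone ID, and region identifier are hypothetical; the dictionary shape follows the Route 53 ChangeResourceRecordSets API, assuming the pairing endpoint is served by per-region latency-based records.

```python
# Hypothetical sketch of "unplugging" a region from latency-based DNS.
# Record names and values below are made up for illustration; only the
# ChangeBatch structure follows the Route 53 ChangeResourceRecordSets API.

def remove_region_change_batch(region: str, record_value: str) -> dict:
    """Build a change batch that deletes one region's latency record."""
    return {
        "Comment": f"Incident mitigation: remove {region} from rotation",
        "Changes": [{
            "Action": "DELETE",
            "ResourceRecordSet": {
                "Name": "relay.example.com.",  # hypothetical record name
                "Type": "CNAME",
                "SetIdentifier": region,       # one record set per region
                "Region": region,              # latency-based routing key
                "TTL": 60,
                "ResourceRecords": [{"Value": record_value}],
            },
        }],
    }

batch = remove_region_change_batch("ap-southeast-1", "ap.relay.example.com")
# With boto3 this batch would be submitted roughly as:
# route53.change_resource_record_sets(HostedZoneId="Z...", ChangeBatch=batch)
```

Until clients' cached records expire (the TTL above), some traffic would still reach AP, which is consistent with the ~20 minute window between acting and full resolution.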
A root-cause hypothesis was built and a fix was rolled out on June 22nd at 10am CET.
The system is deployed in US/EU/AP regions. One of the databases we use is deployed in an active/passive manner with global replication. The active instances are deployed in the EU.
An internal process that ensures our message delivery guarantee caused a spike in writes from AP to the EU-based active database instances. The high latency from AP to EU, combined with a relatively small number of database connections, caused the writes to accumulate.
As a result, regular traffic was unable to retrieve responses within client-side timeouts/retries, degrading the user experience.
This was fixed by adding more connections and adjusting timeouts. We will also improve the internal message-delivery-guarantee process so it does not cause such traffic spikes.
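The accumulation mechanism above can be sketched with a simple back-of-the-envelope model: with a fixed pool of synchronous connections, sustained throughput is capped by pool size divided by round-trip latency, and any excess write rate becomes backlog. All numbers below are illustrative assumptions, not measured values from the incident.

```python
# Little's-law style sketch of why a small pool plus high cross-region
# latency backs up: throughput is capped at pool_size / rtt_seconds, and
# any write rate above that cap accumulates as a growing queue.

def max_write_throughput(pool_size: int, rtt_seconds: float) -> float:
    """Writes/second a synchronous connection pool can sustain at the given RTT."""
    return pool_size / rtt_seconds

def backlog_after(write_rate: float, pool_size: int, rtt_seconds: float,
                  duration_seconds: float) -> float:
    """Writes queued up after sustaining `write_rate` for `duration_seconds`."""
    excess = max(0.0, write_rate - max_write_throughput(pool_size, rtt_seconds))
    return excess * duration_seconds

# Assumed numbers: 10 connections, 250 ms AP->EU round trip.
print(max_write_throughput(10, 0.25))    # 40.0 writes/s sustainable
print(backlog_after(100, 10, 0.25, 60))  # 3600.0 writes queued after one minute
```

This also shows why both fixes help: adding connections raises the throughput cap, while smoothing the spike keeps the write rate below it.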
- Wallet teams to provide error messages when connections fail such that we can identify/fix issues faster
- Sample dapps and wallets to be consistently outfitted with fast client_id discovery such that investigations can proceed faster
- Done: Better logging around the Mailbox Gossip
- Done: Alarm when the remove metric stops
- Propose changes to the Mailbox Gossip approach to not cause such traffic spikes