Background
On Friday, 9 Jul 2021, the state of the wakuv2.prod fleet was as follows:
- Nodes were running nim-waku release v0.4.
- nim-waku chat2 clients reported issues with accessing store functionality when attempting to connect to prod. Issue reported here. Importantly, the js-waku client did not have any similar issues.
- All prod fleet nodes reported violations of the GossipSub backoff period that other clients have to respect before attempting a reconnection. The effect was that prod nodes failed to connect to each other and form a mesh.
- There were indications of possible SQLite DB corruption (logs indicated message storage failure, select * queries returned unexpected results). This has not been fully investigated yet.
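To make the backoff violation concrete, here is a minimal sketch of how a node might track the backoff period after pruning a peer and flag a reconnection attempt that arrives too early. It is not the nim-waku or nim-libp2p implementation. The class name, method names, and the 60-second default are assumptions chosen for illustration.

```python
from typing import Dict, Tuple


class BackoffTracker:
    """Hypothetical helper: remembers when each (peer, topic) backoff ends."""

    def __init__(self, backoff_seconds: float = 60.0) -> None:
        # Assumed default; the real value is set by the GossipSub configuration.
        self.backoff_seconds = backoff_seconds
        self._expiry: Dict[Tuple[str, str], float] = {}

    def on_prune_sent(self, peer: str, topic: str, now: float) -> None:
        # After pruning a peer from a topic mesh, start its backoff window.
        self._expiry[(peer, topic)] = now + self.backoff_seconds

    def on_graft_received(self, peer: str, topic: str, now: float) -> bool:
        # True if the reconnection attempt is acceptable,
        # False if it arrives before the backoff window has ended.
        expiry = self._expiry.get((peer, topic))
        return expiry is None or now >= expiry


tracker = BackoffTracker(backoff_seconds=60.0)
tracker.on_prune_sent("peerA", "example-topic", now=100.0)
early = tracker.on_graft_received("peerA", "example-topic", now=130.0)
late = tracker.on_graft_received("peerA", "example-topic", now=160.0)
```

In this sketch `early` is a violation and `late` is accepted. If every node rejects every reconnection attempt in this way, no mesh can form, which matches the symptom described above.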
As far as can be established, the above-mentioned issues were present on the prod fleet for at least a week.
On the day, the chat2bridge to Matterbridge/Discord, which had been offline for about a week, was also redeployed off the latest nim-waku master. This meant that the prod fleet nodes, on top of their failure to connect to each other, didn’t support the ping keep-alive mechanism used by chat2bridge.
Steps taken on 9 Jul
It seemed likely that either DB corruption, unexpected behaviour of the deprecated keep-alive mechanism, or both caused the emergent issues. The exact cause is still the topic of an ongoing debugging investigation.
Based on the above, the following was done at around 8 AM UTC:
- As first priority, attempted to get the prod fleet in a stable state by redeploying off latest master. Jenkins job here.
- In parallel, tried to debug the issues.
Impact on prod fleet
The redeployment had the following effects on prod:
- nim-waku clients could again connect to the prod fleet and access store functionality.
- Error logs related to possible DB corruption, backoff violations, and keep-alive issues disappeared.
- Connection to chat2bridge was not restored (this may have been related to the inconsistent Peer table).
Overall, the stability of the prod fleet was restored after the upgrade. The plan then was to continue debugging the cause of the original issues, fix connectivity to chat2bridge and communicate to clients that the fleet is usable again.
Impact on js-waku client
The upgrade changed the relay protocol ID advertised by the prod fleet to the stable /vac/waku/relay/2.0.0. Since the released version of the js-waku client does not support this protocol ID, the upgrade caused js-waku clients to fail to connect to prod. The protocol ID issue is tracked here.
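The failure mode can be sketched as a protocol negotiation that only succeeds on an exact string match between the IDs the dialer proposes and the IDs the listener supports. This is a simplified illustration, not the libp2p or js-waku code. The function name is hypothetical, and the assumption that the released js-waku client offers only the 2.0.0-beta2 ID is inferred from this report.

```python
from typing import Iterable, Optional

RELAY_STABLE = "/vac/waku/relay/2.0.0"
RELAY_BETA2 = "/vac/waku/relay/2.0.0-beta2"


def negotiate(dialer_protocols: Iterable[str],
              listener_protocols: Iterable[str]) -> Optional[str]:
    """Return the first dialer protocol ID the listener also supports.

    IDs are compared as whole strings, so a "-beta2" suffix makes an
    otherwise identical ID a different protocol.
    """
    supported = set(listener_protocols)
    for proto in dialer_protocols:
        if proto in supported:
            return proto
    return None


# Assumed setup: client offers only beta2, upgraded fleet offers only stable.
after_upgrade = negotiate([RELAY_BETA2], [RELAY_STABLE])
# Fleet on release v0.4 advertises beta2 again.
after_revert = negotiate([RELAY_BETA2], [RELAY_BETA2])
```

Here `after_upgrade` is `None` (no common protocol) and `after_revert` is the beta2 ID. A client that offered both IDs would have been able to connect in either case.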
Franck and external users of the js-waku client reported the regression around 8 AM UTC on Monday, 12 Jul. This blocked their progress, as they had previously been able to connect to prod nodes despite the issues.
Steps taken on 12 Jul
The following steps were taken to revert the changes to the prod fleet:
- Hanno: Redeploy release v0.4 to prod
- Arthur: Recreate the SQLite DB (both Peer and Message table)
- Arthur: Restore connectivity between prod fleet nodes
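Before a suspect database is recreated, SQLite's built-in `PRAGMA integrity_check` can help confirm whether the file is actually corrupt. The sketch below is a generic check using Python's standard `sqlite3` module. It was not part of the steps taken, and the function name and the path in the usage comment are hypothetical.

```python
import sqlite3
from typing import List


def integrity_problems(db_path: str) -> List[str]:
    """Return the problems reported by SQLite, or an empty list if none."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    # SQLite reports a single row containing "ok" when no problems are found.
    return [] if messages == ["ok"] else messages


# Hypothetical usage against a node's database file:
# problems = integrity_problems("/path/to/store.sqlite3")
```

Note that `sqlite3.connect` creates the file if it does not exist, so the path should be checked first when running this against a real node.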
Current state of prod fleet
The current state of the wakuv2.prod fleet:
- Nodes are running nim-waku release v0.4, with relay protocol ID /vac/waku/relay/2.0.0-beta2
- Connectivity between nodes has been restored
- Connectivity to the chat2bridge has not been restored. This will require either an upgrade of prod or a downgrade of chat2bridge.
The redeployment and recreation of the DBs seem to have fixed the earlier keep-alive and connectivity issues. js-waku clients report that they can connect to prod as before.
Lessons learned
- Waku incident channel: prod incidents and status updates should be clearly communicated. The #waku-network Discord channel could be used as “command centre” for incidents.
- Strict upgrade procedure: prod upgrades should always be done in a coordinated fashion. It requires general agreement from all clients after informing them of possible impact.
- Only run releases on prod: prod should only run released versions of nim-waku, unless there is an urgent reason not to (e.g. unforeseen and critical bugs in a release, etc.)
Next steps
- Determine scope for next nim-waku release. Discuss impact with other Waku v2 clients.
- Upgrade prod with release version.
- Verify that:
  - [ ] All clients connect as expected to the upgraded prod fleet
  - [ ] Connectivity between prod fleet nodes is stable
  - [ ] prod nodes correctly connect and relay to the chat2bridge to Discord
- Continue investigating the original causal issues, e.g. "Error: unhandled exception: Stream EOF! [LPStreamEOFError]" (status-im/nim-waku issue #659) and "Some nodes in prod fleet seem to not relay messages" (status-im/nim-waku issue #637).