Please see our post here describing the incident and its causes. This document focuses on a detailed description of the steps taken to resolve the issue.
TL;DR
The Nodle Parachain has been bricked since its last upgrade failed. This proposal uses the forceSetCurrentCode call on Polkadot to restart the parachain promptly.
Solution
The solution presented below is intended to preserve the existing parachain state and does not require reverting previously approved transactions, which would break finality guarantees.
It appears everything in the recent upgrade was sound, with all state transitions being correct and intended. However, the migration code associated with the upgrade takes too long and caused the proof of validity (PoV) to grow beyond what the relay chain validators tolerate. Because the migration runs right when the runtime is upgraded, the parachain halted block production: the next block would always be rejected by the Polkadot validators. As a result, the head of the parachain did not progress to any undesired state and can therefore be reused.
This means that it should be possible to unbrick the parachain by force-setting the current code to a new runtime that is very similar to the stored runtime. To that end, a new runtime has been prepared with the changes below:
Once this code is forced on the relay chain, collators can use the --wasm-runtime-overrides flag to make their nodes use the wasm code that is recognized by the relay chain.
Testing and verification
Here are the steps we took to test this proposal:
Proposal Preimage
Our proposal is paras -> forceSetCurrentCode(para: 2026, newCode: [our good runtime]), leading to a preimage hash of "0xbdeb173184c3a932473b7921c0feb233900bbe9228c54dc346239a93b186e9cc" and a preimage length of 1216833, as shown in the accompanying screenshot.
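For readers who want to cross-check a preimage hash and length themselves, here is a minimal sketch of how such values can be derived from the SCALE-encoded call. It assumes the preimage is hashed with BLAKE2b-256 over the encoded call bytes, that the para ID is encoded as a little-endian u32, and that the runtime blob is a length-prefixed byte vector. The function names are hypothetical, and the pallet and call indices are deliberately left as parameters: they must be read from the relay chain metadata, and the values in the demo are placeholders only.

```python
import hashlib
from typing import Tuple


def scale_compact_len(n: int) -> bytes:
    """SCALE compact-encode a non-negative integer below 2**30."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 1 << 6:
        return bytes([n << 2])                          # single-byte mode
    if n < 1 << 14:
        return ((n << 2) | 0b01).to_bytes(2, "little")  # two-byte mode
    if n < 1 << 30:
        return ((n << 2) | 0b10).to_bytes(4, "little")  # four-byte mode
    raise ValueError("big-integer mode is not handled in this sketch")


def encode_force_set_current_code(
    pallet_index: int, call_index: int, para_id: int, code: bytes
) -> bytes:
    """Encode the call as: pallet index, call index, para id (u32 LE), Vec<u8> code.

    pallet_index and call_index are assumptions supplied by the caller;
    take them from the relay chain metadata rather than from this sketch.
    """
    return (
        bytes([pallet_index, call_index])
        + para_id.to_bytes(4, "little")
        + scale_compact_len(len(code))
        + code
    )


def preimage_hash_and_len(encoded_call: bytes) -> Tuple[str, int]:
    """Return the 0x-prefixed BLAKE2b-256 hash and the byte length of the call."""
    digest = hashlib.blake2b(encoded_call, digest_size=32).hexdigest()
    return "0x" + digest, len(encoded_call)


if __name__ == "__main__":
    # Placeholder indices and a dummy blob, NOT the real proposal inputs.
    dummy_call = encode_force_set_current_code(0, 0, 2026, b"\x00" * 100)
    print(preimage_hash_and_len(dummy_call))
```

To compare against the values quoted in this proposal, the dummy blob would be replaced with the actual runtime wasm file and the placeholder indices with those from the metadata; the sketch makes no claim about what those indices are.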
Prevention Strategy
We intend to work with the Polkadot community to improve existing testing tools so that they detect issues similar to the one being fixed today. Namely, we would like to extend try-runtime so that it at least reports when the PoV for a runtime upgrade grows beyond the 5 MB limit that is set for validators. Alternatively, this could take the form of an independent test tool.
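The kind of check we have in mind can be sketched as follows. This is an illustration only, not an existing try-runtime feature: the function names are hypothetical, the measured PoV size is assumed to be supplied by whatever tool executes the migration, and the "5MB" limit is assumed here to mean 5 * 1024 * 1024 bytes.

```python
from typing import Tuple

# Assumption: the "5MB" validator limit is read as 5 MiB.
MAX_POV_BYTES = 5 * 1024 * 1024


def check_pov_size(pov_bytes: int, limit: int = MAX_POV_BYTES) -> Tuple[bool, float]:
    """Return (fits, ratio), where ratio is the measured size over the limit."""
    if pov_bytes < 0:
        raise ValueError("PoV size must be non-negative")
    return pov_bytes <= limit, pov_bytes / limit


def report(label: str, pov_bytes: int, limit: int = MAX_POV_BYTES) -> str:
    """Format a one-line verdict suitable for a CI log."""
    fits, ratio = check_pov_size(pov_bytes, limit)
    status = "OK" if fits else "TOO LARGE"
    return f"{label}: {pov_bytes} bytes ({ratio:.0%} of limit) {status}"


if __name__ == "__main__":
    # Hypothetical measurement for an upgrade block that includes a migration.
    print(report("runtime upgrade", 6 * 1024 * 1024))
```

A check like this could fail an upgrade pipeline whenever the verdict is not OK, so that an oversized migration is caught before it reaches the relay chain.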
Regardless of what tool is developed, we will enforce the benchmarking of any migration code in our upgrade process.
Additionally, we would like to investigate whether patches could be contributed to the Polkadot codebase to favor the liveness of parachains by falling back to the last known good runtime when an upgrade or migration fails. Because this looks like a much bigger effort given our team's expertise, it will need further discussion.