Improve debuggability of chain halts #13404
Comments
An idea I have for debugging: take two data directories and inspect the last block in each. For both blocks (one from each data set), compare all fields, especially gas used. For the other two types, it's a bit more difficult. Perhaps we create a tool that encapsulates all of these tools into a single tool.
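The field-by-field comparison could be sketched as follows. This is a minimal illustration, not an existing tool: it assumes the last block of each data directory has already been exported to a flat dictionary (for example from JSON), and `diff_fields` plus the sample field names and values are hypothetical.

```python
from typing import Any


def diff_fields(good: dict[str, Any], bad: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {field: (good_value, bad_value)} for every field that differs.

    A field missing on one side is reported with None for that side.
    """
    diffs = {}
    for field in sorted(good.keys() | bad.keys()):
        a, b = good.get(field), bad.get(field)
        if a != b:
            diffs[field] = (a, b)
    return diffs


if __name__ == "__main__":
    # Hypothetical exports of the last block from two data directories.
    good_block = {"height": 100, "gas_used": 21000, "app_hash": "AA11"}
    bad_block = {"height": 100, "gas_used": 21500, "app_hash": "BB22"}
    for field, (a, b) in diff_fields(good_block, bad_block).items():
        print(f"{field}: good={a} bad={b}")
```

Sorting the field names keeps the report stable between runs, which makes two reports easy to compare by eye.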
For the IAVL Viewer, we've already created a tracking issue here: cosmos/iavl#567, as it was also requested in our DEV UX calls.
About app hash mismatches: one idea is to save the change stream of the last few blocks (based on ADR-038) and compare it with a good node's, which can pinpoint which tx caused the mismatch.
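A rough sketch of that comparison, assuming each node's change stream has already been decoded into an ordered list of records. The `Change` record and its `tx_index` field are hypothetical; they only loosely mirror the store key, key, and value pairs that ADR-038 streams.

```python
from itertools import zip_longest
from typing import NamedTuple, Optional


class Change(NamedTuple):
    tx_index: int            # position of the tx in the block (assumed to be known)
    store: str               # store key, e.g. "bank"
    key: bytes
    value: Optional[bytes]   # None stands for a delete


def first_divergence(good: list[Change], bad: list[Change]):
    """Return (position, good_change, bad_change) for the first differing record.

    Returns None if both streams are identical. If one stream is shorter,
    the missing side is reported as None.
    """
    for i, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            return i, g, b
    return None


if __name__ == "__main__":
    good = [Change(0, "bank", b"alice", b"10"), Change(1, "bank", b"bob", b"5")]
    bad = [Change(0, "bank", b"alice", b"10"), Change(1, "bank", b"bob", b"7")]
    print(first_divergence(good, bad))
```

Because state writes are deterministic, the first record that differs points at the tx to investigate; later differences are often just consequences of it.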
What I was thinking is that we have a tool/binary that is provided two data directories and returns to you debugging output for each of the three types of mismatches.
This is to give the developer or operator a high-level overview of where to start looking. In the end, further in-depth analysis will still be required, especially in the case of
@alexanderbez I'm looking into this. Do you have a way to reproduce or fetch two (or more) representative data directories, to demonstrate the usefulness of the tool?
The initial release of the
I'm looking forward to seeing examples of
@elias-orijtech are you able to open a PR adding the tool to the tools directory?
I can, but I suggest keeping it out of tools until we're happy with its UX and feature set. Personally, I'd like to see more chain halts that I can analyze and use to refine the tool.
Would love others to chime in, but personally I think adding it to the SDK or a tools repo, where users can see and analyse it, will get you feedback. Right now it's hard to get feedback when users don't know about it. My first feedback is that this is only for IAVL, but the issue talks about three problems: LastResultsHash, ConsensusHash, and AppHash. The tool seems to only work with app hash issues, and it's unclear how a user will identify which module the issue comes from. It's not hard to recreate chain halts and test that way too.
Then perhaps a good time to put this into wider use is when the tool can debug all three types of chain halts? That is, when this issue can be closed in favor of specific issues in the tool. If you have the time, please sketch a way to achieve realistic chain halts of the 3 interesting variants mentioned.
I don't think I've ever seen consensus hash issues.
One tool @yihuang worked on and used: https://github.com/crypto-com/python-iavl
And a how-to tutorial on probing app hash mismatch issues with iavl-viewer, written by @mmsqe and @JayT106:
Probe app.hash mismatch issue
The current Cosmos SDK stores the app data in the IAVL tree structure. Therefore, we need to use the IAVL tooling to retrieve the application data.
Pre-requisites
Building requires RocksDB to be installed on the system. Download rocksdb and do
macOS
Load application.db with iaviewer (built with embedded RocksDB)
Argument details:
Modules: If you are not sure which
Compare data set
Usually, you need two data sets to compare (one with the normal state and a suspect one with the incorrect state that causes the apphash mismatch) to know which part might have an issue. Probe both data sets, then diff the output to find which keys are different. You can also compare against a chain explorer, which can query the chain through RPC calls.
We might get a diff like this. It shows that the accounts have different balances at height 1603101; these are the keys:
We ignore the values because the output only shows a hash of each value, so we cannot know the real value behind it. If you compare the evm module, you can see that some keys are different, e.g.
So you can guess which transactions might relate to these accounts.
Query data from the node
The current Tendermint (v0.34.x) panics when it detects an apphash mismatch. So it's difficult to start a node service, load the problem data set, and then use RPC calls to check the data status, even though you might see more details that way. For example, call
Tendermint v0.35.x has another way to probe the data details, but it hasn't been integrated into the Cosmos SDK (v0.46) yet. Ref (TODO: update how to use inspect CLI in the Cosmos SDK)
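The "Compare data set" step in the tutorial above can be sketched in a few lines. This assumes the key dumps of one module from both data sets have already been parsed into dictionaries mapping each key to the hash of its value; the parsing is left out because the exact dump format is not shown here, and `diff_keys` and the sample keys are hypothetical.

```python
def diff_keys(good: dict[str, str], bad: dict[str, str]) -> dict[str, list[str]]:
    """Classify keys by how they differ between a good and a suspect data set.

    Both arguments map key (hex) -> hash of the value (hex).
    """
    return {
        "only_in_good": sorted(good.keys() - bad.keys()),
        "only_in_bad": sorted(bad.keys() - good.keys()),
        "changed": sorted(k for k in good.keys() & bad.keys() if good[k] != bad[k]),
    }


if __name__ == "__main__":
    # Hypothetical dumps of the same module at the same height.
    good = {"02aa": "h1", "02bb": "h2", "02cc": "h3"}
    bad = {"02aa": "h1", "02bb": "h9", "02dd": "h4"}
    for kind, keys in diff_keys(good, bad).items():
        print(kind, keys)
```

Since only value hashes are available, the result tells you which keys to look up, not what the values were.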
Summary
When a chain halts, we need to get as much information as possible, as easily as possible.
Problem Definition
Currently, the usual way to debug a chain halt is not that easy. We should provide tools and guides on how to debug and pinpoint the root cause of the most common failures.
Proposal
Provide tools and guides for solving:
wrong Block.Header.LastResultsHash.
wrong Block.Header.ConsensusHash.
wrong Block.Header.AppHash.
I think those are the most common errors for chain halts (besides panics).
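As a first triage step, a tool could report which of the three header hashes differ and suggest where to look. The sketch below is an assumption about how that might work, not an existing tool: it takes two block headers at the same height as dictionaries, using the snake_case field names of Tendermint's JSON header, and the hints are starting points only.

```python
# Header field -> where to start looking. The hints are suggestions, not a diagnosis.
HINTS = {
    "last_results_hash": "compare the tx results (code, data, gas) of the previous block",
    "consensus_hash": "compare the consensus params of both nodes",
    "app_hash": "diff the IAVL stores of both nodes, module by module",
}


def triage(good_header: dict, bad_header: dict) -> list[tuple[str, str]]:
    """Return (field, hint) for each of the three hashes that differs."""
    return [
        (field, hint)
        for field, hint in HINTS.items()
        if good_header.get(field) != bad_header.get(field)
    ]


if __name__ == "__main__":
    # Hypothetical headers for the same height from a good and a halted node.
    good = {"last_results_hash": "R1", "consensus_hash": "C1", "app_hash": "A1"}
    bad = {"last_results_hash": "R1", "consensus_hash": "C1", "app_hash": "A2"}
    for field, hint in triage(good, bad):
        print(f"{field} differs: {hint}")
```

The hint for each hash follows from what it commits to, so the output narrows the search before any store-level diffing starts.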
Some stuff that comes to mind: