Improve debuggability of chain halts #13404

Open
Tracked by #14566
facundomedica opened this issue Sep 27, 2022 · 13 comments
Labels
T: Dev UX UX for SDK developers (i.e. how to call our code) tooling dev tooling within the sdk

Comments

@facundomedica
Member

Summary

When a chain halts, we need to gather as much information as possible, as easily as possible.

Problem Definition

Currently, debugging a chain halt is not easy. We should provide tools and guides for debugging and pinpointing the root cause of the most common failures.

Proposal

Provide tools and guides for solving:

  • wrong Block.Header.LastResultsHash.
  • wrong Block.Header.ConsensusHash.
  • wrong Block.Header.AppHash.

I think those are the most common errors for chain halts (besides panics).

Some stuff that comes to mind:

  • Command to obtain a specific block (inputs and results) along with hashes, txs, etc. (even when the chain is halted)
  • Dump potentially helpful information to a file
  • Improved iavl-viewer + detailed how-to (maybe iterate over this https://github.com/cosmos/iavl/tree/master/cmd/iaviewer)
  • Remote debugging tools (like running iavl-viewer over http instead of having to pull the entire state?) so validators can give read-only access to chain devs if they desire to do so.
@facundomedica facundomedica added the T: Dev UX UX for SDK developers (i.e. how to call our code) label Sep 27, 2022
@alexanderbez
Contributor

Idea I have for debugging LastResultsHash:

Take two data directories and inspect the last block. For both blocks (from both data sets), compare all fields, especially gas used.
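
The field-by-field comparison described above could be sketched roughly as follows. This is only an illustration: the dictionary layout, the field names (`code`, `gas_wanted`, `gas_used`), and the sample numbers are assumptions made up for the example, not the output of any existing tool.

```python
def diff_tx_results(results_a, results_b):
    """Compare per-tx results from two nodes, field by field.

    Returns a list of (tx_index, field, value_a, value_b) tuples.
    The result layout is hypothetical; adapt it to however the block
    results are actually exported from each data directory.
    """
    diffs = []
    for index, (a, b) in enumerate(zip(results_a, results_b)):
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                diffs.append((index, field, a.get(field), b.get(field)))
    return diffs


# Made-up sample data: the second tx used a different amount of gas.
node_a = [
    {"code": 0, "gas_wanted": 200000, "gas_used": 61234},
    {"code": 0, "gas_wanted": 200000, "gas_used": 80211},
]
node_b = [
    {"code": 0, "gas_wanted": 200000, "gas_used": 61234},
    {"code": 0, "gas_wanted": 200000, "gas_used": 80219},
]

for tx_index, field, value_a, value_b in diff_tx_results(node_a, node_b):
    print(f"tx {tx_index}: {field} differs: {value_a} != {value_b}")
```

Only the txs present on both sides are compared here; a real tool would also want to report a differing tx count.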

For the other two types, it's a bit more difficult. Perhaps we could create a single tool that wraps all of these checks.

@tac0turtle tac0turtle added the tooling dev tooling within the sdk label Oct 2, 2022
@julienrbrt
Member

For the IAVL Viewer, we've already created a tracking issue here: cosmos/iavl#567, as it was also requested in our Dev UX calls.

@yihuang
Collaborator

yihuang commented Oct 3, 2022

For app hash mismatches, one idea is to save the change stream of the last few blocks (based on ADR-038) and compare it with that of a good node; this can pinpoint which tx caused the mismatch.
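
A minimal sketch of that comparison, assuming each node's change stream has already been exported as an ordered list of `(tx_id, changes)` pairs, where `changes` is a list of `(store, key, value)` tuples. That layout and the sample data are hypothetical; they are not the ADR-038 wire format.

```python
def first_divergence(good, bad):
    """Find the first tx whose state changes differ between two nodes.

    good/bad: lists of (tx_id, [(store, key, value), ...]) in execution
    order. Only the txs present in both lists are compared.
    Returns None when no difference is found.
    """
    for index, (g, b) in enumerate(zip(good, bad)):
        tx_good, changes_good = g
        tx_bad, changes_bad = b
        if tx_good != tx_bad or changes_good != changes_bad:
            return {
                "index": index,
                "tx": (tx_good, tx_bad),
                "only_in_good": [c for c in changes_good if c not in changes_bad],
                "only_in_bad": [c for c in changes_bad if c not in changes_good],
            }
    return None


# Made-up sample data: tx2 writes a different staking value on the bad node.
good_node = [
    ("tx1", [("bank", "balA", "100")]),
    ("tx2", [("bank", "balB", "7"), ("staking", "val1", "50")]),
]
bad_node = [
    ("tx1", [("bank", "balA", "100")]),
    ("tx2", [("bank", "balB", "7"), ("staking", "val1", "51")]),
]

print(first_divergence(good_node, bad_node))
```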

@alexanderbez
Contributor

alexanderbez commented Oct 3, 2022

What I was thinking is that we have a tool/binary that takes two data directories and returns debugging output for each of the three types of mismatches.

  • For LastResultsHash it inspects the gas used between the two states and reports a diff if any
  • For ConsensusHash it inspects the entire structures between the two states and reports a diff if any
  • For AppHash it reports the module hashes of each module between the two states and execution order of txs and begin/end block

This is to give the developer or operator a high-level overview of where to start looking. In the end, further in-depth analysis will still be required, especially in the case of an AppHash mismatch.
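
For the AppHash case, the per-module report could look something like this sketch. It assumes the per-module hashes of each node have already been extracted into a plain mapping from module name to hex string; how to obtain them is left open, and the module names and hashes are invented for the example.

```python
def diff_module_hashes(hashes_a, hashes_b):
    """Return (module, hash_a, hash_b) for every module whose hash differs.

    A module missing on one side is reported with None for that side.
    """
    report = []
    for module in sorted(set(hashes_a) | set(hashes_b)):
        hash_a = hashes_a.get(module)
        hash_b = hashes_b.get(module)
        if hash_a != hash_b:
            report.append((module, hash_a, hash_b))
    return report


# Made-up sample data: only the bank store diverged.
node_a = {"acc": "AA01", "bank": "BB02", "staking": "CC03"}
node_b = {"acc": "AA01", "bank": "BB99", "staking": "CC03"}

for module, hash_a, hash_b in diff_module_hashes(node_a, node_b):
    print(f"module {module}: {hash_a} != {hash_b}")
```

Narrowing the mismatch down to one module tells the operator which store to dump and diff next.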

@eliasnaur

@alexanderbez I'm looking into this. Do you have a way to reproduce or fetch two (or more) representative data directories, to demonstrate the usefulness of the tool?

@elias-orijtech
Contributor

The initial release of the chdbg tool can now help diagnose ConsensusHash differences:

go run github.com/orijtech/chdbg bns-a.db bns-b.db
chdbg: hash mismatch: 96AAD58DBDF2BA87D90BE1F620E80AC3D1662B5113A7667B51303596163A5969 != 56E581EBD9C0A3D726A91579839F7FF8A9251BEB063FDF0FA0415A0B3429DF6E
chdbg: key _i.bchnft_owner:4C97A7423B1782D7C8CAB362247B848DEC96B1EC: key proofs differ
chdbg: key _i.bchnft_owner:E28AE9A6EB94FC88B73EB7CBD6B87BF93EB9BEF0: key proofs differ
chdbg: key _i.tkrnft_owner:E28AE9A6EB94FC88B73EB7CBD6B87BF93EB9BEF0: key proofs differ
chdbg: key _i.usrnft_chainaddr:1152542575310734325L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:12256717727036376470L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:14285752342776807606L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:177168082075485743L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:2980033962229439650L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:3070406526139113375L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:8565302995323734695L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdgb: ... (additional diffs omitted)
chdbg: database mismatch at version 190258 with 88 differences
exit status 2

I'm looking forward to seeing examples of LastResultsHash or AppHash mismatches to develop the tool further.

@tac0turtle
Member

@elias-orijtech are you able to open a PR adding the tool to the tools directory?

@elias-orijtech
Contributor

I can, but I suggest keeping it out of tools until we're happy with its UX and feature set. Personally, I'd like to see more chain halts that I can analyze and use to refine the tool.

@tac0turtle
Member

Would love others to chime in, but personally I think adding it to sdk or a tools repo so users can see it and analyse it will get you feedback. Right now it's hard to get feedback when users don't know about it.

The first feedback is that this is only for IAVL, but the issue talks about three problems: LastResultsHash, ConsensusHash, and AppHash. The tool seems to work only with app hash issues, and it's unclear how a user will identify which module the issue comes from. It's not hard to recreate chain halts and test that way too.

@elias-orijtech
Contributor

Then perhaps a good time to put this into wider use is when the tool can debug all three types of chain halts? That is, when this issue can be closed in favor of specific issues in the tool.

If you have the time, please sketch a way to achieve realistic chain halts of the 3 interesting variants mentioned.

@ValarDragon
Contributor

ValarDragon commented Dec 13, 2022

  • LastResultsHash: Just increment gas in some message, and attempt to replay on same network
  • AppHash: use random to change the value you write to a store

I don't think I've ever seen consensus hash issues.

@tac0turtle tac0turtle mentioned this issue Jan 10, 2023
6 tasks
@tomtau
Contributor

tomtau commented Jan 11, 2023

One tool @yihuang worked on and used: https://github.com/crypto-com/python-iavl
it's a step up from iavl-viewer

@tomtau
Contributor

tomtau commented Jan 11, 2023

And a howto tutorial on probing app hash mismatch issues with iavl-viewer, written by @mmsqe and @JayT106 :

Probe app.hash mismatch issue

The current Cosmos SDK stores app data in an IAVL tree structure, so we need the IAVL tooling to retrieve the application data.

Pre-requisites

  1. Install iaviewer.
  2. Check the backend db type: i.e. rocksdb, leveldb, or others.
  3. The default iaviewer build uses leveldb as the backend db. To build with rocksdb support,
    use the customized branch https://github.com/JayT106/iavl/tree/rocksdb-support (TODO: clean up and submit PR to the upstream repo) and run
CGO_CFLAGS="-I/usr/local/include/rocksdb" CGO_LDFLAGS="-L/usr/local/lib/rocksdb -lrocksdb -lstdc++ -lm -lz -lbz2 -lsnappy -llz4 -lzstd" \
go build -tags rocksdb ./cmd/iaviewer/

This requires rocksdb to be installed on the system: download rocksdb and run

make install

macOS

  1. Install rocksdb 6.29.3
curl -O https://raw.githubusercontent.com/Homebrew/homebrew-core/5a6d7658c8686b3326f69c5dd11d08800586ad9c/Formula/rocksdb.rb && brew install rocksdb.rb
  2. Build with -tags rocksdb
CGO_CFLAGS="-I/usr/local/Cellar/rocksdb/6.29.3/include" \
CGO_LDFLAGS="-L/usr/local/Cellar/rocksdb/6.29.3/lib -lrocksdb -lstdc++ -lm -lz -lbz2 -lsnappy -llz4 -lzstd -L/usr/local/Cellar/snappy/1.1.9/lib -L/usr/local/Cellar/lz4/1.9.3/lib/ -L /usr/local/Cellar/zstd/1.5.2/lib/"  \
go build -tags rocksdb ./cmd/iaviewer/

Load application.db with iaviewer (built with RocksDB support)

./iaviewer [data/shape/versions/balance/nonce] [application.db path] [s/k:module name/] <version> <addr>

For example:
export VER=1600000
export ADDR=57B4B1d6ecC292910840CEdeDE87884b254d4738
./iaviewer balance /chain/.cronosd/data/application.db/ "s/k:bank/" $VER $ADDR

Argument details:

  • data: iterates over the database and returns each key with the hash of its value; requires the version argument
  • shape: returns the tree shape; requires the version argument
  • versions: returns the available versions (block heights) in the database
  • balance: returns the balance for the given account address (bank module only); requires the addr argument
  • nonce: returns the nonce for the given account address (acc module only); requires the addr argument

Modules:

If you are not sure which store key a module uses, you can usually find it in x/[module]/key.go in the project or in the Cosmos SDK.
For example, these are some of the modules whose data we can check:

"s/k:bank/" "s/k:evm/" "s/k:acc/" "s/k:ibc/"

Compare data set

Usually, you need two data sets to compare: one with the normal state and another suspected of having the incorrect state that causes the app hash mismatch. Probe both data sets, then diff the output to find which keys differ. You can also cross-check the results against a chain explorer, which can query the chain through RPC calls.

./iaviewer data /data2/data/application.db/ $MOD $VER > data_ibc_control ; ./iaviewer data /chain/.cronosd/data/application.db/ $MOD $VER > data_ibc_normal

diff data_ibc_control data_ibc_normal
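
When the dumps are large, the same comparison can be sketched in Python so that only the differing lines are kept. This sketch makes no assumption about the iaviewer output format beyond it being line-oriented; the function names are hypothetical, and the file names are the ones used in the commands above.

```python
from pathlib import Path


def diff_dump_text(text_a, text_b):
    """Return (only_in_a, only_in_b): the lines unique to each dump."""
    lines_a = {line for line in text_a.splitlines() if line.strip()}
    lines_b = {line for line in text_b.splitlines() if line.strip()}
    return sorted(lines_a - lines_b), sorted(lines_b - lines_a)


def diff_dump_files(path_a, path_b):
    """Same comparison, reading the two dumps from disk."""
    return diff_dump_text(Path(path_a).read_text(), Path(path_b).read_text())


if __name__ == "__main__":
    control = Path("data_ibc_control")
    normal = Path("data_ibc_normal")
    if control.exists() and normal.exists():
        only_control, only_normal = diff_dump_files(control, normal)
        print(f"{len(only_control)} lines only in control")
        print(f"{len(only_normal)} lines only in normal")
```

Because the comparison is set-based, it ignores line order and duplicate lines, which plain diff does not.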

We might get a diff like this. It shows that the accounts have different balances at height 1603101; these are the keys:

02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F
02143B368AF83F84A63E7A1E56715EBAAA9351A6DABD6261736563726F
02145C7F8A570D578ED84E63FDFA7B1EE72DEAE1AE236261736563726F
0214F1829676DB577682E944FC3493D451B67FF3E29F6261736563726F

We ignore the values because the dump shows only a hash of each value, so we cannot tell what the real value is.
The key 02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F combines a prefix, an address, and a denom. You need to check the implementation to know how it is stored: 0214 - prefix, 3299D5EEE1934480072E21D6747DABE7B4D4A73D - account, 6261736563726F - denom (in this case, basecro).
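
The split described above can be reproduced with a few lines of Python. This sketch assumes the layout matches the example key: one prefix byte (02), one address-length byte (0x14, i.e. 20), the address, and then the denom as ASCII. The function name is hypothetical, and other modules or SDK versions may lay out their keys differently.

```python
def split_balance_key(hex_key):
    """Split a bank balance key into (prefix, address_hex, denom).

    Assumed layout: 1 prefix byte, 1 address-length byte, the address,
    then the denom as ASCII text.
    """
    raw = bytes.fromhex(hex_key)
    prefix = raw[0]
    address_length = raw[1]
    address = raw[2:2 + address_length]
    denom = raw[2 + address_length:].decode("ascii")
    return prefix, address.hex().upper(), denom


key = "02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F"
print(split_balance_key(key))
# prints (2, '3299D5EEE1934480072E21D6747DABE7B4D4A73D', 'basecro')
```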

If you compare the evm module, you can see that some keys differ, e.g.:

WBTC
0x062E66477Faf219F25D27dCED647BF57C3107d52
Crona LPs (Crona-LP)
0x285a569EDD6210a0410883d2E29471A6B0c7790d
Wrapped CRO (WCRO)
0x5C7F8A570d578ED84E63fdFA7b1eE72dEae1AE23
Crona LPs (Crona-LP)
0x5cc953f278bf6908B2632c65D6a202D6fd1370f9
Crona LPs (Crona-LP)
0xb4684F52867dC0dDe6F931fBf6eA66Ce94666860
USD Coin (USDC, CronosCRC20)
0xc21223249CA28397B4B6541dfFaEcC539BfF0c59
Wrapped Ether (WETH)
0xe44Fd7fCb2b1581822D0c862B68222998a0c299a

From these, you can guess which transactions might relate to these accounts.

Query data from the node via RPC

The current Tendermint (v0.34.x) panics when it detects an app hash mismatch, so it's difficult to start a node service, load the problem data set, and then use RPC calls to check the data status, which might reveal more details. For example, you could call eth_getTransactionReceipt to get the transaction result from the problem data set. Currently, we hacked Tendermint to bypass the handshake step. You can build the application with this

Tendermint v0.35.x has another way to probe the data details, but it has not yet been integrated into the Cosmos SDK (v0.46). Ref (TODO: update how to use inspect CLI in the Cosmos SDK)
