Regular node does not handle two NewBlocks events in the same L1 block #252

jbearer opened this issue Oct 12, 2023 · 1 comment
jbearer commented Oct 13, 2023

I have a hypothesis, but little evidence. I think this has to do with an(other) L1 reorg. Basically, my hypothesis is that the zkevm node fetched events from Infura at a time when block 4477007 contained only one transaction from the commitment task. That block then got replaced by one containing two transactions from the commitment task, but the zkevm node had already moved on to processing block 4477008, so it missed the second transaction. But why didn't the zkevm node detect that there was a reorg? This is where I'm largely guessing:

To check whether there was a reorg, the synchronizer takes the last L1 block it processed, fetches the block with the same number from the L1, and compares that hash to the hash of the block it actually processed. The hash it compares against comes from Etherman when it first fetches L1 events. But note that Etherman fetches all the events at once, in a single RPC call, and only later, while processing them, does it fetch the full details of the L1 blocks that contained events, which is where that hash comes from.
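
For illustration, here is a minimal Go sketch of that fetch pattern using go-ethereum's ethclient. The function and type names are hypothetical, not the actual Etherman code; the point is the order of operations, one eth_getLogs call first, per-block header fetches later:

```go
// Illustrative sketch only (hypothetical names, not the real Etherman code).
// It shows the gap between the single eth_getLogs call and the later
// per-block header fetches where a reorg can slip in unnoticed.
package etherman

import (
	"context"
	"math/big"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"
)

type blockEvents struct {
	BlockNumber uint64
	BlockHash   common.Hash // hash later used by the synchronizer's reorg check
	Logs        []types.Log
}

func fetchEvents(ctx context.Context, client *ethclient.Client, from, to uint64, contract common.Address) ([]blockEvents, error) {
	// Step 1: all events in the range come back from a single eth_getLogs call.
	logs, err := client.FilterLogs(ctx, ethereum.FilterQuery{
		FromBlock: new(big.Int).SetUint64(from),
		ToBlock:   new(big.Int).SetUint64(to),
		Addresses: []common.Address{contract},
	})
	if err != nil {
		return nil, err
	}

	// Group the logs by the block they were emitted in.
	byBlock := map[uint64][]types.Log{}
	for _, l := range logs {
		byBlock[l.BlockNumber] = append(byBlock[l.BlockNumber], l)
	}

	// Step 2: only now, while processing, is each block fetched to get its
	// hash. If the block was reorged between step 1 and step 2, this hash
	// belongs to the replacement block, while the logs came from the old one.
	var out []blockEvents
	for num, blockLogs := range byBlock {
		header, err := client.HeaderByNumber(ctx, new(big.Int).SetUint64(num))
		if err != nil {
			return nil, err
		}
		out = append(out, blockEvents{BlockNumber: num, BlockHash: header.Hash(), Logs: blockLogs})
	}
	return out, nil
}
```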

So, we could have this sequence:

1. The commitment task sends transaction 1; it gets included in block 4477007.
2. Etherman fetches logs from block 4477007 and gets one event.
3. The commitment task sends transaction 2.
4. A reorg causes block 4477007 to be replaced by a block containing both commitment transactions and both corresponding events.
5. Etherman fetches block 4477007 and gets the hash of the new block.

This way the node only ever sees one event, but it gets the hash of the block from after the reorg, so the next time it checks for a reorg, it looks like there hasn't been one.
I couldn't actually find any hard evidence that there was a reorg, but I think it's a reasonable theory. First, the fact that two updates from the commitment task ended up in the same L1 block is very strange, because the commitment task waits for each transaction to be included in a block before building more transactions. So I think the only way this can happen is if there was a reorg: transaction 1 was included in block A, the task then submitted transaction 2, and block A was reorged out by a different miner who had picked up transaction 1 and now also transaction 2 from the mempool.

I don't see any recent reorgs on Etherscan, but not every client is guaranteed to see every reorg, so it may just have missed it. I do notice some weirdness in the timestamps around the block in question (the double transaction was in block 4477007):

4477005: 03:49:24
4477006: 03:49:36 (+12)
4477007: 03:50:00 (+24)
4477008: 03:50:24 (+24)
4477009: 03:50:36 (+12)

So, just before and after the anomaly, timestamps are increasing by 12s, as they should. But the affected block and the one right after it are each 24s after their predecessors. This is at least indicative of something weird going on with the chain.

I think a good way to test this is to just reset the (regular, non-preconf) zkevm node with a clean database and let it sync again. There will be no reorgs that affect this prefix of the chain (which is thousands of blocks deep at this point) while it is syncing, so if the issue was caused by a reorg, it should sync successfully. The thing to look for would be if it successfully gets past L1 block 4477008, HotShot block 250034.

If this is the problem, the solution would be to check the hash of the full block, when we fetch it, against the block hash in the event log we processed. If they don't match, there has been a reorg; we can simply return an error, which will trigger the synchronizer to retry, and the next time we query events we will get the state from after the reorg.
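
A sketch of what that check could look like, again using go-ethereum's ethclient with hypothetical names rather than the actual node code:

```go
// Hypothetical sketch of the proposed fix, not the actual zkevm node code.
// The block hash recorded in each log (from eth_getLogs) is compared against
// the hash of the block fetched afterwards; a mismatch means a reorg happened
// in between, and returning an error makes the synchronizer retry.
package etherman

import (
	"context"
	"fmt"
	"math/big"

	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"
)

// headerCheckedAgainstLogs assumes blockLogs is non-empty and all logs are
// from the same block number.
func headerCheckedAgainstLogs(ctx context.Context, client *ethclient.Client, blockLogs []types.Log) (*types.Header, error) {
	num := new(big.Int).SetUint64(blockLogs[0].BlockNumber)
	header, err := client.HeaderByNumber(ctx, num)
	if err != nil {
		return nil, err
	}
	for _, l := range blockLogs {
		if l.BlockHash != header.Hash() {
			// The logs were emitted by a block that has since been replaced.
			// On retry, eth_getLogs will return the post-reorg events,
			// including any that were added to the replacement block.
			return nil, fmt.Errorf("block %d reorged: logs have hash %s, fetched block has hash %s",
				l.BlockNumber, l.BlockHash, header.Hash())
		}
	}
	return header, nil
}
```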
