
RSK stuck in handling debug_traceTransaction RPC calls #1692

Closed
gofman8 opened this issue Jan 13, 2022 · 10 comments

@gofman8

gofman8 commented Jan 13, 2022

The RSK node gets stuck while processing debug_traceTransaction RPC calls. Example request:

curl -s -X POST -H "Content-Type: application/json" -d '{ "jsonrpc": "2.0", "id": 1772093265, "method": "debug_traceTransaction", "params": [ "0xdb76b62bae3896c47fba80037f0434f3cd456218b54fc963f85e15ca0c111def", { "disableStorage": true, "disableMemory": false, "disableStack": false, "fullStorage": false } ], "origin": "http://remix.ethereum.org", "tabId": 130 }' https://mainnetdebug.sovryn.app >> /dev/null

This request works fine for some time (from a few hours to a few days), but eventually requests get stuck waiting for a response from the RSK node on port 4444 (RPC). Each stuck request leaves a non-closed TCP session on the 4444 socket.
Output of netstat -anp | grep :4444:
[screenshot: netstat output showing connections on port 4444 in CLOSE_WAIT state]

During this time all other requests work fine (eth_blockNumber, web3_clientVersion, etc.). But when the number of open connections in the CLOSE_WAIT state reaches 28, RSK stops processing any RPC requests:
[screenshot: RPC requests failing once 28 connections are in CLOSE_WAIT]
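
For anyone else debugging this, here is a minimal monitoring sketch (not part of the original report) that logs the CLOSE_WAIT count on the RPC port over time; port 4444 and the 28-connection ceiling come from the observations above, the interval is an arbitrary choice:

# Hedged sketch: log the number of CLOSE_WAIT sessions on the RPC port every
# 30 seconds, assuming the node listens on 4444 as in this report.
while sleep 30; do
  count=$(netstat -an | grep ':4444' | grep -c CLOSE_WAIT)
  echo "$(date -Is) CLOSE_WAIT on :4444 = $count"
done

Watching this counter climb toward 28 should give a timestamp for when the RPC stops responding.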

So it looks like the RSK node stops closing the socket properly while processing the debug_traceTransaction RPC call, which at some point results in a fully non-functional RPC interface.

After a restart, the RSK node is able to process RPC calls again.

With TRACE logging enabled for the "execute,blockexecutor,web3,jsonrpc" loggers, we collected logs for a whole day and found that sometimes, while processing debug_traceTransaction, the node gets stuck between the trace messages "Go transaction 390" and "Finalize transaction 390".

Log file for 2022-01-11 attached:
rsklog.tar.gz

cat rsk.log.txt | grep "Go transaction 390"  | wc -l 
833
cat rsk.log.txt | grep "Finalize transaction 390"  | wc -l 
737
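
To narrow down which executions never complete, here is a rough shell sketch (ours, not from the original logs) that counts, per transaction hash, how many "Go transaction" traces appear versus how many "Finalize transaction" traces; it assumes the hash is the last field of each trace line, based on the logger.trace call quoted below.

# Hypothetical sketch, assuming the transaction hash is the last whitespace-
# separated field of the "Go transaction ..." and "Finalize transaction ..."
# trace lines: count starts and finalizations per hash.
grep "Go transaction" rsk.log.txt | awk '{print $NF}' | sort | uniq -c | sort -rn | head
grep "Finalize transaction" rsk.log.txt | awk '{print $NF}' | sort | uniq -c | sort -rn | head

A hash whose "Go transaction" count is higher than its "Finalize transaction" count is one whose execution started but never finished.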

So the last trace is here:

logger.trace("Go transaction {} {}", toBI(tx.getNonce()), tx.getHash());

Then the RPC call is closed by a timeout on the client side, and the socket on port 4444 (RPC) keeps the connection in the CLOSE_WAIT TCP state.

Reproducing this issue is quite easy but takes some time: repeatedly execute the debug_traceTransaction RPC call or use the Debugger in Remix.

Platform: AWS EC2
OS: Ubuntu 18.04, kernel 5.4.0-1059-aws
Instance type: m5.xlarge
RSK Network: Mainnet
RSK node configs:
logback.xml.txt
node.conf.txt

Thanks in advance and lemme know if I can assist somehow.

@shlumper

It would be nice to know which Java classes we should enable debug-level logging for, to get more information in the logs about this.

@nagarev
Contributor

nagarev commented Jan 24, 2022

Hi guys! I'm seeing that you're setting some tracing options on your request:

{ "disableStorage": true, "disableMemory": false, "disableStack": false, "fullStorage": false }

For now, these options are being ignored, but we have a PR that will be merged in the near future that adds support for the following tracing options: disableStorage, disableMemory and disableStack. I think this could be a solution: given that you're disabling storage, a smaller response might help avoid these hangups.

Besides all of that, I will do a couple of tests on my side, and will keep you posted on my findings.

Thanks for reporting this!

@nagarev
Contributor

nagarev commented Jan 27, 2022

Hi guys! I have a couple of questions about the kind of processing you do, to help me reproduce the problem:

  • I think that you're performing the request for different transactions... does this process finish at some point, or is the idea for it to continue indefinitely?
  • Are you doing any processing on each response? If so, do you wait for that processing to finish before taking another transaction and making a new request? How long does that processing take (on average)? If there's no processing, do you have an estimate of how much time passes between two requests?
  • Regarding "Reproducing this issue is quite easy but takes some time: repeatedly execute the debug_traceTransaction RPC call or use the Debugger in Remix.", can you explain a little how this test is performed? Do you use the same transaction for each request and repeat? If not, does the processing always stop on the same transaction?

Thanks a lot in advance!

@gofman8
Author

gofman8 commented Jan 28, 2022

Hi! Many thanks for your reply!

  • Regarding "Reproducing this issue is quite easy but takes some time: repeatedly execute the debug_traceTransaction RPC call or use the Debugger in Remix.", can you explain a little how this test is performed? Do you use the same transaction for each request and repeat? If not, does the processing always stop on the same transaction?

To reproduce this issue it's enough to repeat the same request. It gets stuck within a few hours or days; I didn't find any pattern here.

  • I think that you're performing the request for different transactions... does this process finish at some point, or is the idea for it to continue indefinitely?

No, the same transaction. It just continues indefinitely (HTTP timeout error).

  • Are you doing any processing on each response? If so, do you wait for that processing to finish before taking another transaction and making a new request? How long does that processing take (on average)? If there's no processing, do you have an estimate of how much time passes between two requests?

We have tried repeating the same request every 30 seconds and also once per second; in either case the RSK node got stuck after some time.
You can use a simple script to reproduce this issue:
while sleep 1; do curl -s -X POST -H "Content-Type: application/json" -d '{ "jsonrpc": "2.0", "id": 1772093265, "method": "debug_traceTransaction", "params": [ "0xdb76b62bae3896c47fba80037f0434f3cd456218b54fc963f85e15ca0c111def", { "disableStorage": true, "disableMemory": false, "disableStack": false, "fullStorage": false } ], "origin": "http://remix.ethereum.org", "tabId": 130 }' https://mainnetdebug.sovryn.app >> /dev/null; done
and just leave the RSK node running for one day; I'm sure that will be enough time to catch this behaviour.
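
For what it's worth, here is a variant of that loop (our own sketch, same endpoint and payload as above) that aborts each call after 60 seconds with curl's --max-time and logs failures, so the first hang gets a timestamp:

# Same request and payload as the loop above, but each call is aborted after
# 60 seconds and failures are logged with a timestamp, so the moment the node
# starts hanging is recorded in repro.log.
while sleep 1; do
  curl -s --max-time 60 -X POST -H "Content-Type: application/json" -d '{ "jsonrpc": "2.0", "id": 1772093265, "method": "debug_traceTransaction", "params": [ "0xdb76b62bae3896c47fba80037f0434f3cd456218b54fc963f85e15ca0c111def", { "disableStorage": true, "disableMemory": false, "disableStack": false, "fullStorage": false } ], "origin": "http://remix.ethereum.org", "tabId": 130 }' https://mainnetdebug.sovryn.app >> /dev/null \
    || echo "$(date -Is) debug_traceTransaction timed out or failed (curl exit $?)" >> repro.log
done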

RSK version v3.1.0

@shlumper

Hi nagarev,
any update on this?
thanks

@shlumper

maybe today :) ?

@nagarev
Contributor

nagarev commented Mar 3, 2022

Hi guys! Sorry for the delay! Yes, I've been running different tests and found a couple of insights, but I'm trying to figure it out completely before drawing any conclusions. In the meantime, have you performed any other tests on your side?

@shlumper

No, just the one described above. Thanks

@gofman8
Author

gofman8 commented Mar 25, 2022

Hi @nagarev !
We discovered that the same issue occurs even after restarting the RSK process...
web3.eth.getBlockNumber calls start working again, but debug trace calls do not; they are stuck with exactly the same behavior...

@nagarev
Contributor

nagarev commented May 12, 2022

Hi guys! sorry for the long delay!

I couldn't reproduce the issue, but I've done some research; let me show you my results and conclusions:

First of all, I checked the response size to see how it changes depending on which filters are set (yes, we now support some filters for the debug_traceTransaction and debug_traceBlock methods).

These are the results:

Request Type    Size (MB)
Full Response   315.5
Only Memory     3.2
Only Stack      5.0
Only Storage    308.0

So you can see that storage is the main contributor to the response size (308.0 / 315.5 ≈ 97.6% of it).

Knowing this, I performed some stress tests on the node (basically, I ran this request alongside other really heavy ones over and over, and let it run for a couple of hours) but couldn't observe any hangups (this was without using the filters).

Maybe you can try to reproduce it again using the new version (that will use the filters you're setting) and see what happens.
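
As a quick sanity check against a node running the new version, one could compare the downloaded response size with and without the storage filter using curl's %{size_download} write-out. This is a rough sketch, not from the thread; it reuses the endpoint and transaction hash from the report and assumes the node accepts a partial or empty options object:

# Hypothetical check: print only the downloaded response size in bytes, first
# with storage disabled and then with an empty options object, to confirm the
# new filter support actually shrinks the trace.
for opts in '{ "disableStorage": true }' '{ }'; do
  echo "options: $opts"
  curl -s -o /dev/null -w "size_download: %{size_download} bytes\n" \
    -X POST -H "Content-Type: application/json" \
    -d '{ "jsonrpc": "2.0", "id": 1, "method": "debug_traceTransaction", "params": [ "0xdb76b62bae3896c47fba80037f0434f3cd456218b54fc963f85e15ca0c111def", '"$opts"' ] }' \
    https://mainnetdebug.sovryn.app
done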

I don't think I've hit that last problem you reported regarding getBlockNumber. Does that method start to fail when the other one is stuck?

@gofman8 gofman8 closed this as completed Jan 19, 2024