
Define a maximum parsing time out again #2112

Open
lfcnassif opened this issue Mar 4, 2024 · 1 comment

Comments

@lfcnassif
Member

lfcnassif commented Mar 4, 2024

The current timeout control checks parsing progress: if there is progress (chars written to the ContentHandler), the timeout counter is reset. This was designed to handle huge files that may take a long time to parse, where the client thread can see progress from the parsing thread.
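To make the described behavior concrete, here is a minimal sketch of a progress-resetting timeout. The class `ProgressTimeout`, its method names, and the injectable clock are hypothetical and chosen for illustration only; this is not the project's actual implementation.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of a timeout that is reset whenever the parser makes progress.
final class ProgressTimeout {
    private final long timeoutMillis;
    private final LongSupplier clock; // injectable so the logic can be tested without sleeping
    private long lastProgressAt;

    ProgressTimeout(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastProgressAt = clock.getAsLong();
    }

    // Called from the parsing thread each time chars are written to the content handler.
    synchronized void onCharsWritten(int count) {
        if (count > 0) {
            lastProgressAt = clock.getAsLong();
        }
    }

    // Called from the client thread: true only if no progress was seen for longer than the timeout.
    synchronized boolean isTimedOut() {
        return clock.getAsLong() - lastProgressAt > timeoutMillis;
    }
}
```

In real use the clock would be something like `System::currentTimeMillis`. Note that, as long as `onCharsWritten` keeps being called, `isTimedOut` never returns true, no matter how much total time has elapsed.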

Theoretically, a corrupted or maliciously hand-crafted file can cause parser code without proper data validation to enter an infinite loop that writes text to the content handler indefinitely, which would never trigger a timeout. Today, that situation is handled by the ZipBombException check: if the text written to the content handler is much bigger than the data being parsed, a ZipBombException is thrown and parsing is interrupted.
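The output-versus-input check can be sketched roughly as below. The class `OutputRatioGuard`, the threshold and ratio parameters, and the use of `IllegalStateException` are all assumptions made for this example; the real check and its exception type may differ.

```java
// Hypothetical sketch of an output/input ratio guard, not the actual ZipBombException logic.
final class OutputRatioGuard {
    private final long inputBytes;     // size of the data being parsed
    private final long minOutputChars; // below this output size the ratio is not checked
    private final long maxRatio;       // allowed output chars per input byte, must be > 0
    private long outputChars;

    OutputRatioGuard(long inputBytes, long minOutputChars, long maxRatio) {
        if (maxRatio <= 0) {
            throw new IllegalArgumentException("maxRatio must be positive");
        }
        this.inputBytes = inputBytes;
        this.minOutputChars = minOutputChars;
        this.maxRatio = maxRatio;
    }

    // Called each time chars are written; throws once the output is suspiciously large.
    void onCharsWritten(int count) {
        outputChars += count;
        // Division instead of multiplication avoids overflow for very large inputs.
        if (outputChars > minOutputChars && outputChars / maxRatio > inputBytes) {
            throw new IllegalStateException("Output is more than " + maxRatio
                    + "x the input size, parsing aborted");
        }
    }
}
```

A guard like this stops a loop that emits text quickly, but it does nothing for a parser that emits text very slowly, which is the case discussed next.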

But there are rare situations where parsing is progressing, but very slowly, like the one optimized in #2084 (4 days to parse a small RAR file). Another example is when a file of several GBs is wrongly detected as HTML: HTMLParser is very slow and can take days to finish.

So I think it would be worthwhile to check again for a total maximum parse time for each file, proportional to file size. What would be a reasonable waiting time per MB?
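A size-proportional limit could be computed along these lines. The class `ParseBudget`, the fixed base allowance, and the per-MB parameter are hypothetical; the actual per-MB value is exactly the open question above, so none is suggested here.

```java
// Hypothetical sketch of a total parse time budget proportional to file size.
final class ParseBudget {
    private static final long MB = 1024L * 1024L;

    private ParseBudget() {
    }

    // Returns baseMillis plus millisPerMb for each started MB of the file.
    static long maxParseMillis(long fileSizeBytes, long baseMillis, long millisPerMb) {
        long size = Math.max(0L, fileSizeBytes);
        long megabytes = size / MB + (size % MB == 0 ? 0 : 1); // round up without overflow
        return baseMillis + megabytes * millisPerMb;
    }
}
```

The base allowance is an assumption added so that tiny files still get a usable minimum; it could be dropped if a pure per-MB limit is preferred.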

With the new maximum parse timeout, maybe we can decrease the current progress timeout. One important thing to keep in mind is that, for some complex formats or large files, it usually takes some time until the parser starts to output results. For example, large PST or OST files take time to be copied to the temp folder, and then to parse the mailbox index structures. So we can't decrease the current progress timeout that much, or we might create a third timeout counter to monitor the initial parsing phase, until it begins to output results...
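The three-counter idea (startup, progress, total) might look like the sketch below. The class `ThreePhaseTimeout`, the `Reason` enum, and all parameter names are hypothetical and only illustrate how the counters could interact.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch combining a startup timeout, a progress timeout, and a total timeout.
final class ThreePhaseTimeout {
    enum Reason { NONE, NO_FIRST_OUTPUT, STALLED, TOTAL_EXCEEDED }

    private final long startupMillis;  // allowed time before the first output
    private final long progressMillis; // allowed gap between outputs afterwards
    private final long totalMillis;    // hard limit, never reset
    private final LongSupplier clock;
    private final long startedAt;
    private boolean sawOutput;
    private long lastProgressAt;

    ThreePhaseTimeout(long startupMillis, long progressMillis, long totalMillis, LongSupplier clock) {
        this.startupMillis = startupMillis;
        this.progressMillis = progressMillis;
        this.totalMillis = totalMillis;
        this.clock = clock;
        this.startedAt = clock.getAsLong();
    }

    // Called from the parsing thread when chars are written to the content handler.
    synchronized void onCharsWritten(int count) {
        if (count > 0) {
            sawOutput = true;
            lastProgressAt = clock.getAsLong();
        }
    }

    // Called from the client thread; reports which limit, if any, was exceeded.
    synchronized Reason check() {
        long now = clock.getAsLong();
        if (now - startedAt > totalMillis) {
            return Reason.TOTAL_EXCEEDED;
        }
        if (!sawOutput) {
            return now - startedAt > startupMillis ? Reason.NO_FIRST_OUTPUT : Reason.NONE;
        }
        return now - lastProgressAt > progressMillis ? Reason.STALLED : Reason.NONE;
    }
}
```

With this split, the startup limit can stay generous for formats like PST or OST while the progress limit is tightened, and the total limit catches parsers that keep progressing very slowly.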

@lfcnassif
Member Author

lfcnassif commented Mar 4, 2024

Or we can just stop resetting the timeout counter and increase it a bit, so it would behave like a total maximum timeout...

PS: The problem is that parsing time depends on CPU speed...
