Define a maximum parsing time out again #2112

lfcnassif · 2024-03-04T19:43:40Z

The current timeout control actually checks parsing progress: if there is some progress (chars written to the ContentHandler), the timeout counter is reset. This was designed to handle huge files that may take a long time to parse, as long as the client thread can see progress from the parsing thread.
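To make the behaviour described above concrete, here is a minimal sketch of a progress-based timeout. The class and method names (`ProgressTimeout`, `isExpired`) are hypothetical and chosen for illustration only; this is not the project's actual code. Time is passed in explicitly so the logic is easy to follow.

```java
/**
 * Hypothetical sketch: a timeout whose counter is reset whenever the
 * number of chars written to the content handler has grown.
 */
final class ProgressTimeout {
    private final long timeoutMillis;
    private long lastProgressMillis;
    private long lastSeenChars;

    ProgressTimeout(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = startMillis;
        this.lastSeenChars = 0;
    }

    /**
     * Polled by the client thread with the total chars written so far.
     * Returns true only if no progress was seen for longer than the timeout.
     */
    boolean isExpired(long charsWritten, long nowMillis) {
        if (charsWritten > lastSeenChars) {
            // Progress observed: remember it and reset the counter.
            lastSeenChars = charsWritten;
            lastProgressMillis = nowMillis;
            return false;
        }
        return nowMillis - lastProgressMillis > timeoutMillis;
    }
}
```

Note that a parser emitting even one char per polling interval never expires under this scheme, which is the gap discussed in the rest of this issue.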

Theoretically, a corrupted or maliciously hand-crafted file can cause parser code without proper data validation to enter an infinite loop that writes text to the content handler indefinitely, which would never trigger a timeout. Today that situation is handled by the ZipBombException check: if the text written to the content handler is much bigger than the data being parsed, a ZipBombException is thrown and parsing is interrupted.
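The output-versus-input check can be sketched roughly as below. This is an illustration of the idea only: `OutputRatioGuard`, `minChars` and `maxRatio` are hypothetical names and parameters, and the real ZipBombException check may use different thresholds and bookkeeping.

```java
/**
 * Hypothetical sketch of an output/input ratio guard.
 * All names and thresholds here are assumptions for illustration.
 */
final class OutputRatioGuard {
    private OutputRatioGuard() {
    }

    /**
     * Returns true if the text written is suspiciously large compared
     * to the input consumed so far.
     *
     * @param inputBytesRead bytes consumed from the file being parsed
     * @param charsWritten   chars written to the content handler
     * @param minChars       ignore small outputs below this size
     * @param maxRatio       maximum tolerated chars-per-input-byte ratio
     */
    static boolean looksLikeBomb(long inputBytesRead, long charsWritten,
                                 long minChars, long maxRatio) {
        if (charsWritten < minChars) {
            return false; // too little output to judge
        }
        // Treat an empty input as one byte to avoid a zero limit.
        long limit = Math.max(1, inputBytesRead) * maxRatio;
        return charsWritten > limit;
    }
}
```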

But there are rare situations where parsing is progressing, only very slowly, like the one optimized in #2084 (4 days to parse a small RAR file). Another example is when a file of several GBs is wrongly detected as HTML: HTMLParser is very slow and can take days to finish.

So I think it would be worthwhile to check again for a total maximum parse time for each file, proportional to file size. What would be a reasonable waiting time per MB?
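A size-proportional limit could be computed along these lines. This is only a sketch: `SizeBasedTimeout`, `baseMillis` and `millisPerMB` are hypothetical, and the per-MB value is deliberately left as a parameter because it is exactly the open question above.

```java
/**
 * Hypothetical sketch: total maximum parse time that grows with file size.
 * baseMillis and millisPerMB are placeholders, not proposed values.
 */
final class SizeBasedTimeout {
    private static final long MB = 1024L * 1024L;

    private SizeBasedTimeout() {
    }

    /**
     * Returns a fixed base allowance plus millisPerMB for every
     * started MB of the file.
     */
    static long maxParseMillis(long fileSizeBytes, long baseMillis, long millisPerMB) {
        long sizeMB = (fileSizeBytes + MB - 1) / MB; // round up to whole MB
        return baseMillis + sizeMB * millisPerMB;
    }
}
```

The fixed base term keeps small files from getting an unrealistically short limit, since some parsing cost does not depend on file size.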

With the new maximum parse timeout, maybe we can decrease the current progress timeout. One important thing to keep in mind is that, for some complex formats or large files, it usually takes some time until the parser starts to output results. For example, large PST or OST files take time to be copied to the temp folder, and then to parse the mailbox index structures. So we can't decrease the current progress timeout that much, or we might create a third timeout counter to monitor the initial parsing phase, until it begins to output results...
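The three counters mentioned above (initial phase, progress, total) could be combined roughly as follows. This is a sketch under assumptions: `ParseWatchdog` and its `Verdict` values are hypothetical names, and the limits are plain parameters rather than suggested values.

```java
/**
 * Hypothetical sketch of a watchdog with three limits:
 * - initialMillis: allowed time before the first output appears
 * - progressMillis: allowed idle time between outputs afterwards
 * - totalMillis: hard cap on the whole parse, never reset
 */
final class ParseWatchdog {
    enum Verdict { OK, INITIAL_TIMEOUT, PROGRESS_TIMEOUT, TOTAL_TIMEOUT }

    private final long initialMillis;
    private final long progressMillis;
    private final long totalMillis;
    private final long startMillis;
    private long lastProgressMillis;
    private long lastSeenChars;

    ParseWatchdog(long initialMillis, long progressMillis, long totalMillis, long startMillis) {
        this.initialMillis = initialMillis;
        this.progressMillis = progressMillis;
        this.totalMillis = totalMillis;
        this.startMillis = startMillis;
        this.lastProgressMillis = startMillis;
    }

    /** Polled by the client thread with the total chars written so far. */
    Verdict check(long charsWritten, long nowMillis) {
        // The total limit applies even while the parser is making progress.
        if (nowMillis - startMillis > totalMillis) {
            return Verdict.TOTAL_TIMEOUT;
        }
        if (charsWritten > lastSeenChars) {
            lastSeenChars = charsWritten;
            lastProgressMillis = nowMillis;
            return Verdict.OK;
        }
        long idle = nowMillis - lastProgressMillis;
        if (lastSeenChars == 0) {
            // Nothing written yet: use the longer initial allowance.
            return idle > initialMillis ? Verdict.INITIAL_TIMEOUT : Verdict.OK;
        }
        return idle > progressMillis ? Verdict.PROGRESS_TIMEOUT : Verdict.OK;
    }
}
```

With a separate initial allowance, the progress limit can be made much shorter without penalizing formats that need a long setup before emitting any text.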

lfcnassif · 2024-03-04T21:33:26Z

Or we can just stop resetting the timeout counter and increase it a bit; it would then behave like a total maximum timeout...

PS: The problem is that parsing time depends on CPU speed...

lfcnassif added the enhancement label Mar 4, 2024
