Radon struggles to collect statistics from large files #223

Open
Sam152 opened this issue Aug 10, 2021 · 2 comments

Sam152 commented Aug 10, 2021

I have some large files that radon struggles to analyse. I created an example to demonstrate the problem: https://gist.githubusercontent.com/Sam152/50e8ef27cceb899084b42a069237a7b8/raw/bb21870395df86a0062c22353b532b45d31bd3f5/sample.py (~800 lines)

In my case, running radon raw big-package takes 28.38s. The module I'm actually trying to analyse has ~5000 lines, with a similar amount of AST per line.

If I double my 800-line example, the script takes roughly 115.50s to run, about four times as long for twice the input, so my feeling is that something scales worse than O(n) in the size of the file.
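As a rough check on that hunch, the two timings can be turned into an empirical scaling exponent. This is only a sketch: the helper name `scaling_exponent` is made up for illustration, the line counts are the approximate ones quoted in this comment, and two data points give a crude estimate at best.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ n**k from two (input size, run time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Timings reported in this issue: ~800 lines in 28.38s, ~1600 lines in 115.50s.
k = scaling_exponent(800, 28.38, 1600, 115.50)
print(f"estimated exponent: {k:.2f}")
```

An exponent near 1 would point to linear behaviour; these numbers give a value close to 2, which is what roughly quadratic scaling would look like.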

Any pointers if there might be something that can be optimised here, or if the nature of the analysis is such that speeding this process up is simply not possible?

Thanks in advance, if anyone can share their experience.

Cheers,
Sam


On a side note, while researching this issue, I found radon cited in an academic paper, which I thought was interesting and worth sharing (https://arxiv.org/pdf/2007.08978.pdf).

rubik (Owner) commented Aug 26, 2021

Hi Sam, thanks for sharing the example. Indeed, it's quite surprising to see such a long run time for such a simple file.

The raw command is definitely the slowest, because it does not use the ast module to parse the file; it uses tokenize instead. The latter is written in pure Python rather than C, which already slows things down. Moreover, when working with an AST we can use efficient techniques like the visitor pattern, which are not available with the tokenize module.
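To make the contrast concrete, here is a small sketch using only the standard library. It is not Radon's code: the sample source, `token_names`, and `FunctionCounter` are hypothetical names chosen for the example. It shows a flat pass over the token stream next to a visitor walking the parsed tree.

```python
import ast
import io
import tokenize

SOURCE = '''\
def add(a, b):
    # add two numbers
    return a + b
'''


def token_names(source):
    """Return the token type names from a flat pass with tokenize."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


class FunctionCounter(ast.NodeVisitor):
    """Count function definitions using the visitor pattern."""

    def __init__(self):
        self.count = 0

    def visit_FunctionDef(self, node):
        self.count += 1
        self.generic_visit(node)  # keep descending into nested functions


names = token_names(SOURCE)
counter = FunctionCounter()
counter.visit(ast.parse(SOURCE))
print(names)
print(counter.count)
```

Note that the token stream still contains the comment as a COMMENT token, whereas the AST discards comments entirely.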

However, the superlinear complexity is definitely in Radon's code. It performs some complicated operations to count logical lines, and I suspect that's where the slowest code is. I think your example highlights one of the inefficiencies particularly well.

The next step would be to profile the code. A flamegraph should already give some very useful hints. I'll try to investigate this when I've got time.
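A plain cProfile report is a simpler relative of a flamegraph and could serve as a first pass. The sketch below is an assumption about how one might go about it, not an existing Radon tool: `profile_call` and the stand-in workload `count_lines` are hypothetical names. With Radon installed, the function to profile would be whatever entry point the raw command uses (I believe `radon.raw.analyze` takes the source text, but treat that as an assumption to verify).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Stand-in workload (hypothetical); swap in the real analysis function
# and the contents of the slow file.
def count_lines(source):
    return len(source.splitlines())


result, report = profile_call(count_lines, "a = 1\nb = 2\n")
print(report)
```

Running this on the sample file at two different sizes and comparing which functions grow fastest in cumulative time should narrow down where the superlinear behaviour comes from.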

Sam152 (Author) commented Sep 3, 2021

Thanks for the info, that's really helpful!
