Reimplement safe_url_string based on the URL living standard #204
Two points about Cython: first, I wonder whether it's guaranteed to give comparable performance, or is it just an idea to try? Second, this would make w3lib a binary module, though both parsel and Scrapy already require lxml, which is even more complicated.
I think it might be reasonable to move the URL-related implementation to a separate package, to simplify contributing to the rest of w3lib: working on w3lib itself would then not require having Cython installed. URL parsing according to the living standard is a very well-defined scope, so I think it could make sense to have it separate even if it's pure Python. It would still be a binary dependency, but wheels work well these days, and unlike lxml there are no complications like the libxml2 dependency (though we might think about wrapping some external library; I haven't looked into it).
I'd say it's guaranteed that we can get comparable or better performance: with Cython you can go as close to C/C++ as you want.
https://github.com/mypyc/mypyc could be another option besides Cython; it may allow keeping the code Python-compatible (though a separate package and a binary wheel would still be worthwhile).
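To illustrate the mypyc-friendly style being suggested, here is a minimal, hypothetical sketch (not actual w3lib code): a character-classification hot loop of the kind a `safe_url_string` reimplementation spends its time in, written as plain, fully type-annotated Python. Code in this style runs unchanged on CPython and is also what mypyc compiles; the same structure is a natural starting point for a Cython port.

```python
# Hypothetical hot loop, not the actual w3lib implementation.
# The safe-character set below is illustrative, not the one the
# URL living standard defines for any particular URL component.
_SAFE: frozenset = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~:/?#[]@!$&'()*+,;=%"
)

def percent_encode_unsafe(text: str) -> str:
    """Percent-encode (as UTF-8) every character not in the safe set."""
    out: list = []
    for ch in text:
        if ch in _SAFE:
            out.append(ch)
        else:
            # Encode the code point as UTF-8 and emit one %XX per byte.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

For example, `percent_encode_unsafe("a b")` yields `"a%20b"`, and `"café"` becomes `"caf%C3%A9"`. Keeping the loop free of dynamic tricks (no `**kwargs`, no monkey-patching) is what lets mypyc or Cython translate it into tight C code.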
#201 implemented an alternative to `safe_url_string` based on the URL living standard, while maintaining support for older standards still used in the wild, for maximum safety of the resulting URL. However, when we compared the performance of the new implementation with the old one, the new implementation was up to 200 times slower, and after some effort I only managed to reduce the gap to 40 times. 40 times slower is still unacceptable.

This task is about resuming the work started in #201 and closing the performance gap, i.e. achieving performance very similar to, or better than, that of the current `safe_url_string` implementation. How to do that? At the moment we believe Cython is the way to go: rework the new implementation as Cython code, which will hopefully solve the performance issues.
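To keep the "up to 200×, reduced to 40×" comparison reproducible while the rework proceeds, a micro-benchmark along these lines could be used. This is a hedged sketch: `old_impl` and `new_impl` are placeholder names standing in for the current `safe_url_string` and the candidate reimplementation, not actual w3lib internals, and here both are backed by `urllib.parse.quote` just so the script is self-contained.

```python
# Hypothetical benchmark harness; swap the two stand-ins for the real
# implementations when measuring the actual gap.
import timeit
from urllib.parse import quote

URL = "https://example.com/path/to page?q=caf\u00e9&x=1"

def old_impl(url: str) -> str:
    # Stand-in for the current safe_url_string.
    return quote(url, safe="/:?=&%")

def new_impl(url: str) -> str:
    # Stand-in for the living-standard-based reimplementation.
    return quote(url, safe="/:?=&%")

def bench() -> None:
    for name, impl in (("old", old_impl), ("new", new_impl)):
        # Sanity check: both must agree on the output before timing.
        assert impl(URL) == old_impl(URL)
        elapsed = timeit.timeit(lambda: impl(URL), number=10_000)
        print(f"{name}: {elapsed:.4f}s for 10,000 calls")

if __name__ == "__main__":
    bench()
```

Running this against a representative corpus of real-world URLs (not just one), and asserting output equality before timing, would make regressions in either speed or behavior visible as the Cython rework lands.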