Comparison with well-established crawlers #75
Comments
Yeah, the idea of the project is great, but it lacks many of the features it would need to be perfect.
Absolutely all good if this isn't the right project for you. Work is actively underway to keep improving the project for the custom GPTs use case, and specific feedback and PRs to improve things are always highly appreciated.
Thanks Steve. I understand this is open source, and I know how it works. I've made several suggestions already. I'm simply asking whether it would be more productive to create an output plugin for an established crawler rather than reinvent the crawling wheel, when the only differentiating feature is rather trivial, if I understand correctly (outputting the bare text extracted from an HTML element to a JSON file).
Could you suggest some examples of well-established crawlers you think would be better to integrate with? This project is built on crawlee, which is a pretty robust crawler, but I'm certainly open to better alternatives.
How exactly is this project different from an established crawler that would just dump the HTML text into the .html field of a JSON array?
It's got 12k stars, but it lacks basic features like canonicalizing links (see #73) or preserving links (#74).