Comparison with well-established crawlers #75

Closed
dandv opened this issue Nov 27, 2023 · 5 comments
Comments

@dandv
Contributor

dandv commented Nov 27, 2023

How exactly is this project different from an established crawler that would just dump the HTML text into the .html field of a JSON array?

It's got 12k stars, but it lacks basic features like canonicalizing links (see #73) or preserving links (#74).
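For readers unfamiliar with the term, link canonicalization here roughly means turning every `href` found on a page into one absolute, comparable URL before it is queued or stored. Below is a minimal sketch, assuming the standard WHATWG `URL` class available in Node.js and browsers. The helper name `canonicalizeLink` is hypothetical and is not taken from this project or from the linked issues; which normalizations a crawler should actually apply is a design choice.

```typescript
// Hypothetical helper for illustration only; not part of the project discussed here.
// Resolves an href against the page URL and normalizes it so that
// equivalent links compare equal as strings.
function canonicalizeLink(href: string, pageUrl: string): string | null {
  let url: URL;
  try {
    // Relative links ("../docs", "/about") are resolved against the page URL.
    url = new URL(href, pageUrl);
  } catch {
    return null; // The href could not be parsed as a URL.
  }

  // Assumption: only web pages are of interest, so skip mailto:, javascript:, etc.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null;
  }

  // Assumption: fragments point into the same document, so drop them.
  url.hash = "";

  // Assumption: parameter order does not matter, so sort the query string.
  url.searchParams.sort();

  return url.toString();
}

// Two differently written links to the same page collapse into one entry.
const seen = new Set<string>();
for (const href of ["../docs/page?b=2&a=1#intro", "/guide/docs/page?a=1&b=2"]) {
  const canonical = canonicalizeLink(href, "https://Example.com/guide/start/");
  if (canonical !== null) {
    seen.add(canonical);
  }
}
```

Sorting query parameters and dropping fragments is not safe for every site (some single-page apps route on the fragment), so a real crawler would probably make these steps configurable.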

@FTAndy

FTAndy commented Nov 28, 2023

Yeah, the idea of the project is great, but it lacks so many features to make it perfect.

@steve8708
Contributor

Absolutely all good if this isn't the right project for you. Work is actively underway to keep improving the project for the custom GPTs use case, and specific feedback and PRs to improve things are always highly appreciated.

@dandv
Contributor Author

dandv commented Nov 28, 2023

Thanks Steve. I understand this is open source, I know how it works. I've made several suggestions already.

I'm simply asking whether it wouldn't be more productive to create an output plugin for an established crawler than to reinvent the crawling wheel, when the only differentiating feature is, if I understand correctly, rather trivial (outputting the bare text extracted from an HTML element to a JSON file).

@steve8708
Contributor

Could you suggest some examples of well-established crawlers you think integration with would be better?

This project is built on crawlee, which is a pretty robust crawler, but I'm certainly open to better alternatives.

@razaanstha

I guess instead of just returning plain text, maybe try something like turndown, so it preserves the links and converts HTML to Markdown in a well-formatted and highly customizable way? You can try the demo here
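To make the difference between plain-text output and link-preserving output concrete, here is a small self-contained sketch. It is a deliberately simplified, regex-based stand-in and not Turndown's implementation or API. The function name `htmlToTextWithLinks` is hypothetical, and a real converter would work on a parsed DOM rather than on raw strings.

```typescript
// Hypothetical, simplified illustration; a real HTML-to-Markdown converter
// parses the document instead of using regular expressions.
function htmlToTextWithLinks(html: string): string {
  return html
    // Rewrite <a href="...">label</a> as a Markdown link, [label](url).
    .replace(
      /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a>/gi,
      (_match: string, href: string, inner: string): string => {
        // Drop any markup nested inside the link label.
        const label = inner.replace(/<[^>]+>/g, "").trim();
        return label ? `[${label}](${href})` : href;
      },
    )
    // Replace every remaining tag with a space, as a plain-text extractor might.
    .replace(/<[^>]+>/g, " ")
    // Collapse the whitespace left behind by removed tags.
    .replace(/\s+/g, " ")
    .trim();
}

const sample =
  '<p>See the <a href="https://example.com/docs">docs <b>here</b></a> for details.</p>';
const converted = htmlToTextWithLinks(sample);
// converted === "See the [docs here](https://example.com/docs) for details."
```

A plain-text extractor would return only "See the docs here for details." for the same input, losing the URL, which is the gap the suggestion above is pointing at.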


4 participants