How to include ALL urls in a Crawl output? #11

Open
wilhere opened this issue Aug 1, 2024 · 1 comment

Comments

wilhere commented Aug 1, 2024

Hi, I've been testing out the CLI version of the tool and am absolutely loving it so far. It's a very solidly put-together tool with great performance and informative output.

I have been trying to tweak option flags to see if I can get a particular type of behavior during the crawl and in the generated results.

What I would like is for the tool to observe and report ANY other URL it encounters on the target website, regardless of whether it belongs to the same site/TLD. In other words, I would like all external links to other sites, and even IP addresses, to at least be observed and included in one of the output files.

Is this possible with any regex or argument at the moment? I'm not necessarily talking about visiting and crawling all those individual URLs. In this scenario, would they perhaps end up in the sitemap?

Thanks.

janreges (Owner) commented Aug 12, 2024

Hi @wilhere,

You have the option to specify which domains external files may be crawled from; see the parameter --allowed-domain-for-external-files.

Often, especially when using the offline website generation feature, it is convenient to set --allowed-domain-for-external-files=*, which ensures that all external JS/CSS/fonts, images or documents are also crawled/downloaded.
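
As a rough illustration, such a crawl could be invoked like the sketch below (the ./crawler binary name and the --url parameter are assumptions made for the example; only --allowed-domain-for-external-files comes from the explanation above):

```bash
# Crawl the site and also download external JS/CSS/fonts/images/documents
# from any domain (wildcard). The binary name and --url are illustrative.
./crawler --url=https://example.com/ \
          --allowed-domain-for-external-files='*'
```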

It is also possible to use the parameter --allowed-domain-for-crawling, where you can list all the domains that should be crawled when a link to them is found. You can also use the wildcard --allowed-domain-for-crawling=*, but then the crawler will start with your website, gradually move on to other linked domains, and so on ad infinitum.
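
For example, allowing the crawler to follow links into one additional domain might look roughly like this (again, ./crawler and --url are assumed for illustration):

```bash
# Crawl example.com, and also crawl pages on blog.example.com
# when links to that domain are found during the crawl.
./crawler --url=https://example.com/ \
          --allowed-domain-for-crawling=blog.example.com
```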

Unfortunately, at the moment there is no parameter that says "if you find a link to an HTML page on another website, include that page in the crawl, but do not crawl further pages found in its HTML code".

In the next few days, I will try to think about how the crawler could be extended so that this behavior can be configured via parameters.
