How to include ALL urls in a Crawl output? #11

Open
wilhere opened this issue Aug 1, 2024 · 1 comment

Comments

wilhere commented Aug 1, 2024

Hi, I've been testing out the CLI version of the tool and am absolutely loving it so far. It's a very solidly put-together tool with great performance and informative output.

I have been trying to tweak option flags to see if I can get a particular type of behavior during the crawl and in the generated results.

What I would like is for the tool to observe and report ANY other URL it encounters on the target website, regardless of whether it belongs to the same site/TLD. In other words, I would like all external links to other sites, and even IP addresses, to at least be observed and included in one of the output files.

Is this possible with any regex or argument at the moment? I'm not necessarily talking about visiting and crawling all those individual URLs. In this scenario, would they perhaps end up in the sitemap?

Thanks.

janreges (Owner) commented Aug 12, 2024

Hi @wilhere,

You have the option to specify which domains external files may be crawled from; see the parameter --allowed-domain-for-external-files.

Often, especially when using the offline website generation feature, it is convenient to set --allowed-domain-for-external-files=*, which ensures that all external JS/CSS/fonts, images or documents are also crawled/downloaded.
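
As a rough illustration, such a crawl could be invoked like the sketch below (the ./crawler binary name and the --url parameter are assumptions made for the example; only --allowed-domain-for-external-files comes from the explanation above):

```bash
# Crawl the site and also download external JS/CSS/fonts/images/documents
# from any domain (wildcard). The binary name and --url are illustrative.
./crawler --url=https://example.com/ \
          --allowed-domain-for-external-files='*'
```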

It is also possible to use the parameter --allowed-domain-for-crawling, where you can list all the domains that should be crawled when a link to them is found. You can also use the wildcard --allowed-domain-for-crawling=*, but then the crawler will start with your website, gradually move on to other linked domains, and so on ad infinitum.
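
For example, allowing the crawler to follow links into one additional domain might look roughly like this (again, ./crawler and --url are assumed for illustration):

```bash
# Crawl example.com, and also crawl pages on blog.example.com
# when links to that domain are found during the crawl.
./crawler --url=https://example.com/ \
          --allowed-domain-for-crawling=blog.example.com
```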

Unfortunately, at the moment there is no parameter that says "if you find a link to an HTML page on another website, include that page in the crawl, but do not crawl further pages found in its HTML code".

In the next few days, I will try to think about how the crawler could be extended so that this behavior can be configured via parameters.
