Allow the RssCrawler to search for an RSS feed from the provided url #285
Thanks for the PR. Could you briefly elaborate on what you mean by "provided URL"? The principal idea of NP (though I'm only now realizing that practical use by others might have changed this) is that a user provides the root or base URLs (cf. the readme.md at the very top, "You only need to provide the root URL of the news website to crawl it completely"). |
Ah! I guess I may have forgotten this idea. While checking the sitemap, I saw this sitemap_patterns |
Thanks for the clarification! What do you mean by "this kind of implementation", i.e., what exactly are you proposing to implement? Just trying to avoid misunderstandings. As a heads-up, I won't be able to work on news-please until the beginning of September, and will get back to this then. |
I mean adding a configuration option like what was done for the sitemap_patterns.
|
cc01461 to c354720
c354720 to 1b341a7
Hello @fhamborg, sorry I'm coming back to you on this after some time. To recap: some websites provide an RSS feed, but it is not listed on the homepage. For example, the website https://www.weka.fr/actualite/ provides an RSS feed, but you cannot find it from https://www.weka.fr/. In this PR we add a new configuration option, where we provide a list of the common pages from which the RSS feed can be found.
The previous example allows searching for an RSS feed from the home page, the /blog page and the /actualite page. If the configuration is missing, it only searches from the homepage, as it currently does. Thanks for your time and input |
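The lookup described in this comment could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `candidate_pages` and `find_rss_link` are hypothetical, the path list is only the one mentioned above, and feed detection is assumed to rely on a `<link type="application/rss+xml">` tag in the page HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _RssLinkParser(HTMLParser):
    """Collects href values of <link> tags that advertise an RSS feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        if (attributes.get("type") or "").lower() == "application/rss+xml":
            href = attributes.get("href")
            if href:
                self.feeds.append(href)


def candidate_pages(base_url, paths=None):
    # Hypothetical helper: with no configured paths, only the homepage is checked.
    paths = paths or ["/"]
    return [urljoin(base_url, path) for path in paths]


def find_rss_link(page_url, html):
    # Returns the first advertised feed URL (made absolute), or None if absent.
    parser = _RssLinkParser()
    parser.feed(html)
    return urljoin(page_url, parser.feeds[0]) if parser.feeds else None
```

A caller would fetch each URL from `candidate_pages("https://www.weka.fr/", ["/", "/blog", "/actualite"])` in turn and stop at the first page for which `find_rss_link` returns a feed.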
Hi @yldoctrine again, thanks for the PR :) Let's get to it now. I'm just wondering, how does this differ from news-please/newsplease/config/config.cfg Line 101 in 04db94c
is identical to
Or did I miss something? |
By identical, do you mean For the sitemap_patterns, the goal was to be able to access the sitemaps directly when they were not listed anywhere (like in robots.txt) by providing a list of common patterns. So should we have So maybe by "identical" you mean they should be "working the same way", like the following
but having a set of common patterns is not the same as retrieving the information directly from the website, because then we might miss properly built RSS feeds that do not have a standard name. Tell me what you think. |
Then perhaps I have misunderstood the purpose of Thanks :) |
Hello @fhamborg ,
Here is an improvement for the RssCrawler. Some websites provide an RSS feed that is only accessible from specific pages and I would like to be able to use the RssCrawler on those sites. The current behaviour is that the RssCrawler looks for the RSS feed from the base page.
For example, the website https://www.weka.fr/actualite/ provides an RSS feed, but it is not accessible from https://www.weka.fr/.
In this pull request, I'm adding a new configuration option
crawl_from_base_domain_by_rss_crawler
that by default keeps the current behaviour of checking for an RSS feed from the base URL and, if set to False, allows searching for an RSS feed from the provided URL.
Tell me if you have any questions.
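The effect of the option can be illustrated with a short Python sketch. It is only an assumption about how such a switch might behave, not the implementation in this PR: the helper name `rss_lookup_url` is hypothetical, and its `crawl_from_base_domain` parameter stands in for `crawl_from_base_domain_by_rss_crawler`.

```python
from urllib.parse import urlsplit


def rss_lookup_url(provided_url, crawl_from_base_domain=True):
    # Hypothetical helper; the flag mirrors crawl_from_base_domain_by_rss_crawler.
    if not crawl_from_base_domain:
        # Look for the RSS feed on the page the user actually provided.
        return provided_url
    # Default: reduce the URL to its base (scheme and host) and look there.
    parts = urlsplit(provided_url)
    return f"{parts.scheme}://{parts.netloc}/"
```

With the default, `https://www.weka.fr/actualite/` is reduced to `https://www.weka.fr/`. With the flag set to False, the provided URL is used unchanged.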