Allow the RssCrawler to search for an RSS feed from the provided url #285

yldoctrine · 2024-08-09T13:36:20Z

Here is an improvement for the RssCrawler. Some websites provide an RSS feed that is only accessible from specific pages and I would like to be able to use the RssCrawler on those sites. The current behaviour is that the RssCrawler looks for the RSS feed from the base page.
For example the website https://www.weka.fr/actualite/ provides an RSS feed, but it is not accessible from https://www.weka.fr/

In this pull request, I'm adding a new configuration option crawl_from_base_domain_by_rss_crawler that by default keeps the current behaviour of checking the for an RSS feed from the base url, and if set to False will allow to search for an RSS feed from the provided url

Tell me if you have any question

fhamborg · 2024-08-12T12:20:13Z

Thanks for the PR. Could you briefly elaborate on what you mean by "provided URL"? The principal idea, though that might have been changed in practical use by others I'm just figuring now, of NP is that a user provides the root or base URLs (cf. the readme.md at the very top, "You only need to provide the root URL of the news website to crawl it completely").

yldoctrine · 2024-08-12T12:52:54Z

Ah ! I guess I may have forgotten this idea.
The idea was the "provided_url" would be https://www.weka.fr/actualite/ so it would access the RSS feed at https://www.weka.fr/actualite/feed instead of providing https://www.weka.fr/

While checking the sitemap, I saw this sitemap_patterns
I guess I should go for this kind of implementation then ?

fhamborg · 2024-08-12T15:14:40Z

Thanks for the clarification! What do you mean by "this kind of implementation", i.e., what exactly are you proposing to implement? Just trying to avoid misunderstandings. As a heads up, I won't be able to work on news-please until beginning of September, and will get back to this then.

yldoctrine · 2024-08-12T15:40:11Z

I mean adding a configuration option like what was done for the sitemap_patterns.
If the RSS feed is not found by the regex, we search it from the patterns in this configuration

# Set of Rss patterns
# Allow to force the check of specific RSS patterns if it's absent from the root URL
# Here is an example of definition:
# rss_patterns = [
#   "feed",
#   "rss.xml",
#   "blog/feed",
#   "blog/rss.xml"
#   "actualite/feed",
#   ]
# default: [] which means there will be no additional rss check
rss_pattern = []

yldoctrine · 2024-11-18T16:20:22Z

Hello @fhamborg, sorry I'm coming back to you on this after some time

To recap : some websites provide an RSS feed but it is not listed on the homepage. For example the website https://www.weka.fr/actualite/ provides an RSS feed, but you cannot find it from https://www.weka.fr/

In this PR we add a new configuration option, where we provide a list of the common pages we can find the RSS feed from.

rss_parent_pages = ['', 'blog', 'actualite']

The previous example allow to search for an RSS feed from the home page, the /blog page and the /actualite page.

If the configuration is missing, it only searches from the homepage, like it currently does.

Thanks for your time and input

fhamborg · 2024-11-20T08:52:47Z

Hi @yldoctrine again, thanks for the PR :) Let's get to it now. I'm just wondering, how does this differ from

news-please/newsplease/config/config.cfg

Line 101 in 04db94c

sitemap_patterns = []

(as you also noted). E.g., in my understanding both of the following would be identical:

rss_parent_pages = ['', 'blog', 'actualite']

is identical to

sitemap_patterns = ["sitemap.xml", "blog/sitemap.xml", "acualite/sitemap.xml"]

Or did I miss something?

yldoctrine · 2024-11-22T08:38:15Z

By identical, do you mean rss_parent_pages = sitemap_patterns or that they should work alike ?

For the sitemap_patterns the goal was to be able to access directly the sitemaps when they were not listed anywhere (like in robots.txt) by providing a list of common patterns.
For the rss, here we still are looking for a feed that is mentioned somewhere, so not accessing the feed directly.

So should we have rss_parent_pages = sitemap_patterns = ['', 'blog', 'actualite'], I don't think so because then the sitemaps that are not listed would not be found.

So maybe identical they should be "working the same way" like the following

rss_patterns = ["rss.xml", "blog/rss.xml", "actualite/rss.xml"]
sitemap_patterns = ["sitemap.xml", "blog/sitemap.xml", "actualite/sitemap.xml"]

but having a set of common patterns is not the same as retrieving the information directly from the website, because then we might be missing properly built rss feed that have not a standard name.

Tell me what you think

fhamborg · 2024-12-02T15:47:40Z

Then perhaps I have misunderstood the purpose of rss_parent_pages = ['', 'blog', 'actualite'] Just to avoid any misunderstanding, could you write an update post on what specifically this PR intends to do and how the PR's code achieves that? So far we have a bunch of new terms here (e.g., crawl_from_base_domain_by_rss_crawler, rss_patterns, and rss_parent_pages - only the last of which is in the actual code), I might have lost track here. Could you also please double-check if it's necessary to update the config_lib.cfg - if I understood the PR correctly (which I might have not), the newly added option would only be relevant to config.cfg

Thanks :)

yldoctrine force-pushed the allow_rss_crawler_to_start_from_url branch from cc01461 to c354720 Compare November 8, 2024 16:26

Correctly extract all Sitemaps urls from robots.txt

1b341a7

yldoctrine force-pushed the allow_rss_crawler_to_start_from_url branch from c354720 to 1b341a7 Compare November 18, 2024 16:01

Handle fallback if the parameter is not in the config

27b4a6b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow the RssCrawler to search for an RSS feed from the provided url #285

Allow the RssCrawler to search for an RSS feed from the provided url #285

yldoctrine commented Aug 9, 2024

fhamborg commented Aug 12, 2024

yldoctrine commented Aug 12, 2024

fhamborg commented Aug 12, 2024

yldoctrine commented Aug 12, 2024

yldoctrine commented Nov 18, 2024

fhamborg commented Nov 20, 2024

yldoctrine commented Nov 22, 2024

fhamborg commented Dec 2, 2024

Allow the RssCrawler to search for an RSS feed from the provided url #285

Are you sure you want to change the base?

Allow the RssCrawler to search for an RSS feed from the provided url #285

Conversation

yldoctrine commented Aug 9, 2024

fhamborg commented Aug 12, 2024

yldoctrine commented Aug 12, 2024

fhamborg commented Aug 12, 2024

yldoctrine commented Aug 12, 2024

yldoctrine commented Nov 18, 2024

fhamborg commented Nov 20, 2024

yldoctrine commented Nov 22, 2024

fhamborg commented Dec 2, 2024