Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the RssCrawler to search for an RSS feed from the provided url #285

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yldoctrine
Copy link
Contributor

Hello @fhamborg ,

Here is an improvement for the RssCrawler. Some websites provide an RSS feed that is only accessible from specific pages and I would like to be able to use the RssCrawler on those sites. The current behaviour is that the RssCrawler looks for the RSS feed from the base page.
For example the website https://www.weka.fr/actualite/ provides an RSS feed, but it is not accessible from https://www.weka.fr/

In this pull request, I'm adding a new configuration option crawl_from_base_domain_by_rss_crawler that by default keeps the current behaviour of checking the for an RSS feed from the base url, and if set to False will allow to search for an RSS feed from the provided url

Tell me if you have any question

@fhamborg
Copy link
Owner

Thanks for the PR. Could you briefly elaborate on what you mean by "provided URL"? The principal idea, though that might have been changed in practical use by others I'm just figuring now, of NP is that a user provides the root or base URLs (cf. the readme.md at the very top, "You only need to provide the root URL of the news website to crawl it completely").

@yldoctrine
Copy link
Contributor Author

Ah ! I guess I may have forgotten this idea.
The idea was the "provided_url" would be https://www.weka.fr/actualite/ so it would access the RSS feed at https://www.weka.fr/actualite/feed instead of providing https://www.weka.fr/

While checking the sitemap, I saw this sitemap_patterns
I guess I should go for this kind of implementation then ?

@fhamborg
Copy link
Owner

Thanks for the clarification! What do you mean by "this kind of implementation", i.e., what exactly are you proposing to implement? Just trying to avoid misunderstandings. As a heads up, I won't be able to work on news-please until beginning of September, and will get back to this then.

@yldoctrine
Copy link
Contributor Author

I mean adding a configuration option like what was done for the sitemap_patterns.
If the RSS feed is not found by the regex, we search it from the patterns in this configuration

# Set of Rss patterns
# Allow to force the check of specific RSS patterns if it's absent from the root URL
# Here is an example of definition:
# rss_patterns = [
#   "feed",
#   "rss.xml",
#   "blog/feed",
#   "blog/rss.xml"
#   "actualite/feed",
#   ]
# default: [] which means there will be no additional rss check
rss_pattern = []

@yldoctrine yldoctrine force-pushed the allow_rss_crawler_to_start_from_url branch from cc01461 to c354720 Compare November 8, 2024 16:26
@yldoctrine yldoctrine force-pushed the allow_rss_crawler_to_start_from_url branch from c354720 to 1b341a7 Compare November 18, 2024 16:01
@yldoctrine
Copy link
Contributor Author

Hello @fhamborg, sorry I'm coming back to you on this after some time

To recap : some websites provide an RSS feed but it is not listed on the homepage. For example the website https://www.weka.fr/actualite/ provides an RSS feed, but you cannot find it from https://www.weka.fr/

In this PR we add a new configuration option, where we provide a list of the common pages we can find the RSS feed from.

rss_parent_pages = ['', 'blog', 'actualite']

The previous example allow to search for an RSS feed from the home page, the /blog page and the /actualite page.

If the configuration is missing, it only searches from the homepage, like it currently does.

Thanks for your time and input

@fhamborg
Copy link
Owner

Hi @yldoctrine again, thanks for the PR :) Let's get to it now. I'm just wondering, how does this differ from

sitemap_patterns = []
(as you also noted). E.g., in my understanding both of the following would be identical:

rss_parent_pages = ['', 'blog', 'actualite']

is identical to

sitemap_patterns = ["sitemap.xml", "blog/sitemap.xml", "acualite/sitemap.xml"]

Or did I miss something?

@yldoctrine
Copy link
Contributor Author

By identical, do you mean rss_parent_pages = sitemap_patterns or that they should work alike ?

For the sitemap_patterns the goal was to be able to access directly the sitemaps when they were not listed anywhere (like in robots.txt) by providing a list of common patterns.
For the rss, here we still are looking for a feed that is mentioned somewhere, so not accessing the feed directly.

So should we have rss_parent_pages = sitemap_patterns = ['', 'blog', 'actualite'], I don't think so because then the sitemaps that are not listed would not be found.

So maybe identical they should be "working the same way" like the following

rss_patterns = ["rss.xml", "blog/rss.xml", "actualite/rss.xml"]
sitemap_patterns = ["sitemap.xml", "blog/sitemap.xml", "actualite/sitemap.xml"]

but having a set of common patterns is not the same as retrieving the information directly from the website, because then we might be missing properly built rss feed that have not a standard name.

Tell me what you think

@fhamborg
Copy link
Owner

fhamborg commented Dec 2, 2024

Then perhaps I have misunderstood the purpose of rss_parent_pages = ['', 'blog', 'actualite'] Just to avoid any misunderstanding, could you write an update post on what specifically this PR intends to do and how the PR's code achieves that? So far we have a bunch of new terms here (e.g., crawl_from_base_domain_by_rss_crawler, rss_patterns, and rss_parent_pages - only the last of which is in the actual code), I might have lost track here. Could you also please double-check if it's necessary to update the config_lib.cfg - if I understood the PR correctly (which I might have not), the newly added option would only be relevant to config.cfg

Thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants