Generating a filtered full text RSS feed
Problem: Your favourite website has a feed, but you only want to read certain types of posts, and the feed provides only summarized text.
Solution: Use Huginn to create an RSS feed that has filtered full-text content. The workflow to filter posts and fetch full text is as follows:
- RssAgent - to fetch and parse the existing RSS feed
- TriggerAgent - to filter the feed items
- WebsiteAgent - to fetch the full content for each feed item
- DataOutputAgent - to output the new RSS feed
Examples based on Dilbert
Name: 1.Download Dilbert limited feed
{
"expected_update_period_in_days": "1",
"clean": "false",
"url": "https://dilbert.com/feed.rss"
}
Note: you can usually find the RSS URL for a site in the HEAD element of its HTML.
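As a rough illustration of the note above, the following Python sketch scans a page's HTML for feed links. The sample HTML, the "FeedLinkFinder" class, and the "find_feed_urls" helper are hypothetical names made up for this example. It assumes the site advertises its feed with a standard "link rel=alternate" element, which not every site does.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise feeds in <link> elements.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> feed elements."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        # Note: this checks every <link> in the document, not only those in HEAD.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        link_type = (attr.get("type") or "").lower()
        if "alternate" in rel and link_type in FEED_TYPES:
            href = attr.get("href")
            if href:
                self.feed_urls.append(href)


def find_feed_urls(html):
    """Return all feed URLs advertised in the given HTML string."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feed_urls


# Hypothetical sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
  <title>Example</title>
  <link rel="stylesheet" href="/style.css">
  <link rel="alternate" type="application/rss+xml" href="https://example.com/feed.rss">
</head><body></body></html>
"""

print(find_feed_urls(SAMPLE_HTML))
```

In practice, viewing the page source in the browser and searching for "rss" is usually just as quick.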
Name: 2.Keep entries with comic strips
Event sources: 1.Download Dilbert limited feed
Propagate immediately: Yes
{
"expected_receive_period_in_days": "1",
"keep_event": "true",
"rules": [
{
"type": "regex",
"value": ".*strip.*",
"path": "url"
}
]
}
Notes:
- "keep_event": "true" helps pass on original parsed item elements to next agent
- the "regex" value matches any "url" containing the word "strip", which for us means comic strip items
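To get a feel for what this rule keeps and drops, here is a minimal Python sketch of the same filter. Huginn itself is written in Ruby and evaluates the rule internally, so this only mimics the behaviour for this simple pattern. The example URLs and the "keep_event" helper are hypothetical.

```python
import re

# Same pattern as the "value" in the TriggerAgent rule above.
PATTERN = re.compile(r".*strip.*")


def keep_event(event):
    """Return True when the event's "url" field matches the pattern."""
    return PATTERN.search(event.get("url", "")) is not None


# Hypothetical events, for illustration only.
events = [
    {"url": "https://example.com/strip/2020-01-01"},
    {"url": "https://example.com/blog/some-post"},
]

kept = [e for e in events if keep_event(e)]
print(kept)
```

With a search-style match like the one in this sketch, the leading and trailing ".*" are redundant, so a plain "strip" would select the same URLs.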
Name: 3.Get Content
Event sources: 2.Keep entries with comic strips
Propagate immediately: Yes
{
"expected_update_period_in_days": "2",
"url": "{{url}}",
"type": "html",
"mode": "merge",
"extract": {
"comictitle": {
"css": ".comic-title-name",
"value": "string(.)"
},
"comicrating": {
"css": "div.comic-rating",
"value": "@data-total"
},
"comicimgurl": {
"xpath": "/html/body/div/div[4]/div[1]/div/div[2]/div/section/div[3]/a/img",
"value": "@src"
}
}
}
Notes:
- "mode": "merge" helps pass on original parsed item elements to next agent
- use "Dry Run" here to confirm you are getting the same, correct number of each element
- we want the "string" value of the element with the class "comic-title-name"
- we want the "data-total" attribute of the "div" with the class "comic-rating"
- we want the "src" attribute of the image located at the given "xpath"
- the "xpath" is obtained using the browser web inspector, as follows:
  - open the web inspector
  - in the HTML source, find the element you are interested in
  - right click on the element to show the context menu
  - choose Copy > XPath
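The three extractions and the effect of "mode": "merge" can be sketched in Python as follows. This is only an approximation: the sample markup and the incoming event are invented, the real page structure will differ, and Python's standard-library ElementTree supports a limited XPath subset and needs well-formed markup, unlike the HTML parser Huginn uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the comic page.
SAMPLE = """
<html><body>
  <div class="comic-rating" data-total="42"></div>
  <span class="comic-title-name">Sample <b>Title</b></span>
  <section><a href="/strip/x"><img src="https://example.com/comic.png"/></a></section>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# "string(.)": the concatenated text of the element and its descendants.
# The predicate is an exact attribute match, not a CSS class-token match.
title_el = root.find(".//*[@class='comic-title-name']")
comictitle = "".join(title_el.itertext())

# "@data-total": an attribute of the div with class "comic-rating".
comicrating = root.find(".//div[@class='comic-rating']").get("data-total")

# "@src": an attribute of the image found by path. ElementTree paths are
# relative to the root element, so there is no leading "/html".
comicimgurl = root.find("./body/section/a/img").get("src")

extracted = {
    "comictitle": comictitle,
    "comicrating": comicrating,
    "comicimgurl": comicimgurl,
}

# Hypothetical incoming event from the previous agent.
incoming = {"url": "https://example.com/strip/x", "title": "Comic for today"}

# "merge": the extracted fields are added to the incoming event's fields.
merged = {**incoming, **extracted}
print(merged)
```

Copied XPaths such as the one above are tied to the exact page layout, so they tend to break when the site is redesigned; class-based selectors are usually more robust where they are available.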
Name: 4.Output RSS
Event sources: 3.Get Content
Propagate immediately: Yes
{
"secrets": [
"dilbert-full-feed"
],
"expected_receive_period_in_days": "1",
"events_order": [
[
"{{id}}",
"string",
"false"
]
],
"template": {
"title": "Dilbert full feed",
"description": "This is a feed of recent Dilbert comics, generated by Huginn",
"item": {
"title": "{{title}}",
"description": "<p><em>{{comictitle}}</em></p><p>Rating: {{comicrating}}</p><p><a href=\"{{url}}\"><img src=\"{{comicimgurl}}\"></a></p>",
"link": "{{url}}",
"date_published": "{{last_updated}}",
"pubDate": "{{last_updated}}"
}
},
"ns_media": "true"
}
Notes:
- "events_order" is needed to keep the initial collection of items in the correct order on first subscription
- we're ordering by the "id" field, which is a date "string", and not sorting in reverse "false"
- we write a small amount of HTML for the body/description of the item, containing the "comictitle", the "comicrating", and the "comicimgurl" image wrapped in a link to "url", the comic page
- to ensure all items are correctly dated we set both "date_published" and "pubDate" to the "last_updated" value, which is present in the original feed
- bonus: if the time we had was not formatted correctly we could do something like:
"pubDate": "{{timestamp | date: '%a, %d %b %Y %H:%M:%S %z'}}"
to get from a Unix timestamp to the date format expected in an RSS feed.
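The format string in the bonus note uses strftime-style directives, so the expected result can be checked outside Huginn. A minimal Python sketch, assuming the timestamp is in seconds and that UTC is an acceptable time zone (Huginn may render in a different zone); "unix_to_rss_date" is a hypothetical helper name:

```python
from datetime import datetime, timezone

# Same directives as in the Liquid "date" filter above.
RSS_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def unix_to_rss_date(timestamp):
    """Format a Unix timestamp (seconds) as an RSS-style date in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime(RSS_DATE_FORMAT)


print(unix_to_rss_date(0))  # Thu, 01 Jan 1970 00:00:00 +0000
```

The day and month names depend on the process locale; the output shown assumes Python's default "C" locale.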
This JSON file can be imported directly into Huginn: dilbert-full-feed.json