
Generating a filtered full text RSS feed

Matt Sephton edited this page Feb 13, 2022 · 11 revisions

Generating a filtered full-text RSS feed from an existing limited RSS feed

Problem: Your favourite website has a feed, but you only want to read certain types of posts, and the feed provides only summarized text.

Solution: Use Huginn to create an RSS feed that has filtered full-text content. The workflow to filter posts and fetch full text is as follows:

Contents

  1. RssAgent - to fetch and parse existing RSS feed
  2. TriggerAgent - to filter feed items
  3. WebsiteAgent - to fetch full content for feed item
  4. DataOutputAgent - to output RSS

Examples based on Dilbert

1. RssAgent

Name: 1.Download Dilbert limited feed

{
  "expected_update_period_in_days": "1",
  "clean": "false",
  "url": "https://dilbert.com/feed.rss"
}

Note: you can usually find a site's RSS URL in a <link rel="alternate"> tag inside the HEAD element of its HTML.
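
If you would rather not read the page source by hand, the lookup can be scripted. The following is a minimal sketch in Python, not part of Huginn; the class name FeedLinkFinder, the function find_feed_urls, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that advertise a feed."""

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        link_type = (attr.get("type") or "").lower()
        if "alternate" in rel and link_type in self.FEED_TYPES and attr.get("href"):
            self.feed_urls.append(attr["href"])


def find_feed_urls(html):
    """Return every feed URL advertised in the given HTML source."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feed_urls


# Hypothetical page source, for illustration only
sample_html = """
<html><head>
<title>Example</title>
<link rel="stylesheet" href="/site.css">
<link rel="alternate" type="application/rss+xml" title="Feed" href="https://example.com/feed.rss">
</head><body></body></html>
"""

print(find_feed_urls(sample_html))
```

The URL it prints is the value that goes into the "url" option of the RssAgent.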

2. TriggerAgent

Name: 2.Keep entries with comic strips
Event sources: 1.Download Dilbert limited feed
Propagate immediately: Yes

{
  "expected_receive_period_in_days": "1",
  "keep_event": "true",
  "rules": [
    {
      "type": "regex",
      "value": ".*strip.*",
      "path": "url"
    }
  ]
}

Notes:

  • "keep_event": "true" passes the original parsed item fields on to the next agent
  • the regex value matches any URL containing the word "strip", which for us means comic strip items
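
To check a rule like this before saving the agent, it can help to try the pattern against a few URLs. Below is a rough Python sketch of the filtering idea; Huginn itself evaluates the rule in Ruby, so treat this only as an approximation. The function keep_event and the sample events are hypothetical.

```python
import re

# Same pattern as the "value" in the TriggerAgent rule above
RULE = re.compile(r".*strip.*")


def keep_event(event):
    """Return True when the event's "url" field matches the rule."""
    return RULE.search(event.get("url", "")) is not None


# Hypothetical events, shaped like parsed feed items
events = [
    {"title": "Comic for a given day", "url": "https://example.com/strip/2022-02-13"},
    {"title": "Blog post", "url": "https://example.com/blog/some-post"},
]

kept = [e for e in events if keep_event(e)]
for e in kept:
    print(e["url"])
```

In this sketch the search is unanchored, so the shorter pattern "strip" would select the same URLs.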

3. WebsiteAgent

Name: 3.Get Content
Event sources: 2.Keep entries with comic strips
Propagate immediately: Yes

{
  "expected_update_period_in_days": "2",
  "url": "{{url}}",
  "type": "html",
  "mode": "merge",
  "extract": {
    "comictitle": {
      "css": ".comic-title-name",
      "value": "string(.)"
    },
    "comicrating": {
      "css": "div.comic-rating",
      "value": "@data-total"
    },
    "comicimgurl": {
      "xpath": "/html/body/div/div[4]/div[1]/div/div[2]/div/section/div[3]/a/img",
      "value": "@src"
    }
  }
}

Notes:

  • "mode": "merge" passes the original parsed item fields on to the next agent, alongside the newly extracted values
  • use "Dry Run" here to confirm that every extractor returns the same number of matches, and that the number is correct
  • we want the string value of the element with the class comic-title-name
  • we want the data-total attribute of the div with the class comic-rating
  • we want the src attribute of the image located at the given xpath
  • the "xpath" is obtained using the browser web inspector, as follows:
    1. open web inspector
    2. in the html source, find the element you are interested in
    3. right click on the element to show the context menu
    4. choose Copy > XPath
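
The three extractors and the effect of "merge" can be pictured with a small Python sketch. It is only an illustration under loud assumptions: the page fragment is invented and deliberately well-formed so that the standard library XML parser can read it, the class lookup is an exact attribute match rather than a real CSS selector, and the short path to the image stands in for the long XPath above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified, well-formed page fragment (real pages are messier)
page = """
<html><body>
  <div class="comic-title-name">Example <b>title</b></div>
  <div class="comic-rating" data-total="42"></div>
  <section><a href="/strip/1"><img src="https://example.com/comic.png"/></a></section>
</body></html>
"""
root = ET.fromstring(page)

title_el = root.find(".//*[@class='comic-title-name']")
rating_el = root.find(".//div[@class='comic-rating']")
img_el = root.find("./body/section/a/img")

extracted = {
    "comictitle": "".join(title_el.itertext()),  # all text inside the element, like string(.)
    "comicrating": rating_el.get("data-total"),  # attribute value, like @data-total
    "comicimgurl": img_el.get("src"),            # attribute value, like @src
}

# Hypothetical incoming event from the previous agent
incoming = {"url": "https://example.com/strip/1", "title": "Feed item title"}

# "merge" behaviour, roughly: keep the incoming fields and add the extracted ones
merged = {**incoming, **extracted}
print(merged)
```

The merged result is why the DataOutputAgent can use both the original fields (such as url and title) and the extracted ones in its template.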

4. DataOutputAgent

Name: 4.Output RSS
Event sources: 3.Get Content
Propagate immediately: Yes

{
  "secrets": [
    "dilbert-full-feed"
  ],
  "expected_receive_period_in_days": "1",
  "events_order": [
    [
      "{{id}}",
      "string",
      "false"
    ]
  ],
  "template": {
    "title": "Dilbert full feed",
    "description": "This is a feed of recent Dilbert comics, generated by Huginn",
    "item": {
      "title": "{{title}}",
      "description": "<p><em>{{comictitle}}</em></p><p>Rating: {{comicrating}}</p><p><a href=\"{{url}}\"><img src=\"{{comicimgurl}}\"></a></p>",
      "link": "{{url}}",
      "date_published": "{{last_updated}}",
      "pubDate": "{{last_updated}}"
    }
  },
  "ns_media": "true"
}

Notes:

  • "events_order" is needed to keep the initial collection of items in the correct order on first subscription
  • we order by the "id" field, which holds a date and is compared as a "string"; the final "false" means the order is not reversed
  • we write a small amount of HTML for the body/description of the item, containing the comictitle, the comicrating, and the comicimgurl image wrapped in a link to the comic page (url)
  • to ensure all items are correctly dated we set both date_published and pubDate to the last_updated value which is present in the original feed
  • bonus: if the time we had was not formatted correctly we could do something like: "pubDate": "{{timestamp | date: '%a, %d %b %Y %H:%M:%S %z'}}" to get from a Unix timestamp to the date format expected in an RSS feed.
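
To see what that format string produces, here is a short Python sketch of the same conversion. It is not Huginn code; the function name unix_to_rss_date is made up, and the sketch assumes the timestamp is in seconds and should be rendered in UTC.

```python
from datetime import datetime, timezone

# Same format string as in the Liquid date filter above
RSS_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def unix_to_rss_date(timestamp):
    """Format a Unix timestamp (seconds) as an RSS-style date in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime(RSS_DATE_FORMAT)


print(unix_to_rss_date(0))  # Thu, 01 Jan 1970 00:00:00 +0000
```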

Scenario

This JSON file can be imported directly into Huginn: dilbert-full-feed.json
