Generating a filtered full-text RSS feed

Generating a filtered full-text RSS feed from an existing limited RSS feed

Problem: Your favourite website has a feed, but you only wish to read certain types of posts, and its feed only provides summarized text.

Solution: Use Huginn to create an RSS feed that has filtered full-text content. The workflow to filter posts and fetch full text is as follows:

1. RssAgent

Name: 1.Download Dilbert limited feed

{
  "expected_update_period_in_days": "1",
  "clean": "false",
  "url": "https://dilbert.com/feed.rss"
}

Note: the RSS URL for a site is usually advertised in a <link rel="alternate"> tag inside the HEAD element of the page's HTML
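If you would rather not read the page source by hand, the lookup can be sketched in a few lines of Python. This is only an illustration using the standard library: the `FeedLinkFinder` class and the sample markup are hypothetical, and it assumes the site advertises its feed with a `<link rel="alternate">` tag, which is common but not guaranteed.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate"> feed tag it sees."""

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "alternate" in rel_values and attributes.get("type") in self.FEED_TYPES:
            href = attributes.get("href")
            if href:
                self.feed_urls.append(href)


# Hypothetical page markup, for illustration only.
sample_html = """
<html><head>
  <title>Example</title>
  <link rel="stylesheet" href="/site.css">
  <link rel="alternate" type="application/rss+xml" href="https://example.com/feed.rss">
</head><body></body></html>
"""

finder = FeedLinkFinder()
finder.feed(sample_html)
feed_urls = finder.feed_urls
print(feed_urls)
```

The URL found this way is what goes into the "url" option of the RssAgent.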

2. TriggerAgent

Name: 2.Keep entries with comic strips
Event sources: 1.Download Dilbert limited feed
Propagate immediately: Yes

{
  "expected_receive_period_in_days": "1",
  "keep_event": "true",
  "rules": [
    {
      "type": "regex",
      "value": ".*strip.*",
      "path": "url"
    }
  ]
}

Notes:

"keep_event": "true" passes the original parsed item fields on to the next agent

the regex value matches any URL containing the word "strip", which for us means comic strip items
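To get a feel for what the rule keeps and what it drops, here is a rough Python equivalent. It is a sketch, not Huginn's implementation: the sample URLs are made up, the `matches` helper is hypothetical, and "path" is treated as a plain dictionary key rather than a full JSONPath expression.

```python
import re

# The same rule as in the TriggerAgent options above.
rule = {"type": "regex", "value": ".*strip.*", "path": "url"}

# Hypothetical feed items, for illustration only.
items = [
    {"url": "https://example.com/strip/2019-01-01"},
    {"url": "https://example.com/blog/some-post"},
]


def matches(rule, item):
    """Return True when the field named by rule["path"] matches the regex."""
    value = str(item.get(rule["path"], ""))
    return re.search(rule["value"], value) is not None


kept = [item for item in items if matches(rule, item)]
print(kept)
```

Since the pattern is searched for anywhere in the value, a plain "strip" would most likely behave the same as ".*strip.*".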

3. WebsiteAgent

Name: 3.Get Content
Event sources: 2.Keep entries with comic strips
Propagate immediately: Yes

{
  "expected_update_period_in_days": "2",
  "url": "{{url}}",
  "type": "html",
  "mode": "merge",
  "extract": {
    "comictitle": {
      "css": ".comic-title-name",
      "value": "string(.)"
    },
    "comicrating": {
      "css": "div.comic-rating",
      "value": "@data-total"
    },
    "comicimgurl": {
      "xpath": "/html/body/div/div[4]/div[1]/div/div[2]/div/section/div[3]/a/img",
      "value": "@src"
    }
  }
}

Notes:

"mode": "merge" merges the extracted values into the incoming event, so the original parsed item fields are passed on to the next agent

use "Dry Run" here to confirm you are getting the same, correct number of each element

we want the string value of the element with the class comic-title-name

we want the data-total attribute of the div with the class comic-rating

we want the src attribute of the image located at the given xpath

the "xpath" is obtained using the browser web inspector, as follows:

open web inspector

in the html source, find the element you are interested in

right click on the element to show the context menu

choose Copy > XPath
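The three extraction rules and the merge step can be tried out on a small sample before touching the real site. The Python sketch below is an approximation under several assumptions: the sample page and the incoming event are invented, the standard library parser only accepts well-formed markup and a limited XPath subset, and the class lookups are written as XPath attribute tests instead of CSS selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a comic page.
sample_page = """
<html><body>
  <span class="comic-title-name">Sample Title</span>
  <div class="comic-rating" data-total="42"></div>
  <section><a href="/strip/1"><img src="https://example.com/comic.png"/></a></section>
</body></html>
"""

root = ET.fromstring(sample_page)

# Rough equivalents of the "css" and "xpath" selectors in the agent options.
title_element = root.find(".//*[@class='comic-title-name']")
rating_element = root.find(".//div[@class='comic-rating']")
image_element = root.find("./body/section/a/img")

extracted = {
    "comictitle": "".join(title_element.itertext()),    # like string(.)
    "comicrating": rating_element.get("data-total"),    # like @data-total
    "comicimgurl": image_element.get("src"),            # like @src
}

# "mode": "merge" keeps the incoming event's fields and adds the new ones.
incoming_event = {"url": "https://example.com/strip/1", "title": "Sample entry"}
merged = {**incoming_event, **extracted}
print(merged)
```

Note how short the sample path to the image is compared with the copied XPath in the agent options. Absolute paths copied from the inspector tend to break when the page layout changes, so a class-based selector may be more robust where one is available.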

4. DataOutputAgent

Name: 4.Output RSS
Event sources: 3.Get Content
Propagate immediately: Yes

{
  "secrets": [
    "dilbert-full-feed"
  ],
  "expected_receive_period_in_days": "1",
  "events_order": [
    [
      "{{id}}",
      "string",
      "false"
    ]
  ],
  "template": {
    "title": "Dilbert full feed",
    "description": "This is a feed of recent Dilbert comics, generated by Huginn",
    "item": {
      "title": "{{title}}",
      "description": "<p><em>{{comictitle}}</em></p><p>Rating: {{comicrating}}</p><p><a href=\"{{url}}\"><img src=\"{{comicimgurl}}\"></a></p>",
      "link": "{{url}}",
      "date_published": "{{last_updated}}",
      "pubDate": "{{last_updated}}"
    }
  },
  "ns_media": "true"
}

Notes:

"events_order" is needed to keep the items in the initial collection of items in the correct order on first subscription

we're ordering by the "id" field, which holds a date, comparing it as a "string" and not reversing the sort ("false")

we write a small amount of HTML for the body/description of the item, containing the comictitle, the comicrating, and the comicimgurl image wrapped in a link to url (the comic page)

to ensure all items are correctly dated we set both date_published and pubDate to the last_updated value which is present in the original feed

bonus: if the time we had was not formatted correctly we could do something like: "pubDate": "{{timestamp | date: '%a, %d %b %Y %H:%M:%S %z'}}" to get from a Unix timestamp to the date format expected in an RSS feed.
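To check what that format string produces, the same conversion can be reproduced outside Huginn. The Python sketch below is only an illustration: the `unix_to_rss_date` helper is hypothetical, and it assumes the timestamp should be rendered in UTC, whereas Huginn may apply its own configured time zone.

```python
from datetime import datetime, timezone

# Same format string as in the Liquid date filter above.
RSS_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def unix_to_rss_date(timestamp):
    """Format a Unix timestamp (seconds) as an RSS-style date in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime(RSS_DATE_FORMAT)


rss_date = unix_to_rss_date(0)
print(rss_date)  # Thu, 01 Jan 1970 00:00:00 +0000
```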

Scenario

This JSON file can be imported directly into Huginn: dilbert-full-feed.json
