Please add stripping of duplicates in FeedMergeBridge based on title #4302

Open
DrErfinder opened this issue Oct 16, 2024 · 1 comment
Labels
Feature-Request Issue is a feature request

Comments


DrErfinder commented Oct 16, 2024

Is your feature request related to a problem? Please describe
This project and the FeedMergeBridge by @dvikan are amazing, but I’ve stumbled over a problem.
Sometimes two feeds provide the same article. See #2855.
The existing deduplication in FeedMergeBridge works great when the URLs provided by the feeds are identical. My problem is that some of my feeds use a different URL for the same article. The title, however, is identical, so filtering duplicates based on the title is possible.
I’ve provided an example below under "Additional context".

Describe the solution you'd like
It would be great if FeedMergeBridge could filter based on title. The easiest solution would be to first strip out all duplicates based on URL and then strip the remaining ones based on title.
It would be ideal if there were two checkboxes when setting up FeedMergeBridge: one for "Filter by url" and another for "Filter by title". This would allow the user to configure the desired behavior.
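
A minimal sketch of that two-pass idea, assuming each item is an array with optional 'uri' and 'title' string keys. The function names `dedupeByKey` and `dedupeItems` and the two boolean flags are hypothetical and stand in for the two checkboxes; none of this is existing RSS-Bridge code. Unlike the overwrite approach, this sketch keeps the first occurrence of a duplicate rather than the last.

```php
<?php
// Hypothetical helper (not part of RSS-Bridge): drop every item whose value
// under $key has already been seen. Items without a value are always kept.
function dedupeByKey(array $items, string $key): array
{
    $seen = [];
    $result = [];
    foreach ($items as $item) {
        $value = (string) ($item[$key] ?? '');
        if ($value === '') {
            // Nothing to compare on, so keep the item
            $result[] = $item;
            continue;
        }
        if (isset($seen[$value])) {
            // Duplicate: the first occurrence is already in $result
            continue;
        }
        $seen[$value] = true;
        $result[] = $item;
    }
    return $result;
}

// Hypothetical entry point: the two flags mirror the proposed checkboxes.
function dedupeItems(array $items, bool $filterByUrl = true, bool $filterByTitle = true): array
{
    if ($filterByUrl) {
        $items = dedupeByKey($items, 'uri');
    }
    if ($filterByTitle) {
        $items = dedupeByKey($items, 'title');
    }
    return $items;
}
```

With both flags enabled, the URL pass runs first and the title pass only sees what survived it, so the existing URL-based behavior is preserved when "Filter by title" is switched off.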

Describe alternatives you've considered
I've copied FeedMergeBridge.php, renamed the file and class, and added the resulting new bridge to my installation.
I then edited the new bridge, replacing all occurrences of 'uri' with 'title' in lines 96 to 109.
The result looks like this:

// Remove duplicates by using title as unique key
$items = [];
foreach ($this->items as $item) {
	$index = $item['title'] ?? null;
	if ($index) {
		// Overwrite duplicates
		$items[$index] = $item;
	} else {
		$items[] = $item;
	}
}

See https://gist.github.com/DrErfinder/9a0ba483f897e21a0c4db0c6cae567f4 for full file.
So far this works flawlessly for my purposes.
However, I’m not good enough with PHP to know whether that's the correct approach, or how to implement it without losing the existing URL-based deduplication.

Additional context
Here’s an example:
The German IT news website Heise.de provides a whole host of RSS feeds (see https://www.heise.de/news-extern/news.html for the full list).
Some of the articles in their Apple-specific feed (https://www.heise.de/mac-and-i/feed.xml) are also published in their general news feed (https://www.heise.de/rss/heise-atom.xml).
The URL of the article is
"https://www.heise.de/news/Apple-Studie-Logisches-Denken-von-KI-kaum-nachweissbar-und-sehr-fragil-9980855.html?wt_mc=rss.red.ho.ho.atom.beitrag.beitrag" in one feed and
"https://www.heise.de/news/Apple-Studie-Logisches-Denken-von-KI-kaum-nachweissbar-und-sehr-fragil-9980855.html" in the other.
The title is "KI und logisches Denken: Apple-Forscher zweifeln – und warnen" in both cases. I assume the URL differs for some form of analytics/statistics.
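
To illustrate that assumption, here is a small sketch showing that the two URLs above become identical once the query string is dropped. The helper `stripQuery` is hypothetical and only meant to check the example; it is not a proposal to replace title-based filtering, since on other sites the query string may be what identifies the article.

```php
<?php
// Hypothetical helper: return the URL without its query string and fragment.
// URLs that cannot be parsed into scheme and host are returned unchanged.
function stripQuery(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url;
    }
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    return $parts['scheme'] . '://' . $parts['host'] . $port . ($parts['path'] ?? '');
}

$withTracking = 'https://www.heise.de/news/Apple-Studie-Logisches-Denken-von-KI-kaum-nachweissbar-und-sehr-fragil-9980855.html?wt_mc=rss.red.ho.ho.atom.beitrag.beitrag';
$plain = 'https://www.heise.de/news/Apple-Studie-Logisches-Denken-von-KI-kaum-nachweissbar-und-sehr-fragil-9980855.html';

// Compare the two feed URLs after removing the query string
var_dump(stripQuery($withTracking) === stripQuery($plain));
```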

dvikan (Contributor) commented Nov 23, 2024

Yes, I think this is a good feature. Items with identical titles can indeed be considered duplicates.
