-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow passing in an array of elements_to_score and add 'pre' as a default #94
Merged
cantino
merged 2 commits into
cantino:master
from
beyondwords-io:allow-passing-in-elements-to-score
Nov 11, 2023
Merged
Allow passing in an array of elements_to_score and add 'pre' as a default #94
cantino
merged 2 commits into
cantino:master
from
beyondwords-io:allow-passing-in-elements-to-score
Nov 11, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…ault We were experiencing a problem where the h1 text was not being included in the Readability#content. Here is an example that demonstrates the problem: ``` <article> <header> <h1>Title</h1> </header> <section> <p>Paragraph</p> </section> </article> ``` Previously, the code would add the <p>, <section> and <article> elements as @candidates because it adds the parent and grand parent of every <p>. It would not add the <header> element as a candidate. Then, the best_candidate with the highest score is the <section> element. The code then tries to add related siblings in #get_article but it wasn't adding the <header> element because it wasn't in the list of candidates. We can solve this problem by adding <h1> to the list of elements to score which will then ensure that <header> parent is included in the candidates and can be added as a related sibling. This commit also adds <pre> to the list of default nodes to score because it is included in arc90's original code here: https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749 I'm not sure why this was omitted.
…option The code had two strategies for determining whether to include siblings in the output after determining the best candidate based on score: 1) It checked if the sibling is a candidate that scored above a threshold which is the maximum of 10 and 0.2 of the best_candidate’s score. 2) It checked if the sibling was a paragraph that was longer than 80 characters with a penalty given for each link within the paragraph. Neither of these strategies worked well for extracting <h1> titles: 1) Failed because titles score poorly due to not containing many commas 2) Failed because titles are within <h1> or <header> elements However, titles are usually longer than 80 characters and don’t contain links so it seems reasonable to modify strategy 2) to allow for other elements, such as <h1> and <header> to be included as related siblings. Therefore, this commit introduces a :likely_silings option that defaults to the same <p> elements as before but can now be set by the developer to include other elements such as <h1> and <header>. These are not added by default to remain in sync with Arc90’s original implementation.
Thanks @tuzz! |
@cantino will you be releasing a new version of this on https://rubygems.org/gems/ruby-readability? 🤗 |
Done, 0.7.1. |
nattsw
added a commit
to discourse/discourse
that referenced
this pull request
Jun 19, 2024
…content (#27508) For Topic Embeds, we would prefer <article> to be the main article in a topic, rather than a table cell <td> with potentially a lot of data. However, in an example URL like here, the table cell (the very large code snippet) is seen as the Topic Embed's article due to the determined content weight by the Readability library we use. In the newly released 0.7.1 cantino/ruby-readability#94, the library has a new option to exclude the library's default <td> element into content weighting. This is more in line with the original library where they only weighted <p>. So this PR excludes the td, as seen in the tests, to allow the actual article to be seen as the article. This PR also adds the details tag into the allow-list.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Allow passing in an array of elements_to_score and add 'pre' as a default
We were experiencing a problem where the h1 text was not being included
in the Readability#content. Here is an example that demonstrates the problem:
Previously, the code would add the
<p>
,<section>
and<article>
elementsas
@candidates
because it adds the parent and grand parent of every<p>
. Itwould not add the
<header>
element as a candidate.Then, the best_candidate with the highest score is the
<section>
element.The code then tries to add related siblings in #get_article but it wasn't
adding the
<header>
element because it wasn't in the list of candidates.We can solve this problem by adding
<h1>
to the list of elements to scorewhich will then ensure that
<header>
parent is included in thecandidates and can be added as a related sibling.
This commit also adds
<pre>
to the list of default nodes to scorebecause it is included in arc90's original code here:
https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749
I'm not sure why this was omitted. Perhaps remove remove code blocks?
Furthermore, this pull request adds a second commit that attempts to solve a follow-on problem:
The code had two strategies for determining whether to include siblings
in the output after determining the best candidate based on score:
It checked if the sibling is a candidate that scored above a threshold
which is the maximum of 10 and 0.2 of the best_candidate’s score.
It checked if the sibling was a paragraph that was longer than 80
characters with a penalty given for each link within the paragraph.
Neither of these strategies worked well for extracting
<h1>
titles:<h1>
or<header>
elementsHowever, titles are usually longer than 80 characters and don’t contain
links so it seems reasonable to modify strategy 2) to allow for other
elements, such as
<h1>
and<header>
to be included as related siblings.Therefore, this commit introduces a
:likely_silings
option that defaultsto the same
<p>
elements as before but can now be set by the developerto include other elements such as
<h1>
and<header>
. These are notadded by default to remain in sync with Arc90’s original implementation.