Allow passing in an array of elements_to_score and add 'pre' as a default #94

tuzz · 2023-11-09T18:07:34Z

Allow passing in an array of elements_to_score and add 'pre' as a default

We were experiencing a problem where the h1 text was not being included
in the Readability#content. Here is an example that demonstrates the problem:

```
<article>
  <header>
    <h1>Title</h1>
  </header>
  <section>
    <p>Paragraph</p>
  </section>
</article>
```

Previously, the code would add the <p>, <section> and <article> elements
as @candidates because it adds the parent and grandparent of every <p>. It
would not add the <header> element as a candidate.

Then, the best_candidate with the highest score is the <section> element.
The code then tries to add related siblings in #get_article but it wasn't
adding the <header> element because it wasn't in the list of candidates.

We can solve this problem by adding <h1> to the list of elements to score,
which ensures that its parent <header> is included in the
candidates and can be added as a related sibling.
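
A minimal sketch of the candidate collection described above, to show why adding <h1> brings <header> into the candidates. It uses a hypothetical `Node` struct in place of the Nokogiri nodes the library works on, and `candidate_names` is an illustrative helper, not the library's method. It only collects the parent and grandparent of each scored element and ignores scoring entirely.

```ruby
# Hypothetical stand-in for a parsed DOM node (the real library uses Nokogiri nodes).
Node = Struct.new(:name, :children, :parent) do
  # Attach a child node and return it so calls can be chained.
  def add(child)
    child.parent = self
    children << child
    child
  end

  # Depth-first walk over this node and all of its descendants.
  def each_node(&block)
    yield self
    children.each { |c| c.each_node(&block) }
  end
end

def node(name)
  Node.new(name, [], nil)
end

# Build the <article> example from the description.
def build_example
  article = node("article")
  header  = article.add(node("header"))
  header.add(node("h1"))
  section = article.add(node("section"))
  section.add(node("p"))
  article
end

# Collect the parent and grandparent of every element whose tag is scored.
def candidate_names(root, elements_to_score)
  names = []
  root.each_node do |n|
    next unless elements_to_score.include?(n.name)
    [n.parent, n.parent&.parent].compact.each { |c| names << c.name }
  end
  names.uniq
end

before = candidate_names(build_example, %w[p])
after  = candidate_names(build_example, %w[p h1])
puts "scoring p only:   #{before.inspect}"
puts "scoring p and h1: #{after.inspect}"
```

With only `p` scored, `header` never appears in the result; once `h1` is scored, its parent `header` does.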

This commit also adds <pre> to the list of default nodes to score
because it is included in arc90's original code here:

https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749

I'm not sure why this was omitted. Perhaps to remove code blocks?

Furthermore, this pull request adds a second commit that attempts to solve a follow-on problem:

The code had two strategies for determining whether to include siblings
in the output after determining the best candidate based on score:

1) It checked if the sibling is a candidate that scored above a threshold
   which is the maximum of 10 and 0.2 of the best_candidate’s score.
2) It checked if the sibling was a paragraph that was longer than 80
   characters with a penalty given for each link within the paragraph.
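
The two checks can be sketched roughly as follows. The method names are hypothetical, the inclusive comparison in the score check is an assumption, and the link penalty is modeled here as a maximum link density of 0.25, which is also an assumption rather than something stated in this description.

```ruby
# Strategy 1: the threshold is the larger of 10 and 20% of the best candidate's score.
def sibling_score_threshold(best_score)
  [10, best_score * 0.2].max
end

# A sibling passes if it was scored as a candidate and reaches the threshold.
# (Whether the comparison is inclusive is an assumption.)
def passes_score_check?(sibling_score, best_score)
  !sibling_score.nil? && sibling_score >= sibling_score_threshold(best_score)
end

# Strategy 2: a <p> longer than min_length characters with few links.
# The link penalty is modeled as link density (assumed), i.e. the share of
# the text that sits inside links.
def passes_paragraph_check?(tag, text, link_text_length,
                            min_length: 80, max_link_density: 0.25)
  return false unless tag == "p"
  return false if text.length <= min_length
  link_text_length.to_f / text.length < max_link_density
end

puts sibling_score_threshold(100)                      # 20% of 100
puts passes_score_check?(nil, 100)                     # unscored sibling
puts passes_paragraph_check?("h1", "a" * 100, 0)       # not a <p>
```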

Neither of these strategies worked well for extracting <h1> titles:

1) Failed because titles score poorly due to not containing many commas
2) Failed because titles are within <h1> or <header> elements

However, titles are usually longer than 80 characters and don’t contain
links, so it seems reasonable to modify strategy 2) to allow other
elements, such as <h1> and <header>, to be included as related siblings.

Therefore, this commit introduces a :likely_siblings option that defaults
to the same <p> elements as before but can now be set by the developer
to include other elements such as <h1> and <header>. These are not
added by default to remain in sync with Arc90’s original implementation.
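
As a rough illustration of what the option changes, the sketch below makes the tag list for the length-based check configurable. `likely_sibling?` and `DEFAULT_LIKELY_SIBLINGS` are hypothetical names for this example, not the library's internals, and the link penalty is left out to keep the sketch short.

```ruby
# Default mirrors the behaviour described above: only <p> qualifies.
DEFAULT_LIKELY_SIBLINGS = %w[p].freeze

# Hypothetical helper: a sibling qualifies if its tag is in the configured
# list and its text is longer than min_length characters.
def likely_sibling?(tag, text, likely_siblings: DEFAULT_LIKELY_SIBLINGS, min_length: 80)
  likely_siblings.include?(tag) && text.length > min_length
end

title = "word " * 20 # 100 characters, standing in for a long title

puts likely_sibling?("h1", title)                                     # default list
puts likely_sibling?("h1", title, likely_siblings: %w[p h1 header])   # opted in
```

With the default list an <h1> sibling is skipped regardless of length; it is only considered once the developer opts in.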

cantino · 2023-11-11T01:15:28Z

Thanks @tuzz!

nattsw · 2024-06-09T10:48:45Z

@cantino will you be releasing a new version of this on https://rubygems.org/gems/ruby-readability? 🤗

cantino · 2024-06-11T23:35:01Z

Done, 0.7.1.

…content (#27508)

For Topic Embeds, we would prefer <article> to be the main article in a topic, rather than a table cell <td> with potentially a lot of data. However, in an example URL like here, the table cell (the very large code snippet) is seen as the Topic Embed's article due to the content weight determined by the Readability library we use.

In the newly released 0.7.1 (cantino/ruby-readability#94), the library has a new option to exclude its default <td> element from content weighting. This is more in line with the original library, where they only weighted <p>. So this PR excludes the td, as seen in the tests, so that the actual article is seen as the article.

This PR also adds the details tag to the allow-list.

tuzz mentioned this pull request Nov 9, 2023

Allow passing in an array of elements_to_score and add 'pre' as a default #93

Closed

cantino merged commit 599ed39 into cantino:master Nov 11, 2023

nattsw mentioned this pull request Jun 18, 2024

FIX: Disallow table cells to be weighted so actual articles can be main content discourse/discourse#27508

Merged
