Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow passing in an array of elements_to_score and add 'pre' as a default #94

Merged
merged 2 commits into from
Nov 11, 2023
Merged

Allow passing in an array of elements_to_score and add 'pre' as a default #94

merged 2 commits into from
Nov 11, 2023

Conversation

tuzz
Copy link
Contributor

@tuzz tuzz commented Nov 9, 2023

Allow passing in an array of elements_to_score and add 'pre' as a default

We were experiencing a problem where the h1 text was not being included
in the Readability#content. Here is an example that demonstrates the problem:

<article>
  <header>
    <h1>Title</h1>
  </header>
  <section>
    <p>Paragraph</p>
  </section>
</article>

Previously, the code would add the <p>, <section> and <article> elements
as @candidates because it adds the parent and grand parent of every <p>. It
would not add the <header> element as a candidate.

Then, the best_candidate with the highest score is the <section> element.
The code then tries to add related siblings in #get_article but it wasn't
adding the <header> element because it wasn't in the list of candidates.

We can solve this problem by adding <h1> to the list of elements to score
which will then ensure that <header> parent is included in the
candidates and can be added as a related sibling.

This commit also adds <pre> to the list of default nodes to score
because it is included in arc90's original code here:

https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749

I'm not sure why this was omitted. Perhaps remove remove code blocks?
Furthermore, this pull request adds a second commit that attempts to solve a follow-on problem:

The code had two strategies for determining whether to include siblings
in the output after determining the best candidate based on score:

  1. It checked if the sibling is a candidate that scored above a threshold
    which is the maximum of 10 and 0.2 of the best_candidate’s score.

  2. It checked if the sibling was a paragraph that was longer than 80
    characters with a penalty given for each link within the paragraph.

Neither of these strategies worked well for extracting <h1> titles:

  1. Failed because titles score poorly due to not containing many commas
  2. Failed because titles are within <h1> or <header> elements

However, titles are usually longer than 80 characters and don’t contain
links so it seems reasonable to modify strategy 2) to allow for other
elements, such as <h1> and <header> to be included as related siblings.

Therefore, this commit introduces a :likely_silings option that defaults
to the same <p> elements as before but can now be set by the developer
to include other elements such as <h1> and <header>. These are not
added by default to remain in sync with Arc90’s original implementation.

…ault

We were experiencing a problem where the h1 text was not being included
in the Readability#content. Here is an example that demonstrates the problem:

```
<article>
  <header>
    <h1>Title</h1>
  </header>
  <section>
    <p>Paragraph</p>
  </section>
</article>
```

Previously, the code would add the <p>, <section> and <article> elements
as @candidates because it adds the parent and grand parent of every <p>. It
would not add the <header> element as a candidate.

Then, the best_candidate with the highest score is the <section> element.
The code then tries to add related siblings in #get_article but it wasn't
adding the <header> element because it wasn't in the list of candidates.

We can solve this problem by adding <h1> to the list of elements to score
which will then ensure that <header> parent is included in the
candidates and can be added as a related sibling.

This commit also adds <pre> to the list of default nodes to score
because it is included in arc90's original code here:

https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749

I'm not sure why this was omitted.
…option

The code had two strategies for determining whether to include siblings
in the output after determining the best candidate based on score:

1) It checked if the sibling is a candidate that scored above a threshold
which is the maximum of 10 and 0.2 of the best_candidate’s score.

2) It checked if the sibling was a paragraph that was longer than 80
characters with a penalty given for each link within the paragraph.

Neither of these strategies worked well for extracting <h1> titles:

1) Failed because titles score poorly due to not containing many commas
2) Failed because titles are within <h1> or <header> elements

However, titles are usually longer than 80 characters and don’t contain
links so it seems reasonable to modify strategy 2) to allow for other
elements, such as <h1> and <header> to be included as related siblings.

Therefore, this commit introduces a :likely_silings option that defaults
to the same <p> elements as before but can now be set by the developer
to include other elements such as <h1> and <header>. These are not
added by default to remain in sync with Arc90’s original implementation.
@cantino cantino merged commit 599ed39 into cantino:master Nov 11, 2023
@cantino
Copy link
Owner

cantino commented Nov 11, 2023

Thanks @tuzz!

@nattsw
Copy link

nattsw commented Jun 9, 2024

@cantino will you be releasing a new version of this on https://rubygems.org/gems/ruby-readability? 🤗

@cantino
Copy link
Owner

cantino commented Jun 11, 2024

Done, 0.7.1.

nattsw added a commit to discourse/discourse that referenced this pull request Jun 19, 2024
…content (#27508)

For Topic Embeds, we would prefer <article> to be the main article in a topic, rather than a table cell <td> with potentially a lot of data. However, in an example URL like here, the table cell (the very large code snippet) is seen as the Topic Embed's article due to the determined content weight by the Readability library we use.

In the newly released 0.7.1 cantino/ruby-readability#94, the library has a new option to exclude the library's default <td> element into content weighting. This is more in line with the original library where they only weighted <p>. So this PR excludes the td, as seen in the tests, to allow the actual article to be seen as the article. This PR also adds the details tag into the allow-list.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants