Readability's title #64

Open
tybenz opened this issue Mar 4, 2014 · 5 comments

Comments

@tybenz

tybenz commented Mar 4, 2014

Readability pulls its article title from the title tag, right? More often than not, though, the title tag contains a lot more than just the title of the article. It usually includes the name of the site itself and sometimes a category.

I know the original readability script just grabbed the title, but I'm wondering if this version of the script can be modified to grab the actual title of the article from the markup. It seems as though the scoring system is set up to exclude the header tag that contains the article title.

Example:

<article>
  <div class="article-title">
    <h1>Article title</h1>
  </div>
  <div class="article-content">
    <p>
      Claritatem insitam; est usus legentis in iis qui facit eorum claritatem.
      Investigationes demonstraverunt lectores legere me lius quod ii legunt
      saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem
      consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc
      putamus parum claram, anteposuerit litterarum formas humanitatis per seacula
      quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur
      parum clari, fiant sollemnes in futurum.
    </p>
    <p>
      Nunc varius risus quis nulla. Vivamus vel magna. Ut rutrum. Aenean
      dignissim, leo quis faucibus semper, massa est faucibus massa, sit amet
      pharetra arcu nunc et sem. Aliquam tempor. Nam lobortis sem non urna.
      Pellentesque et urna sit amet leo accumsan volutpat. Nam molestie lobortis
      lorem. Quisque eu nulla. Donec id orci in ligula dapibus egestas. Donec sed
      velit ac lectus mattis sagittis.
    </p>
  </div>
</article>

In the above example, readability will always grab the content from .article-content and not the <article> tag itself. What can I do to modify the script to grab the whole article, title and all?
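
To make the title-tag problem described above concrete, here is a minimal sketch of one possible heuristic. `guess_article_title` and `TITLE_SEPARATORS` are hypothetical names for illustration, not part of ruby-readability, and the separator list is an assumption. The idea is to prefer a heading whose text also appears in the title tag, and otherwise to fall back to the longest separator-delimited segment.

```ruby
# Hypothetical helper for illustration only; not part of ruby-readability.
# Assumed separators that sites use to join an article title with a
# site name or category, e.g. "Article title | Tech | Example Site".
TITLE_SEPARATORS = /\s+[|\-]\s+/

def guess_article_title(title_tag_text, heading_texts = [])
  title = title_tag_text.to_s.strip

  # Prefer a heading (for example an <h1>) whose text also occurs in <title>.
  match = heading_texts.map(&:strip).reject(&:empty?).find { |h| title.include?(h) }
  return match if match

  # Otherwise fall back to the longest separator-delimited segment.
  title.split(TITLE_SEPARATORS).map(&:strip).max_by(&:length) || title
end

guess_article_title("Article title | Tech | Example Site", ["Article title"])
# => "Article title"
```

The fallback is only a guess: a site whose name is longer than the article title would defeat it, which is why matching against the headings in the markup comes first.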

@cantino
Owner

cantino commented Mar 5, 2014

Hey @tybenz! Interesting idea. Do you want to work on a pull request for that?

@tybenz
Author

tybenz commented Mar 5, 2014

Yeah. I'd love to. I don't know enough about the scoring algorithm though. Wondering if you had any ideas on what a good start might be.

@cantino
Owner

cantino commented Mar 5, 2014

No problem. I'd try to write a failing spec, then I'd take a look at score_node, class_weight, and REGEXES and see if something similar could be written to estimate which node is the title.
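
As a rough illustration of what "something similar" for titles might look like, here is a self-contained sketch. Everything in it is an assumption: `TITLE_REGEXES`, `TAG_WEIGHTS`, `TitleCandidate`, `title_weight`, and `best_title` are hypothetical names, the patterns and weights are made up, and it does not reproduce the library's actual `score_node`, `class_weight`, or `REGEXES`. The `klass` and `id` fields stand for the element's own attributes or those of a wrapper, such as `.article-title` in the example earlier in this thread.

```ruby
# Hypothetical sketch; names, patterns, and weights are illustrative
# assumptions, not ruby-readability's API.
TITLE_REGEXES = {
  positive: /title|headline|heading/i,
  negative: /site|nav|footer|sidebar|comment|related|widget/i
}.freeze

# Assumed base weights: higher-level headings are likelier to be the title.
TAG_WEIGHTS = { "h1" => 30, "h2" => 15, "h3" => 5 }.freeze

# Minimal stand-in for a parsed element.
TitleCandidate = Struct.new(:tag, :klass, :id, :text)

def title_weight(node)
  weight = TAG_WEIGHTS.fetch(node.tag, 0)
  [node.klass, node.id].each do |attr|
    next if attr.nil? || attr.empty?
    weight += 25 if attr =~ TITLE_REGEXES[:positive]
    weight -= 25 if attr =~ TITLE_REGEXES[:negative]
  end
  weight
end

# Pick the highest-weighted candidate that actually has text.
def best_title(nodes)
  nodes.reject { |n| n.text.to_s.strip.empty? }.max_by { |n| title_weight(n) }
end

nodes = [
  TitleCandidate.new("h1", "site-title", nil, "Example Site"),    # 30 + 25 - 25 = 30
  TitleCandidate.new("h1", "article-title", nil, "Article title") # 30 + 25 = 55
]
best_title(nodes).text # => "Article title"
```

A failing spec could assert the same thing against real markup: given a page with both a site heading and an article heading, the extracted title should be the article's.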

@tybenz
Author

tybenz commented Mar 6, 2014

Also, I want to get something straight. Is it true that you only ever score p tags, td tags, and their parents and grandparents?

https://github.com/cantino/ruby-readability/blob/master/lib/readability.rb#L270-L271

Am I missing something?

@tuzz
Contributor

tuzz commented Nov 9, 2023

Sorry to necro this issue. Yes, that's right @tybenz, it only scores <p>, <td> and their parents and grandparents.

Today I opened a pull request to allow you to specify other nodes to score, such as <h1> elements that might be nested inside a <header> element which would not be included in the list of candidates. See #93
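
To illustrate why the heading gets left out, here is a simplified, self-contained sketch of scoring only certain tags and crediting their parents and grandparents. It assumes the common Readability convention that a paragraph's score goes in full to its parent and at half weight to its grandparent, and the per-paragraph formula is also an assumption. `SketchNode`, `candidate_scores`, and the `scored_tags` parameter are hypothetical names and do not reflect the actual option added in #93.

```ruby
# Minimal stand-in for a DOM node; a real implementation would walk a
# parsed HTML tree instead.
SketchNode = Struct.new(:name, :parent, :text)

SCORED_TAGS = %w[p td].freeze

# Returns a hash of candidate name => accumulated score.
def candidate_scores(nodes, scored_tags = SCORED_TAGS)
  scores = Hash.new(0.0)
  nodes.each do |node|
    next unless scored_tags.include?(node.name)

    # Assumed score: 1 base point, 1 per comma-separated chunk,
    # plus up to 3 points for length (1 per 100 characters).
    score = 1 + node.text.split(",").length + [node.text.length / 100, 3].min

    parent = node.parent
    next unless parent
    scores[parent.name] += score                        # full credit
    scores[parent.parent.name] += score / 2.0 if parent.parent # half credit
  end
  scores
end

article   = SketchNode.new("article", nil, "")
title_div = SketchNode.new("div.article-title", article, "")
h1        = SketchNode.new("h1", title_div, "Article title")
content   = SketchNode.new("div.article-content", article, "")
p1        = SketchNode.new("p", content, "One, two, three.")
p2        = SketchNode.new("p", content, "Four, five.")

candidate_scores([h1, p1, p2])
# => {"div.article-content"=>7.0, "article"=>3.5}
candidate_scores([h1, p1, p2], %w[p td h1])
# => {"div.article-title"=>2.0, "article"=>4.5, "div.article-content"=>7.0}
```

With only p and td scored, the wrapper around the heading never becomes a candidate, and the article element receives only half credit, so the content div wins. Adding h1 to the scored tags makes the heading's wrapper a candidate and raises the article element's score.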
