H1 gets lost #19

louismullie · 2012-02-25T20:21:40Z

I have been experimenting with the gem to retrieve content from Wikipedia pages, but it seems that the H1 tags get lost during the process of text extraction:

require 'open-uri'
require 'readability'
source = URI.open('http://en.wikipedia.org/wiki/Frimley_Green_Windmill').read
puts Readability::Document.new(source, tags: ['h1', 'p', 'div']).content

Output:

<div><div>
<p>Frimley Green Windmill is a Grade II listed[1]tower mill at Frimley Green, Surrey, England which has been converted to residential use.</p>
 [edit] History 
<p>Frimley Green Windmill was first mentioned in 1784 in the ownership of a Mr Terry. It passed to Thomas Lilley in 1792 and then William Collins in 1801. In 1803, the mill passed into the ownership of the Royal Military College, Sandhurst, remaining in the hands of the military until at least 1832 and probably much later than that. The mill was disused by 1870, and the derelict shell was converted to residential use in 1914. [2]</p>
 [edit] Description 
<div>For an explanation of the various pieces of machinery, see Mill machinery.</div>
<p>Frimley Green Windmill is a four storey brick tower mill. Little is known of the mill, although it had at least one pair of Spring or Patent sails.[2]</p>
 [edit] Millers 
 George Marshall 1792
John Banks 1801
 <p>Reference for above:-[2]</p>
 [edit] External links 
 [edit] References 

</div></div>

The output is missing the only h1 tag on the page:

<h1 id="firstHeading" class="firstHeading">Frimley Green Windmill</h1>

I have experienced the same quirk with all Wikipedia pages. Any idea what could be causing this?

louismullie · 2012-02-25T21:57:22Z

Similar behaviour shows up with the following HTML (from http://www.economist.com/node/21548244):

<h2 class="fly-title">Campaign finance</h2>
<h3 class="headline">The hands that prod, the wallets that feed</h3>
<h1 class="rubric">Super PACs are changing the face of American politics. </h1>

None of the H1, H2, or H3 tags are retrieved, even when I specify them in the :tags option.

cantino · 2012-02-26T22:15:34Z

What does the JS version of Readability do on those pages?

louismullie · 2012-02-27T00:24:57Z

If you mean the Readability API, it parses both pages correctly: http://www.readability.com/articles/urlh3i3g, http://www.readability.com/articles/l2exnq9u.

louismullie · 2012-03-13T18:09:54Z

If you can point me in the right direction in the code, I can put together a patch and send you a pull request.

ghost · 2012-03-14T05:24:05Z

They must have revised the Readability code since I last ported it. You'll need to walk through the JavaScript and compare it to what the Ruby is doing. I'm not actively using ruby-readability in any current projects, so I haven't had time to do this myself. It'd be excellent if you want to give it a shot.

louismullie · 2012-03-15T04:56:57Z

Alrighty, I'll see what I can do when I have some time.

mraaroncruz · 2012-08-09T10:02:45Z

This seems to have more to do with where ruby-readability decides the content of the page lies than with which tags it accepts.
Try:

source = URI.open('http://en.wikipedia.org/wiki/Frimley_Green_Windmill').read
puts Readability::Document.new(source, tags: ['h1', 'h2', 'p', 'div']).content # added 'h2'

and you will see the h2s from that page.
I haven't dug into the source enough to see why, but it doesn't seem to be a headline issue at least. The markup on the economist page doesn't seem super helpful to a generic library like this. I wonder how they do it now (where they are catching these)...

gioele · 2013-03-29T23:22:32Z

The problem appears when h1 elements sit outside the best candidate. Here is an example:

<div id="container">
    <div id="article">
        <h1>Main title</h1>

        <div id="content">
           <h2>Section title</h2>
           <p>content</p>
           <p>content</p>

           <h2>Section title</h2>
           <p>content</p>
        </div>
    </div>
</div>

The #content element will always have a better score than #article because it always has a higher link density (same number of links, less content). The h1 in #article will thus never be included in the result. This confirms the idea of @pferdefleisch.

A possible solution may be to increase the score of an element if it contains many non-excluded elements. This will increase the score of the #article element because it will include strictly more accepted tags than #content.
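To make the proposal concrete, here is a minimal sketch in plain Ruby, not a patch against ruby-readability. Everything in it is an assumption made for illustration: the `Node` struct, the `base_scores` and `accepted_tag_bonus` helpers, the `weight` parameter, and the simplified base score (each paragraph adds its text length to its parent and half of that to its grandparent). The gem's real scoring takes more factors into account.

```ruby
# Hypothetical tree model for illustration; not ruby-readability's internals.
Node = Struct.new(:name, :id, :text, :children) do
  # All nodes below this one, depth first.
  def descendants
    children.flat_map { |child| [child] + child.descendants }
  end
end

def node(name, id: nil, text: '', children: [])
  Node.new(name, id, text, children)
end

# Tags the caller asked to keep (the :tags option in the thread).
ACCEPTED_TAGS = %w[h1 h2 p div].freeze

# Simplified base score (an assumption): every <p> adds its text length
# to its parent and half of that to its grandparent.
def base_scores(root)
  scores = Hash.new(0.0)
  walk = lambda do |current, parent, grandparent|
    if current.name == 'p'
      scores[parent.id] += current.text.length if parent
      scores[grandparent.id] += current.text.length / 2.0 if grandparent
    end
    current.children.each { |child| walk.call(child, current, parent) }
  end
  walk.call(root, nil, nil)
  scores
end

# Proposed bonus: reward a candidate for each accepted tag it contains.
def accepted_tag_bonus(candidate, weight)
  candidate.descendants.count { |d| ACCEPTED_TAGS.include?(d.name) } * weight
end

# Pick the element with an id that has the highest total score.
def best_candidate(root, weight: 0.0)
  scores = base_scores(root)
  ([root] + root.descendants)
    .select(&:id)
    .max_by { |n| scores[n.id] + accepted_tag_bonus(n, weight) }
end

# The example markup from this comment, rebuilt as a tree.
para = -> { node('p', text: 'content') }
content = node('div', id: 'content', children: [
  node('h2', text: 'Section title'), para.call, para.call,
  node('h2', text: 'Section title'), para.call
])
article = node('div', id: 'article', children: [node('h1', text: 'Main title'), content])
page = node('div', id: 'container', children: [article])

puts best_candidate(page).id              # prints "content": the h1 is left out
puts best_candidate(page, weight: 6.0).id # prints "article": the h1 is kept
```

In this toy tree the weight matters: a small bonus still leaves #content on top, while a large one lets #container win because it contains even more accepted tags. Any real patch would need to tune this against the existing specs.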

cantino · 2013-03-30T01:31:21Z

That's interesting. If you want to propose a pull request, that seems like a reasonable solution unless it breaks a lot of specs/behaviors.
