H1 gets lost #19
Comments
A similar behaviour can be experienced with the following HTML (http://www.economist.com/node/21548244):
<h2 class="fly-title">Campaign finance</h2>
<h3 class="headline">The hands that prod, the wallets that feed</h3>
<h1 class="rubric">Super PACs are changing the face of American politics. </h1>
None of the H1, H2, H3 tags get retrieved even when I specify them in the
What does the JS version of Readability do on those pages?
If that's what you mean, the Readability API parses both pages correctly: http://www.readability.com/articles/urlh3i3g, http://www.readability.com/articles/l2exnq9u.
If you can point me in the right direction in the code, I can make a patch and send you a pull request.
They must have revised the Readability code since I last ported it. You'll need to walk through the JavaScript and compare it to what the Ruby is doing. I'm not actively using ruby-readability in any current projects, so I haven't had time to do this myself. It'd be excellent if you want to give it a shot.
Alrighty, I'll see what I can do when I have some time.
This seems to have more to do with where
require 'open-uri'
require 'readability'
source = URI.open('http://en.wikipedia.org/wiki/Frimley_Green_Windmill').read
puts Readability::Document.new(source, tags: ['h1', 'h2', 'p', 'div']).content # added 'h2' and you will see the
The problem appears when
<div id="container">
  <div id="article">
    <h1>Main title</h1>
    <div id="content">
      <h2>Section title</h2>
      <p>content</p>
      <p>content</p>
      <h2>Section title</h2>
      <p>content</p>
    </div>
  </div>
</div>
The
A possible solution may be to increase the score of an element if it contains many non-excluded elements. This will increase the score of the
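A rough sketch of that idea, for illustration only. It does not use the gem's internals: `Node`, `ALLOWED_TAGS`, `PARAGRAPH_SCORE` and `BONUS_PER_ELEMENT` are all hypothetical names and values, and the base scoring is a simplified stand-in for the Readability heuristic (each paragraph credits its parent fully and its grandparent by half). With these assumed weights, the bonus for containing many non-excluded elements is enough to move the top candidate from `#content` up to `#article`, which also holds the `h1`.

```ruby
# Hypothetical, simplified DOM node; not the gem's internal representation.
Node = Struct.new(:tag, :id, :children) do
  # All descendant element nodes, depth first.
  def descendants
    children.flat_map { |child| [child] + child.descendants }
  end
end

ALLOWED_TAGS = %w[h1 h2 p].freeze # assumed tags the caller wants to keep
PARAGRAPH_SCORE = 1.0             # assumed flat score per <p>
BONUS_PER_ELEMENT = 2.0           # assumed weight, picked for illustration

# Simplified base score: each <p> credits its parent fully
# and its grandparent by half.
def base_scores(root)
  scores = Hash.new(0.0).compare_by_identity
  walk = lambda do |node, parent, grandparent|
    if node.tag == 'p'
      scores[parent] += PARAGRAPH_SCORE if parent
      scores[grandparent] += PARAGRAPH_SCORE / 2 if grandparent
    end
    node.children.each { |child| walk.call(child, node, parent) }
  end
  walk.call(root, nil, nil)
  scores
end

# Proposed tweak: add a bonus for every non-excluded descendant of a <div>.
def boosted_scores(root)
  base = base_scores(root)
  divs = ([root] + root.descendants).select { |node| node.tag == 'div' }
  divs.to_h do |div|
    kept = div.descendants.count { |node| ALLOWED_TAGS.include?(node.tag) }
    [div.id, base[div] + BONUS_PER_ELEMENT * kept]
  end
end

# Small helper to build the example tree from the markup above.
def el(tag, id = nil, children = [])
  Node.new(tag, id, children)
end

content = el('div', 'content', [el('h2'), el('p'), el('p'), el('h2'), el('p')])
article = el('div', 'article', [el('h1'), content])
container = el('div', 'container', [article])

scores = boosted_scores(container)
top = scores.max_by { |_id, score| score }.first
puts top # prints "article" with the assumed weights
```

Whether the wrapper wins depends on the bonus weight: here `#article` only overtakes `#content` because the bonus per element is larger than half the paragraph score total, so a real patch would need tuning against the existing specs.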
That's interesting. If you want to propose a pull request, that seems like a reasonable solution unless it breaks a lot of specs/behaviors.
I have been experimenting with the gem to retrieve content from Wikipedia pages, but it seems that the H1 tags get lost during the process of text extraction:
Output:
This is missing the only h1 tag on the page,
I have experienced the same quirk with all Wikipedia pages. Any idea what could be causing this?