Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help troubleshooting what's stripped out #96

Open
avk opened this issue Jan 10, 2024 · 2 comments
Open

Help troubleshooting what's stripped out #96

avk opened this issue Jan 10, 2024 · 2 comments

Comments

@avk
Copy link

avk commented Jan 10, 2024

Thanks for your work on this neat gem.

Running readability on the HTML from https://100wordstory.org/submit/, I expected more markup to remain than readability leaves intact.

Expected

image

Observed

In the screenshot above, the following content is stripped out:

  1. the red "Submit" heading:
<h1 class="titles">
  <a href="https://100wordstory.org/submit/" rel="bookmark" title="SubmitPermanent Link to ">Submit</a>
</h1>
  1. the red "Submissions are now open through January 9, 2024" and "Submit!" headings and links:
<h2 style="text-align: center;"><a href="https://100wordstory.submittable.com/submit">Submissions are now open through January 9, 2024!</a></h2>
<h2 style="text-align: center;"><a href="https://100wordstory.submittable.com/submit">Submit!</a></h2>

Turning on debug: true doesn't seem to cite why these items are missing:

% readability -d https://100wordstory.org/submit/
/Users/avk/.rvm/gems/ruby-2.7.8@wbm/gems/ruby-readability-0.7.0/bin/readability:31: warning: calling URI.open via Kernel#open is deprecated, call URI.open directly or use URI#open
Removing unlikely candidate - magnific_popup-css
Removing unlikely candidate - nav superfishmenu-100-word-story-menu
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-73menu-item-73
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page current-menu-item page_item page-item-6 current_page_item menu-item-72menu-item-72
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-83menu-item-83
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-189menu-item-189
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-70menu-item-70
Removing unlikely candidate - header
Removing unlikely candidate - comments
Removing unlikely candidate - commentlist clearfix
Removing unlikely candidate - comment even thread-even depth-1 parentcomment-65
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor odd alt depth-2comment-66
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor even thread-odd thread-alt depth-1comment-57
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment odd alt thread-even depth-1comment-56
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment even thread-odd thread-alt depth-1comment-52
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - sidebar-wrapper
Removing unlikely candidate - sidebar
Removing unlikely candidate - sidebar-box widget_blockblock-3
Removing unlikely candidate - widget_text sidebar-box widget_custom_htmlcustom_html-2
Removing unlikely candidate - sidebar-box widget_texttext-3
Removing unlikely candidate - sidebar-box widget_texttext-4
Removing unlikely candidate - sidebar-box widget_texttext-7
Removing unlikely candidate - sidebar-box widget_linkslinkcat-10
Removing unlikely candidate - footer
Altering div(#pages.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Top 5 candidates:
Candidate div#.post-wrapper with score 51.935052531041066
Candidate div#left-div. with score 16.71186440677966
Best candidate div#.post-wrapper with score 51.935052531041066
Conditionally cleaned div#.addtoany_share_save_container addtoany_content addtoany_content_bottom with weight 25 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.a2a_kit a2a_kit_size_24 addtoany_list with weight 0 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.recentposts with weight 25 and content score 0 because it has too short a content length without a single image.
<div><div>
					
					
					
					<p>100 words for your story … no more or no less. Tell a story, pen a slice of your memoir, or try your hand at an essay.</p>
<p>You get 100 words—exactly 100 words—which is both the pain and the pleasure here. It’s short, you tell yourself. You could write 100 words at a bus stop, on your lunch break, in your sleep. But with 100 words you must tell the whole story in its entirety, so it holds together like a perfect little doll house. (Your title is not part of the 100 words.)</p>
<p>Please include a short bio (25 words, max!) with your submission. Also, did we say exactly 100 words? We weren’t kidding! We count words according to Microsoft Word’s word-count tally. Also, make friends with your spell-check, or have a friend proofread your story.</p>
<p>We currently charge a $2 submission fee, the minimum in order to cover the costs of the submission system.</p>
<p> </p>
<p> </p>
					
										
											
						
						
									</div></div>

Any ideas on how to broaden or include this content?

@avk
Copy link
Author

avk commented Jan 10, 2024

I think I've traced this particular effect to this part of #sanitize:

      node.css("h1, h2, h3, h4, h5, h6").each do |header|
        header.remove if class_weight(header) < 0 || get_link_density(header) > 0.33
      end

What's the thinking behind link density, especially applied to headings? Is there a way to customize or tune this?

@cantino
Copy link
Owner

cantino commented Apr 29, 2024

This is the same code as was in the original readability.js from which this was ported, I think. You could parameterize it if you want to make it more flexible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants