Clean and normalize HTML while preserving embedded content (e.g. Twitter, Instagram).
Install the library with pip:
pip install clear-html
Example usage with lxml:
from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html
html = """
<div style="color:blue" id="main_content">
Some text to be
<div>cleaned up!</div>
</div>
"""
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)
Example usage with Parsel:
from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html
selector = Selector(text="""<html>
<body>
<h1>Hello!</h1>
<div style="color:blue" id="main_content">
Some text to be
<div>cleaned up!</div>
</div>
</body>
</html>""")
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)
Both approaches above print the following:
<article>
<p>Some text to be</p>
<p>cleaned up!</p>
</article>
Other interesting functions:
cleaned_node_to_text
: convert the cleaned node to plain text
formatted_text.clean_doc
: low-level method to control more aspects of the cleanup