Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conversion for HTML to markdown #932

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

hrodz
Copy link

@hrodz hrodz commented Aug 20, 2023

Description:

Related issue (if applicable): #527

I use Mailrise to get email notifications for various services. The HTML to Text conversion works great for regular notifications but it will be nice to be able to receive and click on links. This is my attempt to implement a HTML to Markdown converter to be able to have links on the notifications.

This code is heavily base on the current HTML to Text converter. I'm not a developer but I'm willing to learn and work on the code to make this feature available. Let me know if this is something we can work on. I have been testing with Gotify notifications and it seems to work fine.

The only thing I have not been able to get working are line breaks. I have try with different characters (double space, \,  ) without success on Gotify. Thanks.

Checklist

  • The code change is tested and works locally.
  • There is no commented out code in this PR.
  • No lint errors (use flake8)
  • 100% test coverage

@caronc
Copy link
Owner

caronc commented Aug 21, 2023

Thank you for the PR, I'll try to review it soon

@codecov-commenter
Copy link

codecov-commenter commented Aug 21, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Files Coverage Δ
apprise/conversion.py 99.30% <98.71%> (-0.70%) ⬇️

... and 11 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

@caronc
Copy link
Owner

caronc commented Aug 22, 2023

@hrodz : this is a pretty massive undertaking, there is a lot of missing code coverage. But as you get your code more stable and to your liking, I can try and pitch in and help here.

On a second note:
Parsing HTML is fine if structured okay, but i'll be interested in how you handle things like...

<html
<!--- no closing tag above >
<body>test
<b>test</b>
</body>

Or tags that don't align:

<p>A story about something that has no closing paragraph tag.

Another scenario might be deep nesting with missing tags, etc.

Just to be clear, I'm not saying your solution has to be perfect 🙂 . But there are so many what-ifs when it comes to HTML. I'm actually really excited to see how you choose to solve these types of things. I'll be looking forward to your next commit and to see how this pull request evolves. 🚀

@hrodz
Copy link
Author

hrodz commented Aug 23, 2023

@caronc Thanks for taking the time to look at this and provide feedback. I will work on the code coverage. Just wanted to know if you have any thoughts about this before committing more time to it.

Regarding working with invalid HTML, the two example provided seem to be already covered by the current html_to_text conversion, this code will produce similar results. I understand that I may have to do more testing on the parsing of the invalid tags to be converted to markdown, but if possible I would like more insight on the extend of this validation. Would it be acceptable to have a similar handling of invalid HTML as the html_to_text conversion? Or should we expect it to handle more cases? Thanks.

@caronc
Copy link
Owner

caronc commented Oct 6, 2023

I reviewed your code; i had to make some changes, but i think i got the test coverage done.

The only thing is what you're doing is very tricky... it's hard to track indentation; so some formatting will get lost (hence if a code block should have been indented over because we're currently in a list (or <li> object for example....

But overall it seems to work. I updated the test cases a bit too... the previous code was escaping characters that shouldn't have been (like periods . for example) where in the markdown as \\.

Have a look at what i've done let me know your thoughts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants