HTML entities not decoded #30
Comments
+1 ;)
Also use JSoup for more of the HTML cleaning.
I've submitted a fix for this. When the full CleanEval corpus is re-run, I'd suggest having it generate the minimal HTML tags, since the tags are included in the gold standard.
I'm going to revise my opinion about the "correct approach" and turn it into a question. The gold standard doesn't entity-encode the less-than sign. It's pretty clear that the text mode should be fully decoded, but should the minimal HTML mode match the gold standard or produce legal XML? Is a third mode needed?
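To make the question concrete, here is a rough Python sketch of the two behaviours being weighed. The function names `to_text` and `to_xml_safe` are hypothetical and are not part of this project (whose fix uses JSoup); the sketch only handles text content, not tags, and relies on the standard library's `html.unescape` and `xml.sax.saxutils.escape`.

```python
import html
from xml.sax.saxutils import escape


def to_text(fragment: str) -> str:
    """Text mode (hypothetical name): decode every character reference."""
    return html.unescape(fragment)


def to_xml_safe(fragment: str) -> str:
    """Legal-XML variant (hypothetical name) for text content only.

    Decode everything first, then re-escape just the characters that
    XML reserves in text nodes: &, < and >.
    """
    return escape(html.unescape(fragment))


if __name__ == "__main__":
    sample = "1 &lt; 2 &amp; caf&eacute;"
    print(to_text(sample))      # fully decoded text
    print(to_xml_safe(sample))  # reserved characters stay escaped
```

Matching the gold standard would correspond to `to_text`, while producing legal XML would correspond to something like `to_xml_safe`; whether both are needed is exactly the open question.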
Comparing these two files:
It appears that the Python program is dropping certain entities but leaving others, such as &lt;, undecoded. The gold standard doesn't include any HTML entities, naturally. I'd argue that the correct approach is to decode all HTML entities and convert them to their equivalent Unicode characters, even though this is different from what the original Python program did.
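As a minimal illustration of "decode all HTML entities", the sketch below uses Python's standard `html.unescape`. This is not the original Python program or the submitted fix; `decode_entities` is a hypothetical helper name and the sample string is made up.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named and numeric character references with Unicode characters."""
    return html.unescape(text)


if __name__ == "__main__":
    # Made-up sample mixing named and numeric references.
    sample = "Fish &amp; chips &lt; &pound;5&nbsp;each, caf&#233;"
    print(decode_entities(sample))
```

Note that `&nbsp;` decodes to U+00A0 (no-break space) rather than being dropped, so a comparison against the gold standard may still need whitespace normalization.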