html_text2 deletes some spaces between words #372
Comments
Interestingly, removing the first empty paragraph allows a correct conversion:
The attributes don't seem to be necessary to illustrate the problem, leading to this similar reprex:

library(rvest)
some_html <- "<p></p><span>The sentence starts this way,</span><span> </span><span>then</span><span> </span><span>spaces</span><span> </span><span>disappear</span>"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"

Created on 2023-08-09 with reprex v2.0.2

And we can make it much easier to see what's going on by adding some newlines:

library(rvest)
some_html <- "
<p></p>
<span>The sentence starts this way,</span>
<span> </span>
<span>then</span>
<span> </span>
<span>spaces</span>
<span> </span>
<span>disappear</span>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"

Created on 2023-08-09 with reprex v2.0.2

The key problem appears to be the early closing of the <p> element. When the spans are placed inside the paragraph instead, the spaces survive:

library(rvest)
some_html <- "
<p>
<span>The sentence starts this way,</span>
<span> </span>
<span>then</span>
<span> </span>
<span>spaces</span>
<span> </span>
<span>disappear</span>
</p>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way, then spaces disappear"

Created on 2023-08-09 with reprex v2.0.2

Looking at the code, it seems like the problem probably arises if you have inline elements following a block element. Fixing that will require some careful thought.
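Until that is fixed, a possible user-side stopgap follows from the observation earlier in this thread that removing the empty paragraph allows a correct conversion. The sketch below is only an illustration: `drop_empty_paragraphs()` is a hypothetical helper, not part of rvest, and it relies on a regular expression, which is a rough heuristic for HTML. The rvest call is guarded so the snippet still runs when the package is not installed.

```r
# Hypothetical helper (not part of rvest): remove <p> elements that contain
# nothing but whitespace. A regular expression is only a rough heuristic for
# HTML, so treat this as a stopgap rather than a fix.
drop_empty_paragraphs <- function(html) {
  gsub("<p(\\s[^>]*)?>\\s*</p>", "", html, perl = TRUE)
}

some_html <- "<p></p><span>The sentence starts this way,</span><span> </span><span>then</span>"
cleaned <- drop_empty_paragraphs(some_html)
print(cleaned)

# Only run the rvest part when the package is available.
if (requireNamespace("rvest", quietly = TRUE)) {
  print(rvest::html_text2(rvest::read_html(cleaned)))
}
```

Note that dropping empty paragraphs also discards any line break they would have produced, so this only suits input where those paragraphs carry no meaning.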
The problem arises if there are inline and block elements mixed on the same level, regardless of which comes first. Then An element could also contain several block elements with text nodes and inline elements in between. In that case, all non-block nodes between two block nodes should be passed together through
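The grouping idea described here (treat each run of consecutive non-block nodes as one unit) can be sketched in base R. This is an illustration under assumptions, not rvest's actual implementation: `group_inline_runs()` is a hypothetical name, and the input is simply a logical vector saying whether each child node is a block element.

```r
# Hypothetical helper, not part of rvest: assign a group id to each child node
# so that consecutive non-block nodes share one group and every block node
# forms a group of its own.
group_inline_runs <- function(is_block) {
  n <- length(is_block)
  if (n == 0) return(integer())
  # A new group starts at the first node, at every block node, and at the
  # first non-block node that follows a block node.
  prev_block <- c(TRUE, is_block[-n])
  starts <- is_block | prev_block
  cumsum(starts)
}

# Children of a made-up element: <p>, <span>, text, <span>, <p>, <span>
is_block <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
groups <- group_inline_runs(is_block)
print(split(seq_along(is_block), groups))
```

Here children 2 to 4 end up in a single group, so their text could be collapsed together and the whitespace-only nodes between words would no longer be trimmed away in isolation.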
In some cases, html_text2 deletes some standard spaces between words.
The reproducible example follows:
The incorrect result is:
"The sentence starts this way,thenspacesdisappear"
html_text() works correctly, but in most cases I do need the power of html_text2 (new lines, etc.).
I'm using rvest_1.0.3 and xml2_1.3.3 in R 4.2.2 (Kubuntu 23.04).
(Note: The original html string comes from a rich text area of a Moodle Database activity, see https://docs.moodle.org/402/en/Database_activity; exported from Moodle as a LibreOffice .ods file)