Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] HTML is modified when saved #593

Open
line-o opened this issue Feb 11, 2023 · 20 comments
Open

[BUG] HTML is modified when saved #593

line-o opened this issue Feb 11, 2023 · 20 comments
Labels

Comments

@line-o
Copy link
Member

line-o commented Feb 11, 2023

Describe the bug

  1. When an html file is opened it is modified before it is displayed.
  2. When an html file is saved it modified before it is saved.

NOTE: This issue is about

  • autoclosing empty tags in HTML that are not allowed to be autoclosing such as <script>
  • swallowing CDATA sections and escaping the enclosed content

Both are likely due to some serialization that takes place.

Expected behavior

A source file is opened and displayed as it is stored within the database.
A source file created or modified in eXide is stored unmodified.

To Reproduce

  1. store a file /db/apps/test/test.html in the database with following contents
    NOTE: you have to use other means than eXide to do that, obviously. Think: WebDav, rest (curl), XML-RPC (xst)
<html>
    <body>
        <main></main>
        <script type="text/javascript"><![CDATA[ console.log(true && 1 < 2) ]]></script>
    </body>
</html>
  1. open the file in eXide and you will get
  • an autoclosed, empty main-element (this is fine for XML but detrimental for HTML)
  • the CDATA section is lost and its contents are now escaped with entities
<html>
    <body>
        <main/>
        <script type="text/javascript"> console.log(true &amp;&amp; 1 &lt; 2) </script>
    </body>
</html>
  1. modify the contents to the original content again and save it
  2. load the contents via WebDav, REST, XML-RPC again (curl http://localhost:8080/exist/rest/db/apps/test/test.html)
  3. the CDATA section was swallowed and its enclosed content escaped with entities
    NOTE: That the file has an XML declaration now might be due to the serialization on retrieval via curl
<?xml version="1.0" encoding="UTF-8"?>
<html>
    <body>
        <main></main>
        <script type="text/javascript"> console.log(true &amp;&amp; 1 &lt; 2) </script>
    </body>
</html>
  1. reloading the file also changes the contents back to what we got at step 2

Integration Test (TODO)

For UI and browser based testing we use cypress.js

// adjust as necessary
describe('The app', function() {
  it('should load', function() {
    // Go to app start page
    cy.visit('/app/index.html')
  })

Context (please always complete the following information):

Build: eXist-6.2.0
Java: 1.8.0_362 (Azul Systems, Inc.)
OS: Mac OS X 12.6.3 (aarch64)
App Version: 3.5.0

Additional context

  • How is eXist-db installed? built from source
  • Any custom changes in e.g. conf.xml? none
@joewiz
Copy link
Member

joewiz commented Feb 12, 2023

What are your eXide > Edit > Preferences > Serialization settings? The default is to apply indentation - for backward compatibility. But you can disable indentation there, for whitespace fidelity.

@line-o
Copy link
Member Author

line-o commented Feb 12, 2023

@joewiz the issue is not indentation. It is autoclosing tags and swallowing CDATA sections.

@line-o
Copy link
Member Author

line-o commented Feb 12, 2023

@joewiz I clarified the issue description

@joewiz
Copy link
Member

joewiz commented Feb 12, 2023

Have you tested other editors? How much of this behavior is eXide specific, versus universal for eXist serializing HTML stored as XML (as is default) instead of as binary (as is possible with effort).

@line-o
Copy link
Member Author

line-o commented Feb 12, 2023

Any other editor I use behaves as expected (vim, vscode, IDEA). None of them autocloses tags nor swallow CDATA. I mean, I was able to create, upload and read the test document. exist-db does not do this either. This is why I opened the issue here.

@line-o
Copy link
Member Author

line-o commented Feb 12, 2023

I should add that I had plenty of time testing this (but could not point my finger on it) and it was reported to me by several other parties. Please also see the linked exist-db issue.
It made testing much more difficult as I had to work around this limitation in eXide in order to narrow down the issue.
This leads to "mysterious behavior" - hard to trace issues - when the output has to be valid HTML5.

@line-o
Copy link
Member Author

line-o commented Feb 12, 2023

To give you one example:

  • you create a well-formed HTML5 document
  • upload it to exist as part of a XAR
  • open the document in a browser
  • notice a typo
  • open the document in eXide
  • fix the typo
  • save the document
  • reload the document in the browser
  • all of a sudden other functionality breaks that you did not touch because eXide modified other parts without you realizing
  • install the XAR at some point and it is "magically" fixed
  • rinse and repeat

@line-o
Copy link
Member Author

line-o commented Feb 12, 2023

The scenario above gets worse when the tainted HTML page gets synced back to disk and committed into a repository.

Also, think about the amount of debugging necessary, if the project in question has an additional layer of indirection. It could be making use of the templating module for instance.

@joewiz
Copy link
Member

joewiz commented Feb 13, 2023

I think there are some things we can do to help the situation, particularly in eXide, with its uniquely customizable facility for setting the serialization parameters used when a document is loaded from the database (i.e., in modules/load.xq).

Specifically, we can detect that the file being loaded has a mime type of text/html and use HTML-specific values on certain serialization parameters, such as method, html-version, and cdata-section-elements. We could either add these as options or make them the default.

Other interfaces, such as REST, WebDAV, and XML-RPC are not so customizable; their parameters are set in conf.xml or webdav.properties. In my tests, opening your test.html document in oXygen via WebDAV or XML-RPC, the <main> element is closed. Even in eXide, not everything is up to the application to control. Whether a closed element is presented as <link/> or <link /> is entirely controlled by eXist.

To demonstrate that eXide can be patched to fix both the auto-closing tags and the CDATA issue, replace https://github.com/eXist-db/eXide/blob/develop/modules/load.xq#L60-L72 with the following:

serialize(
    doc($path),
    map:merge((
        map {
            "indent": $indent,
            "exist:expand-xincludes": $expand-xincludes
        },
        if ($mime eq "text/html") then
            map { 
                "method": "xhtml",
                "html-version": 5.0,
                "cdata-section-elements": xs:QName("script")
            }
        else
            ()
    ))
)

As you'll notice, this patch switches load.xq from using util:declare-option() to fn:serialize(). This is now possible since eXist-db/exist#2394 is closed. We need one more change to make this patch work. Add the following to the query's prolog:

declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";

declare option output:method "text";

A PR adding this to eXide should probably fold this into the serialization preferences. Users may like the settings above, or they may wish to retain the existing approach, i.e., serializing HTML using the XML output method. And support for the cdata-section-elements has use cases outside of HTML serialization and should be exposed.

Also, as mentioned, I think eXide has much finer-grained serialization options than other interfaces. A fuller investigation of these other interfaces' serialization limitations and approaches for giving users finer-grained control may be warranted.

The other interfaces have one advantage: they don't need to be told which elements may have CDATA blocks. I wonder if eXist could add some sort of serialization option that presents CDATA blocks without requiring them to be specified.

@JoernT
Copy link
Member

JoernT commented Feb 13, 2023

just a small addition:

eXide - and probably the other interfaces mentioned by Joe - is also auto-closing web component tags when they are empty. This is a serious problem that leads to documents being run in quirks mode and causing potential errors esp. when those web components are siblings in a document.

Self-closing elements are called 'void elements' nowday in the HTML5 spec - https://html.spec.whatwg.org/multipage/syntax.html#void-elements

@adamretter
Copy link
Member

adamretter commented Feb 13, 2023

I want to focus solely on the storage side of the equation first, as Serialization is a different matter and is configurable by the user to achieve whatever they wish.

There is an important concern here missing from the "Reproducible Example" which will possibly answer the proposed issue.

store a file /db/apps/test/test.html in the database with following contents

How "exactly" is that file being stored into the database? Different tools will do this differently!

Ultimately eXist-db can only store XML or Binary documents (i.e. its storage sub-system has no concept of HTML):

  1. If the HTML document is being stored as XML, then the behaviour you describe has been that way since eXist-db was conceived, and is expected, and therefore this would not be a bug.
  2. If the HTML document is being stored as Binary, and it is being modified in some way by eXist-db itself, then that would be a bug.

@line-o
Copy link
Member Author

line-o commented Feb 13, 2023

The example file is stored as XML. As @joewiz was able to show it is possible to load my example in eXide without the CDATA section to be replaced.
And just by following the steps to reproduce it is obvious eXist-db stored my example as an XML source.
Using other means of storing and retrieval, I used XML-RPC through xst, the document is kept as is.

@line-o
Copy link
Member Author

line-o commented Feb 13, 2023

The real world example is a template that is opened with doc().

@JoernT
Copy link
Member

JoernT commented Feb 13, 2023

@adamretter

If the HTML document is being stored as XML, then the behaviour you describe has been that way since eXist-db was conceived, and is expected, and therefore this would not be a bug.

Sorry - will all due respect - but the fact that it has been like this from the start of eXist-db does not say it is correct these days. And "is expected" by whom? Certainly not by me writing <my-component></my-component> and getting back <my-component/> which might break the parsing in the browser and result in quirks mode. The HTML parser in the browser considers the my-component element to be not closing essentially 'eating' following tags:

<my-component/>
<my-component/>

Here a browser would create a nested structure instead of siblings which completely breaks functionality.

Web Components have not been around in xhtml but have been introduced with HTML5 - unfortunately it has been decided that those cannot be 'void' to use the terms of the WHATWG but need to use a closing tag. Thus a HTML serializer must not modify those elements just because they have no content as an XML serializer would do. If it does it's not in sync with the standard.

@ALL the CDATA section are of a much lesser practical concern IMO as inline script (at least in production) should be forbidden anyway - however breaking it is still an issue dissallowing the quick dev-time hack.

@JoernT
Copy link
Member

JoernT commented Feb 13, 2023

This page is a good reference on void elements (the one allowed to be serialized as empty) and self-closing tags: https://developer.mozilla.org/en-US/docs/Glossary/Void_element

It clearly says:

Self-closing tags (<tag />) do not exist in HTML.

@joewiz
Copy link
Member

joewiz commented Feb 13, 2023

@JoernT @adamretter I believe that my patch to eXide comprehensively fixes both the CDATA and self-closing tags problems with opening XHTML documents in eXide. Opening Juri's XHTML test document (with one added <link> element in eXide yields this result:

<!DOCTYPE html>
<html>
    <head>
        <title>foo</title>
        <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@jinntec/fore@latest/resources/fore.css" />
    </head>
    <body>
        <main></main>
        <script><![CDATA[ console.log(true && 1 < 2) ]]></script>
    </body>
</html>

I believe that this is what he was after, in terms of meeting his expectations for opening HTML documents in eXide. All that's left for a thorough fix in eXide is to implement the user preferences for controlling this behavior.

If total fidelity to an HTML source document that is not well-formed XML is required, then the file needs to be stored and handled as a binary file. eXide should respect those preferences, too, although I haven't tested them. From my perspective, the patch I put together will let users view (* re: editing, see my next post) HTML documents as XHTML, with sane defaults for CDATA section elements. These documents can always be serialized as HTML, with its non-well-formed XML elements (namely, the "void elements" Joern mentioned, e.g., <link> instead of <link />)

eXist's REST, WebDAV, and XML-RPC interfaces, however, remain XML-centric and do not expose any method for XHTML-specific serialization preferences that users might expect or desire. What these interfaces do appear to provide is a feature not exposed to fn:serialize(): automatic preservation of CDATA sections as stored, and faithful retention of these in serialization. I'd like to see that exposed to fn:serialize() as an option. (Those discussion probably belongs in the eXist repo rather than here in the eXide repo.)

@joewiz
Copy link
Member

joewiz commented Feb 13, 2023

As for saving HTML files that contain CDATA blocks, I think the problem @line-o described is inherent to applications that rely on XQuery to parse XML. When eXide stores a file, the eXide client submits a PUT request to eXide's /store endpoint, which reads the payload with request:get-data(). As we know from this function's function documentation, it does the following:

Returns the content of a POST request. If the HTTP Content-Type header in the request identifies it as a binary document, then xs:base64Binary is returned. If its not a binary document, we attempt to parse it as XML and return a document-node(). If its not a binary or XML document, any other data type is returned as an xs:string representation or an empty sequence if there is no data to be read.

So it receives the payload and tries to parse it as XML and return a document node. Effectively, it performs the following:

xquery version "3.1";

``[
<!DOCTYPE html>
<html>
    <head>
        <title>foo</title>
        <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@jinntec/fore@latest/resources/fore.css" />
    </head>
    <body>
        <main></main>
        <script><![CDATA[ console.log(true && 1 < 2) ]]></script>
    </body>
</html>
]``
=> parse-xml()

In both eXist and BaseX, this query returns:

<html>
    <head>
        <title>foo</title>
        <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@jinntec/fore@latest/resources/fore.css"/>
    </head>
    <body>
        <main/>
        <script> console.log(true &amp;&amp; 1 &lt; 2) </script>
    </body>
</html>

In other words, the CDATA is stripped and its contents are escaped. Until XQuery offers a method to parse a document and preserve its CDATA section elements, I don't see a workaround for this behavior available to eXide.

@line-o
Copy link
Member Author

line-o commented Feb 13, 2023

As a band-aid for eXide I propose to store html by other means (like the rest API). At least the XML-RPC endpoints do check for well-formedness and for appropriate permissions as well.

@adamretter
Copy link
Member

Sorry - will all due respect - but the fact that it has been like this from the start of eXist-db does not say it is correct these days.

@JoernT Sorry if you misunderstood my comment. I was not saying that it is "correct these days", rather I was saying that eXist-db only has an XML DOM and a Blob Store. If you want to store HTML and not have it XML'ified to then the only option is to store it as a Blob. There is nothing wrong with storing and retrieving HTML as a Blob in eXist-db and that will ensure the HTML is not modified.

By saying it is not a bug, I am trying to say that eXist-db is not at fault. It is likely the manner of the user or tool that is storing that data into eXist-db that is at fault.

If you want to have eXist-db store HTML natively then you need to add a plethora of new features like we did in FusionDB to introduce a native HTML DOM, native storage for it, and a mapping onto the XML DOM for XPath and XQuery.

@adamretter
Copy link
Member

adamretter commented Feb 14, 2023

eXist's REST, WebDAV, and XML-RPC interfaces

@joewiz Each of these APIs allow you to store either XML or Binary documents. It is the choice of the user or application as to whether they store their HTML as either:

  1. Binary - will preserve the HTML byte for byte as is
  2. XML - will require the HTML to be transformed into XML (or XHTML) either explicitly (or possibly implicitly depending on the facilities offered by that API).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants