Soupault's HTML prettifying doesn't preserve whitespace correctly #46

Open
untitaker opened this issue Jun 13, 2022 · 9 comments

@untitaker

untitaker commented Jun 13, 2022

The following markdown document:

# Welcome to my website.

is converted by pandoc -f markdown -t html -fmarkdown-implicit_figures --no-highlight into:

<h1>Welcome to my website.</h1>

However, after soupault is done with parsing the output, the following HTML is produced:

<h1>
  Welcome to my website.
</h1>

This introduces extra whitespace after the period, which is visible in text selections in Firefox but has no visible effect in Chrome. See also whatwg/html#8003

However, regardless of how browsers handle this, I think soupault should allow me to remove the trailing whitespace, and especially should not add it on its own. Ideally, an HTML5 tokenizer should produce exactly the same tokens before and after soupault has parsed and serialized the document.
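The round-trip expectation above can be checked mechanically. The following is a minimal sketch that tokenizes the two snippets quoted in this issue and compares the results. It uses Python's standard `html.parser`, which is not a spec-compliant HTML5 tokenizer, so treat it as an approximation; the `TokenCollector` and `tokenize` names are made up for this illustration.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Record start tags, end tags, and text exactly as the parser reports them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(html):
    """Return the list of (kind, value) tokens for an HTML string."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


# The two documents quoted in this issue.
pandoc_output = "<h1>Welcome to my website.</h1>"
soupault_output = "<h1>\n  Welcome to my website.\n</h1>"

original_tokens = tokenize(pandoc_output)
prettified_tokens = tokenize(soupault_output)

print(original_tokens)
print(prettified_tokens)
print("tokens preserved:", original_tokens == prettified_tokens)
```

The tag tokens match, but the text token differs: the prettified version carries the inserted line breaks and indentation as part of the text.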

@dmbaturin
Collaborator

dmbaturin commented Jun 13, 2022

This is an interesting issue indeed... Intuitively, <pre> is the only element where leading and trailing whitespace around the element content should be significant, so my opinion is that all browsers should ignore it.

However, I agree that the current "always put tags on separate lines" approach is a bit heavy-handed and often produces a result that is the opposite of pretty. I'd be happy to work with the maintainer of lambdasoup to make it more flexible, but I suppose we'll have to wait for WHATWG's response regarding whitespace significance to know whether the current behavior should still be allowed or not.

Meanwhile, you can disable pretty-printing with pretty_print_html = false under [settings].

@dmbaturin
Collaborator

Correction: with pretty_print_html = false, of course! I edited the original comment to fix that.

I should probably also improve the docs for that section, because right now all those options are lumped together in "Basic configuration", and the commented config sample that contains them is really huge.

@untitaker
Author

I don't think we have to wait to see what the browser vendors and the spec body do with this issue. A functional HTML tokenizer and parser needs to keep the whitespace intact; this is very clear from the WHATWG spec. pretty_print_html = false definitely solves my issue, and I also think it would be a better default.

@egrieco

egrieco commented Jul 13, 2022

Is there actually an extra space character (U+0020), or is the browser rendering the line feed (U+000A)? There is no trailing space character in the above example.

This may actually be a browser issue.
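One way to answer the question is to list the code points around the heading text. This is a small sketch that assumes the prettified output uses LF line endings and two-space indentation, as it appears in the snippet quoted at the top of this issue; `describe_whitespace` is a helper name invented here.

```python
# Text content of the <h1> element before and after prettifying.
# Assumption: LF line endings and two-space indentation, as displayed in the issue.
before = "Welcome to my website."
after = "\n  Welcome to my website.\n"


def describe_whitespace(text):
    """Return (leading, trailing) lists of code points for the whitespace around text."""
    stripped = text.strip()
    start = text.index(stripped) if stripped else len(text)
    leading = text[:start]
    trailing = text[start + len(stripped):]

    def code_points(chars):
        return [f"U+{ord(c):04X}" for c in chars]

    return code_points(leading), code_points(trailing)


print(describe_whitespace(before))
print(describe_whitespace(after))
```

Under those assumptions the only trailing character is a line feed (U+000A), not a space (U+0020). In normal flow, HTML whitespace collapsing treats a line feed much like a space, so the line feed can still show up as a selectable space.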

P.S. If you want specific formatting run a prettifier or a minifier on the code after Soupault generates the site. I was doing this with Zola before I found Soupault and it actually helped me catch a few errors in the framework I was using.

I'd love to see asset pipeline or post-processing support in Soupault, though it's not really that difficult to just do those steps manually or in a shell script.

@dmbaturin
Collaborator

@egrieco Since 4.0.0, you can use the "save" hook to take over the output writing stage. The only shortcoming is that there's no Lua function that would allow you to send a string to an external filter's stdin... however, it's not hard to add; it's just that I haven't had a use case for it yet and no one else has asked me to add it.

If an HTML formatter supports modifying a file in-place, it's a non-issue, of course—you can just run it on the page file after writing it.

I wonder if I should also add a separate "post-write" hook specially for these cases, though.
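Until such a hook exists, the post-processing step discussed above can live in a separate script that runs after the site is built. The sketch below is not part of soupault: `postprocess_site` and `drop_trailing_blank_lines` are hypothetical names, and the transform is only a stand-in for whatever prettifier or minifier you would actually call. Files are read and written as bytes so that the script itself does not alter line endings.

```python
import tempfile
from pathlib import Path


def postprocess_site(build_dir, transform):
    """Apply transform(str) -> str to every .html file under build_dir, in place.

    Returns the list of paths whose content changed.
    """
    changed = []
    for page in sorted(Path(build_dir).rglob("*.html")):
        original = page.read_bytes().decode("utf-8")
        result = transform(original)
        if result != original:
            page.write_bytes(result.encode("utf-8"))
            changed.append(page)
    return changed


def drop_trailing_blank_lines(html):
    """Stand-in transform: collapse trailing newlines at the end of the file to one."""
    return html.rstrip("\n") + "\n"


# Demonstration on a temporary directory instead of a real build directory.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    page.write_bytes(b"<p>hi</p>\n\n\n")
    changed = postprocess_site(tmp, drop_trailing_blank_lines)
    changed_names = [p.name for p in changed]
    final_bytes = page.read_bytes()

print(changed_names, final_bytes)
```

In a real setup you would call `postprocess_site` on the directory soupault writes to, with a transform of your choice, right after the build finishes.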

@egrieco

egrieco commented Jul 14, 2022

Yeah, I hadn't gotten around to looking into whether an "asset pipeline" could be implemented directly within Soupault. This would be useful for generating several sizes of images, and potentially several formats, to use in srcset attributes.

P.S. @dmbaturin Soupault is one of the coolest and most useful pieces of software I've run across in at least a decade. You really saved my students. I've been wondering how I was going to go from basic "intro to web dev" to a static site generator without a lot of needless pain. Almost all of the generators have some major flaw that contributes to severe friction or limitations in what sites can be built.

I cannot thank you enough for Soupault. I have plenty more to say, but don't want to pollute this issue. :)

@dmbaturin
Collaborator

@egrieco Maybe make a separate issue for discussions of post-processing. In fact, I do already have a plugin that handles assets in a non-trivial way: https://github.com/dmbaturin/iproute2-cheatsheet/blob/master/plugins/inline-assets.lua reads asset files and inlines them into the page (CSS and JS as is, images Base64 encoded).

@egrieco

egrieco commented Jul 14, 2022

@dmbaturin Soupault just keeps getting better and better. :)

I haven't been playing with Soupault for even a full day yet. I'm setting up several sites in it now. Let me get a better handle on what it can actually do so I don't file any spurious issues.

In the meantime I sent you an email from my @egx.com address. My profound thanks for building Soupault.

@delan

delan commented Dec 27, 2023

> Intuitively, <pre> is the only element where leading and trailing whitespace around the element content should be significant, so my opinion is that all browsers should ignore it.

This is not really a safe assumption to make because of CSS. I ran into this with a retrocomputing website where I use “older” techniques like building navigation with nav > ul > li { display: inline-block }. Here’s a minimal example:

# soupault.toml
[settings]
  generator_mode = false
  pretty_print_html = false

<!-- site/index.html -->
<!doctype html>
<meta charset="utf-8">
<style>
    nav li {
        display: inline-block;
        outline: 1px solid;
        padding: 0.5em;
    }
</style>
<nav><ul>
    <li>home
    <!-- implied </li> --><li>about
    <!-- implied </li> --><li>projects
    <!-- implied </li> --><li>contact
<!-- implied </li> --></ul></nav>

[Screenshots: the rendered navigation with pretty_print_html = false and with pretty_print_html = true]

It would be good for lambdasoup to prettify HTML in a way that doesn’t affect whitespace between elements, or even without affecting whitespace in text nodes at all (because it changes the DOM), but in the meantime, I’m happy to send a patch to warn about this in the docs and default toml if you like. Thanks for making soupault!
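The DOM change described above can be made visible without a browser. The sketch below looks for whitespace-only text between a closing `</li>` and the next `<li>`, which is the text that renders as a gap between `display: inline-block` items. Two things here are assumptions: the `prettified` string is a hand-written illustration of one-tag-per-line formatting, not actual soupault output, and the markup uses explicit `</li>` tags because Python's `html.parser` does not infer implied end tags.

```python
from html.parser import HTMLParser


class SiblingGapFinder(HTMLParser):
    """Collect whitespace-only text found between </li> and the next <li>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.gaps = []
        self._after_li_end = False
        self._pending = None

    def handle_starttag(self, tag, attrs):
        if tag == "li" and self._after_li_end and self._pending is not None:
            self.gaps.append(self._pending)
        self._after_li_end = False
        self._pending = None

    def handle_endtag(self, tag):
        self._after_li_end = tag == "li"
        self._pending = None

    def handle_data(self, data):
        if self._after_li_end and data.strip() == "":
            self._pending = data
        else:
            self._after_li_end = False
            self._pending = None


def sibling_gaps(html):
    """Return the whitespace strings separating adjacent <li> elements."""
    finder = SiblingGapFinder()
    finder.feed(html)
    finder.close()
    return finder.gaps


compact = "<nav><ul><li>home</li><li>about</li><li>projects</li></ul></nav>"
# Hand-written illustration of "every tag on its own line" formatting.
prettified = (
    "<nav>\n"
    " <ul>\n"
    "  <li>\n   home\n  </li>\n"
    "  <li>\n   about\n  </li>\n"
    "  <li>\n   projects\n  </li>\n"
    " </ul>\n"
    "</nav>\n"
)

print(sibling_gaps(compact))
print(sibling_gaps(prettified))
```

The compact markup has no text nodes between the list items, while the reformatted markup gains one between each pair, and with inline-block items each of those renders as a visible gap.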


4 participants