HTML.create_element handles void elements incorrectly #66

Open
delan opened this issue Aug 9, 2024 · 2 comments

delan commented Aug 9, 2024

Void elements only have a start tag; end tags must not be specified for void elements.

The following yields <!DOCTYPE html><img></img><br></br>. For the <img> this should have no negative effects, but the <br></br> parses as two <br> elements: per #parsing-main-inbody of the HTML spec, an end tag named "br" is a parse error that is handled as if it were a <br> start tag.

HTML.delete_content(page)
HTML.append_root(page, HTML.create_element("img"))
HTML.append_root(page, HTML.create_element("br"))

One workaround to get <!DOCTYPE html><img><br> is to use HTML.parse():

HTML.delete_content(page)
HTML.append_root(page, HTML.parse("<img>"))
HTML.append_root(page, HTML.parse("<br>"))

In some situations an element node is required, but HTML.parse() returns a document, so the result needs to be wrapped in HTML.select_one():

-- [ERROR] Could not process page: Expected an HTML element node, but found a document
-- img = HTML.parse("<img>")
img = HTML.select_one(HTML.parse("<img>"), "*")
HTML.set_attribute(img, "src", "https://soupault.app/images/soupault_logo.svg")
HTML.append_root(page, img)
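
The two workarounds above can be folded into one helper. This is only a sketch: make_void_element is a hypothetical name, not part of soupault's plugin API, and it assumes that HTML.parse() on a lone start tag yields a document containing exactly that element, as the snippets above suggest. It selects by tag name rather than "*".

```lua
-- Hypothetical helper (not part of soupault's API): build a void element by
-- parsing its start tag, then unwrap the element node from the document
-- that HTML.parse() returns.
function make_void_element(name)
  local doc = HTML.parse("<" .. name .. ">")
  return HTML.select_one(doc, name)
end

HTML.delete_content(page)

-- The helper returns an element node, so attributes can be set on it.
img = make_void_element("img")
HTML.set_attribute(img, "src", "https://soupault.app/images/soupault_logo.svg")
HTML.append_root(page, img)

HTML.append_root(page, make_void_element("br"))
```

Because no end tag is ever constructed, this should avoid the </br> problem regardless of how HTML.create_element serializes void elements.
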
delan changed the title from "HTML.create_element handles self-closing tags incorrectly" to "HTML.create_element handles void elements incorrectly" on Aug 9, 2024

dmbaturin (Collaborator) commented

I couldn't reproduce this issue with 4.10.

[dmbaturin@alcor ~/d/t/brtest]$ soupault --version
soupault 4.10.0

Copyright 2024 Daniil Baturin et al.
soupault is free software distributed under the MIT license.
Visit https://www.soupault.app for news and documentation.

Compiled with OCaml 4.14.2

[dmbaturin@alcor ~/d/t/brtest]$ cat soupault.toml 

# To learn about configuring soupault, visit https://www.soupault.app/reference-manual

[settings]
  # Soupault version that the config was written/generated for
  # Trying to process this config with an older version will result in an error message
  soupault_version = "4.10.0"

  # Stop on page processing errors?
  strict = true

  # Display progress?
  verbose = true

  # Display detailed debug output?
  debug = false

  # Where input files (pages and assets) are stored.
  site_dir = "site"

  # Where the output goes
  build_dir = "build"

  # Files inside the site/ directory can be treated as pages or static assets,
  # depending on the extension.
  #
  # Files with extensions from this list are considered pages and processed.
  # All other files are copied to build/ unchanged.
  #
  # Note that for formats other than HTML, you need to specify an external program
  # for converting them to HTML (see below).
  page_file_extensions = ["htm", "html", "md", "rst", "adoc"]

  # By default, soupault uses "clean URLs",
  # that is, $site_dir/page.html is converted to $build_dir/page/index.html
  # You can make it produce $build_dir/page.html instead by changing this option to false
  clean_urls = true

  # If you set clean_urls=false,
  # file names with ".html" and ".htm" extensions are left unchanged.
  keep_extensions = ["html", "htm"]

  # All other extensions (".md", ".rst"...) are replaced, by default with ".html"
  default_extension = "html"

  # Page files with these extensions are ignored.
  ignore_extensions = ["draft"]

  # Soupault can work as a website generator or an HTML processor.
  #
  # In the "website generator" mode, it considers files in site/ page bodies
  # and inserts them into the empty page template stored in templates/main.html
  #
  # Setting this option to false switches it to the "HTML processor" mode
  # when it considers every file in site/ a complete page and only runs it through widgets/plugins.
  generator_mode = true

  # Files that contain an <html> element are considered complete pages rather than page bodies,
  # even in the "website generator" mode.
  # This allows you to use a unique layout for some pages and still have them processed by widgets.
  complete_page_selector = "html"

  # Website generator mode requires a page template (an empty page to insert a page body into).
  # If you use "generator_mode = false", this file is not required.
  default_template_file = "templates/main.html"

  # Page content is inserted into a certain element of the page template.
  # This option is a CSS selector that is used for locating that element.
  # By default the content is inserted into the <body>
  default_content_selector = "body"

  # You can choose where exactly to insert the content in its parent element.
  # The default is append_child, but there are more, including prepend_child and replace_content
  default_content_action = "append_child"

  # If a page already has a document type declaration, keep the declaration
  keep_doctype = true

  # If a page does not have a document type declaration, force it to HTML5
  # With keep_doctype=false, soupault will replace existing declarations with it too
  doctype = "<!DOCTYPE html>"

  # Insert whitespace into HTML for better readability
  # When set to false, the original whitespace (if any) will be preserved as is
  pretty_print_html = true

  # Plugins can be either automatically discovered or loaded explicitly.
  # By default discovery is enabled and the place where soupault is looking is the plugins/ subdirectory
  # in your project.
  # E.g., a file at plugins/my-plugin.lua will be registered as a widget named "my-plugin".
  plugin_discovery = true
  plugin_dirs = ["plugins"]

  # Soupault can cache outputs of external programs
  # (page preprocessors and preprocess_element widget commands).
  # It's disabled by default but you can enable it and configure the cache directory name/path
  caching = false
  cache_dir = ".soupault-cache"

  # Soupault supports a variety of page source character encodings,
  # the default encoding is UTF-8
  page_character_encoding = "utf-8"
 

# It is possible to store pages in any format if you have a program
# that converts it to HTML and writes it to standard output.
# Example:
#[preprocessors]
#  md = "cmark --unsafe --smart"
#  adoc = "asciidoctor -o -"

# Pages can be further processed with "widgets"

# Takes the content of the first <h1> and inserts it into the <title>
[widgets.page-title]
  widget = "title"
  selector = "h1"
  default = "My Homepage"
  append = " &mdash; My Homepage"

  # Insert a <title> in a page if it doesn't have one already.
  # By default soupault assumes if it's missing, you don't want it.
  force = false

# Inserts a generator meta tag in the page <head>
# Just for demonstration, feel free to remove
[widgets.generator-meta]
  widget = "insert_html"
  html = '<meta name="generator" content="soupault">'
  selector = "head"

# <blink> elements are evil, delete them all
[widgets.no-blink]
  widget = "delete_element"
  selector = "blink"

  # By default this widget deletes all elements matching the selector,
  # but you can set this option to false to delete just the first one
  delete_all = true

[widgets.test]
  widget = "test"

[dmbaturin@alcor ~/d/t/brtest]$ cat plugins/test.lua 
HTML.delete_content(page)
HTML.append_root(page, HTML.create_element("img"))
HTML.append_root(page, HTML.create_element("br"))
[dmbaturin@alcor ~/d/t/brtest]$ soupault 
[INFO] Starting soupault 4.10.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/index.html
[INFO] Using the default template for page site/index.html
[INFO] Processing widget generator-meta on page site/index.html
[INFO] Processing widget page-title on page site/index.html
[INFO] Processing widget test on page site/index.html
[INFO] Processing widget no-blink on page site/index.html
[INFO] Writing generated page to build/index.html

[dmbaturin@alcor ~/d/t/brtest]$ cat build/index.html 
<!DOCTYPE html>
<img><br>

dmbaturin (Collaborator) commented

Hmm, one idea: does the original page have a doctype that allows void elements, like <!DOCTYPE html>? The doctype does affect the parsing and rendering mode selection in Markup.ml/LambdaSoup.
