Guidance for converting HTML headings to heading_* Token #813

shellscape · 2021-08-27T14:56:14Z

The author of markdown-it-anchor and I have been discussing how to handle headings that are written as HTML inside a markdown file. The Vue README (https://github.com/vuejs/vue/blob/dev/README.md) has a few examples of these. It's desirable to handle a <h2>Hello</h2> in HTML as we would ## Hello, and tokenize those headings so they can be processed by other plugins.

As I'm still learning the methodologies and best practices of markdown-it, I was hoping you might be able to provide guidance on the best method for processing the HTML and inserting tokens appropriately. I have a working proof of concept which splits html_block tokens using cheerio and manually splices in new Tokens, but a lot of that is manual lifting and I'd like to get your take on this before I march ahead with that. TIA


shellscape · 2021-08-27T16:08:10Z

Here's what I've got so far. I'm sure there are things that need improvement, would love feedback:

import cheerio from 'cheerio';
import MarkdownIt from 'markdown-it';
import Token from 'markdown-it/lib/token';

export default function htmlHeaders(md: MarkdownIt) {
  md.core.ruler.after('inline', 'html-headers', (state) => {
    state.tokens.forEach((blockToken) => {
      if (blockToken.type !== 'html_block') {
        return;
      }
      const $ = cheerio.load(`${blockToken.content}`, { xmlMode: true });
      const headings = $('h1,h2,h3,h4,h5,h6');

      if (!headings.length) {
        return;
      }

      const { map } = blockToken;

      headings.each((_, e) => {
        const { tagName } = e;
        const level = parseInt(tagName.substring(1), 10);
        const markup = ''.padStart(level, '#');
        const element = $(e);

        const open = new Token('heading_open', tagName, 1);
        open.markup = markup;
        open.map = map;

        Object.entries(e.attribs).forEach(([key, value]) => {
          open.attrSet(key, value);
        });

        const content = new Token('text', '', 0);
        content.map = map;
        content.content = element.text() || '';

        const body = new Token('inline', '', 0);
        body.content = content.content;
        body.map = map;
        body.children = [content];

        const close = new Token('heading_close', tagName, -1);
        close.markup = markup;

        const position = state.tokens.indexOf(blockToken);
        state.tokens.splice(position, 0, open, body, close);

        element.remove();
      });

      // eslint-disable-next-line no-param-reassign
      blockToken.content = $.html();
    });

    return false;
  });
}
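One detail worth a second look in the snippet above is that it splices into state.tokens while forEach is still walking the same array, so indices shift mid-iteration. A minimal sketch of an index-based alternative follows. It walks backwards so that a splice never moves a token that has not been visited yet. MiniToken and expandTokens are hypothetical names used only for illustration; MiniToken is a stand-in for the handful of fields touched here, not markdown-it's Token class.

```typescript
// Stand-in for the few token fields this sketch touches (an assumption,
// not the library's full Token class).
interface MiniToken {
  type: string;
  content: string;
}

// Hypothetical helper: replaces every token matching `shouldExpand` with the
// tokens returned by `expand`. Iterating from the end means a splice only
// shifts positions that have already been handled.
function expandTokens(
  tokens: MiniToken[],
  shouldExpand: (token: MiniToken) => boolean,
  expand: (token: MiniToken) => MiniToken[],
): void {
  for (let i = tokens.length - 1; i >= 0; i -= 1) {
    const token = tokens[i];
    if (token && shouldExpand(token)) {
      tokens.splice(i, 1, ...expand(token));
    }
  }
}
```

To keep the leftover markup, as the proof of concept does, `expand` could return the new heading tokens followed by the original html_block token with its content trimmed.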

puzrin · 2021-08-27T16:20:17Z

In general, I would propose processing HTML only after markdown is rendered to HTML. cheerio is a good choice for that.

If you wish to process the html_block token and reinject it as markdown heading_* tokens, that's probably possible, but I don't like to give any guarantees. At first glance, it may work as expected.

shellscape · 2021-08-27T16:24:34Z

> In general, I would propose processing HTML only after markdown is rendered to HTML.

Agreed. This is a specialized case in which we need markdown-it-anchor and markdown-it-toc-done-right to both process the HTML headings, which they can only do if those headings are represented by tokens in the stream.

Thanks for the feedback, if you spot anything that may be a concern on another glance, please do let me know.
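To illustrate why the headings have to exist as tokens, here is a rough sketch of the kind of scan a heading-aware plugin might perform. It is not taken from markdown-it-anchor or markdown-it-toc-done-right. HeadingLikeToken, HeadingEntry and collectHeadings are hypothetical names, and the token shape is a simplified stand-in for markdown-it's Token.

```typescript
// Simplified token shape (an assumption for this sketch).
interface HeadingLikeToken {
  type: string;
  tag: string;
  content: string;
}

interface HeadingEntry {
  level: number;
  text: string;
}

// Collects headings by looking for a heading_open token immediately followed
// by an inline token. Raw HTML inside an html_block token is never inspected.
function collectHeadings(tokens: HeadingLikeToken[]): HeadingEntry[] {
  const headings: HeadingEntry[] = [];
  for (let i = 0; i < tokens.length - 1; i += 1) {
    const open = tokens[i];
    const inline = tokens[i + 1];
    if (open && inline && open.type === 'heading_open' && inline.type === 'inline') {
      // tag is 'h1'..'h6', so the level is the digit after the 'h'.
      headings.push({ level: Number(open.tag.slice(1)), text: inline.content });
    }
  }
  return headings;
}
```

With a scan like this, a <h2>Hello</h2> that stays inside an html_block is skipped entirely, while the same heading expressed as heading_open / inline / heading_close is picked up.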

puzrin · 2021-08-27T16:35:55Z

I don't see any obvious reason why your special case should not be used. It seems you are qualified and understand well what you are doing.

Of course, if you enable html, it's worth using a sanitizer to restrict allowed tags & attrs. But that's another story, not specific to your question. The general approach can be borrowed from npm's wrapper. They tweak markdown-it to behave very close to GitHub (to render README files on npmjs.com).
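As a small illustration of restricting attributes, the sketch below filters an element's attributes against an allow-list before they would be copied onto a heading_open token. This is not a full sanitizer. The names pickAllowedAttrs and ALLOWED_HEADING_ATTRS, and the contents of the allow-list, are assumptions made for the example.

```typescript
// Hypothetical allow-list; which attributes are safe depends on the host page.
const ALLOWED_HEADING_ATTRS: ReadonlySet<string> = new Set(['align', 'class']);

// Returns only the [name, value] pairs whose name is on the allow-list.
// Names are compared case-insensitively; values are passed through unchanged.
function pickAllowedAttrs(
  attribs: Record<string, string>,
  allowed: ReadonlySet<string> = ALLOWED_HEADING_ATTRS,
): Array<[string, string]> {
  return Object.entries(attribs).filter(([name]) => allowed.has(name.toLowerCase()));
}
```

In the proof of concept, the loop over `Object.entries(e.attribs)` could iterate over `pickAllowedAttrs(e.attribs)` instead, so that attributes such as onclick never reach `attrSet`.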

shellscape · 2021-08-27T16:48:23Z

Haha I had no idea that NPM had done something similar. Thanks for the tip! It looks like they took a similar, but different path: https://github.com/npm/marky-markdown/blob/master/lib/plugin/html-heading.js

puzrin · 2021-08-27T17:17:47Z

See also #28. There are probably security notes you should know about (and they explain why such a popular feature has not landed here yet). GitHub (and marky-markdown) force prefixes for anchor names.
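A minimal sketch of what a forced prefix could look like is shown below. The function name prefixedAnchorId is hypothetical, the slug rules are deliberately simplistic and are not GitHub's or marky-markdown's actual algorithm, and the default prefix is modeled on the `user-content-` prefix that appears on anchors in GitHub-rendered READMEs. The presumable point of the prefix is that an id derived from user content cannot collide with an id the host page already relies on.

```typescript
// Hypothetical helper: builds an anchor id from heading text and always
// prepends a prefix so user-derived ids live in their own namespace.
function prefixedAnchorId(headingText: string, prefix = 'user-content-'): string {
  const slug = headingText
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, '') // drop anything except letters, digits, spaces, hyphens
    .replace(/\s+/g, '-'); // collapse whitespace runs into single hyphens
  return `${prefix}${slug}`;
}
```

For example, `prefixedAnchorId('Hello World!')` yields `user-content-hello-world`.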

shellscape closed this as completed Aug 27, 2021
