Guidance for converting HTML headings to heading_* Token #813

Closed
shellscape opened this issue Aug 27, 2021 · 6 comments

Comments

@shellscape

The author of markdown-it-anchor and I have been discussing how to handle headings that are written as HTML in a markdown file. The Vue README has a few examples: https://github.com/vuejs/vue/blob/dev/README.md. It's desirable to handle an <h2>Hello</h2> in HTML as we would ## Hello, and to tokenize those headings so they can be processed by other plugins.

As I'm still learning the methodologies and best practices of markdown-it, I was hoping you might be able to provide guidance on the best method for processing the HTML and inserting tokens appropriately. I have a working proof of concept which splits html_block tokens using cheerio and manually splices in new Tokens, but a lot of that is manual lifting, and I'd like to get your take on this before I march ahead with it. TIA

@shellscape
Author

shellscape commented Aug 27, 2021

Here's what I've got so far. I'm sure there are things that need improvement, and I would love feedback:

import cheerio from 'cheerio';
import MarkdownIt from 'markdown-it';
import Token from 'markdown-it/lib/token';

export default function htmlHeaders(md: MarkdownIt) {
  md.core.ruler.after('inline', 'html-headers', (state) => {
    // Iterate over a copy: the callback splices new tokens into state.tokens,
    // and mutating the array being iterated would revisit shifted entries.
    state.tokens.slice().forEach((blockToken) => {
      if (blockToken.type !== 'html_block') {
        return;
      }
      const $ = cheerio.load(blockToken.content, { xmlMode: true });
      const headings = $('h1,h2,h3,h4,h5,h6');

      if (!headings.length) {
        return;
      }

      const { map } = blockToken;

      headings.each((_, e) => {
        const { tagName } = e;
        // 'h2' -> level 2 -> markup '##', mirroring an ATX heading
        const level = parseInt(tagName.substring(1), 10);
        const markup = ''.padStart(level, '#');
        const element = $(e);

        const open = new Token('heading_open', tagName, 1);
        open.markup = markup;
        open.map = map;
        // The block parser marks native heading tokens as block-level
        open.block = true;

        // Note: this copies every attribute of the HTML heading, unsanitized
        Object.entries(e.attribs).forEach(([key, value]) => {
          open.attrSet(key, value);
        });

        // Only the text content survives; nested inline markup is dropped
        const content = new Token('text', '', 0);
        content.map = map;
        content.content = element.text() || '';

        const body = new Token('inline', '', 0);
        body.content = content.content;
        body.map = map;
        body.block = true;
        body.children = [content];

        const close = new Token('heading_close', tagName, -1);
        close.markup = markup;
        close.block = true;

        // Insert the heading tokens immediately before the html_block,
        // so headings keep their relative order ahead of the remaining HTML
        const position = state.tokens.indexOf(blockToken);
        state.tokens.splice(position, 0, open, body, close);

        element.remove();
      });

      // Whatever HTML is left stays in the original html_block token
      // eslint-disable-next-line no-param-reassign
      blockToken.content = $.html();
    });

    return false;
  });
}

@puzrin
Member

puzrin commented Aug 27, 2021

In general, I would propose to process HTML only after markdown is rendered to HTML. cheerio is a good choice.

If you wish to process the html_block token and reinject it as markdown heading_* tokens, that's probably possible, but I don't want to give any guarantees. At first glance, it may work as expected.
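As an illustration of the post-render approach suggested above, here is a minimal sketch that adds `id` attributes to headings in already rendered HTML. It is not from the thread: `addHeadingIds` and `simpleSlug` are hypothetical names, and a regular expression stands in for cheerio only to keep the sketch dependency-free. It matches nothing but attribute-less headings with plain-text content, so real input would call for a proper HTML parser such as cheerio.

```typescript
// Very small slug function, for illustration only.
function simpleSlug(text: string): string {
  return text
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Hypothetical helper: give every <hN>text</hN> an id derived from its text.
// Headings that already carry attributes or nested tags are left untouched.
function addHeadingIds(html: string, slugify: (text: string) => string): string {
  return html.replace(
    /<h([1-6])>([^<]*)<\/h\1>/g,
    (_match: string, level: string, text: string) =>
      `<h${level} id="${slugify(text)}">${text}</h${level}>`,
  );
}

const renderedHtml = '<h2>Hello World</h2>\n<p>Body</p>';
const htmlWithIds = addHeadingIds(renderedHtml, simpleSlug);
```

Working on the rendered output treats markdown headings and HTML headings alike, which is the appeal of this route; the trade-off is that token-based plugins never see the HTML headings.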

@shellscape
Author

> In general, I would propose to process HTML only after markdown is rendered to HTML.

Agreed. This is a specialized case in which we need markdown-it-anchor and markdown-it-toc-done-right to both process the HTML headings, which they can only do if those headings are represented by tokens in the stream.

Thanks for the feedback. If you spot anything that may be a concern on another glance, please do let me know.
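To make the point about token-based plugins concrete, here is a rough sketch of why an HTML heading is invisible to them. `MiniToken` is a simplified stand-in for a markdown-it token (only three fields), and `collectHeadings` is a hypothetical scanner, not the actual code of markdown-it-anchor or markdown-it-toc-done-right.

```typescript
// Minimal stand-in for the token fields used here (assumed, simplified shape).
interface MiniToken {
  type: string;
  tag: string;
  content: string;
}

// Hypothetical scanner: a heading is a heading_open token followed by
// the inline token that carries its text.
function collectHeadings(tokens: MiniToken[]): { tag: string; title: string }[] {
  const found: { tag: string; title: string }[] = [];
  tokens.forEach((token, i) => {
    const next = tokens[i + 1];
    if (token.type === 'heading_open' && next !== undefined && next.type === 'inline') {
      found.push({ tag: token.tag, title: next.content });
    }
  });
  return found;
}

// "## Hello" as the block parser tokenizes it.
const fromMarkdown: MiniToken[] = [
  { type: 'heading_open', tag: 'h2', content: '' },
  { type: 'inline', tag: '', content: 'Hello' },
  { type: 'heading_close', tag: 'h2', content: '' },
];

// "<h2>Hello</h2>" with html enabled: a single opaque html_block token.
const fromHtml: MiniToken[] = [
  { type: 'html_block', tag: '', content: '<h2>Hello</h2>' },
];
```

With these inputs the scanner finds one heading in `fromMarkdown` and none in `fromHtml`, which is the gap the proof of concept closes by splicing heading tokens into the stream.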

@puzrin
Member

puzrin commented Aug 27, 2021

I don't see any obvious reason why your special case should not be used. It seems you are qualified and understand well what you are doing.

Of course, if you enable html, it is worth using a sanitizer to restrict the allowed tokens and attributes. But that's another story, not specific to your question. The general approach can be borrowed from npm's wrapper. They tweak markdown-it to behave very close to GitHub (to render README files on npmjs.com).
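The proof of concept earlier in the thread copies every attribute of an HTML heading onto the `heading_open` token. A minimal sketch of restricting that to an allowlist could look like the following. The function name and the list of allowed attributes are assumptions for illustration, and this is not a substitute for a real HTML sanitizer.

```typescript
// Hypothetical allowlist; adjust to the attributes a project actually needs.
const ALLOWED_HEADING_ATTRS: string[] = ['id', 'class', 'align'];

// Keep only allowlisted attributes, returned as [name, value] pairs that
// could be passed one by one to a token's attrSet(name, value).
function pickAllowedAttrs(
  attribs: { [key: string]: string },
  allowed: string[] = ALLOWED_HEADING_ATTRS,
): [string, string][] {
  return Object.keys(attribs)
    .filter((key) => allowed.indexOf(key.toLowerCase()) !== -1)
    .map((key): [string, string] => [key, attribs[key]]);
}

// The event handler is dropped; the presentational attributes survive.
const keptAttrs = pickAllowedAttrs({ align: 'center', onclick: 'alert(1)', id: 'intro' });
```

An allowlist is used rather than a blocklist because unknown attributes are then rejected by default.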

@shellscape
Author

Haha, I had no idea that npm had done something similar. Thanks for the tip! It looks like they took a similar but different path: https://github.com/npm/marky-markdown/blob/master/lib/plugin/html-heading.js

@puzrin
Member

puzrin commented Aug 27, 2021

See also #28. There are probably security notes there that you should know about (and that explain why such a popular feature has not yet landed here). GitHub (and marky-markdown) force prefixes on anchor names.
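A minimal sketch of forcing a prefix on generated anchor names follows. The prefix value is an assumption modeled on the `user-content-` prefix GitHub applies, and `prefixedAnchor` is a hypothetical helper rather than code from markdown-it or marky-markdown.

```typescript
// Assumed prefix, modeled on the one GitHub applies to user-generated ids.
const ANCHOR_PREFIX = 'user-content-';

// Hypothetical helper: slugify a heading title and namespace the result so
// that user-authored headings cannot produce arbitrary ids in the page.
function prefixedAnchor(title: string, prefix: string = ANCHOR_PREFIX): string {
  const slug = title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
  return prefix + slug;
}

const helloAnchor = prefixedAnchor('Hello World');
```

Namespacing the ids keeps a heading such as "Cookie" from producing a bare `id="cookie"`, so user content is less likely to collide with names the hosting page relies on.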
