Question: Position of a token in the source string #2134

nidoro · 2021-07-12T19:03:34Z

I'm working on a project (a syntax highlighter for an editor) that requires access to the position of each token within the source string. After scanning through the lexer and parser documentation, I didn't find a way to get it. Ideally, for my use case, the tokens returned by the lex(...) function would contain the character position (line number and column number) of the start and end of the token (or the token's raw size, which I think is already available).

Is there already a way to know the position of each token? If not, consider this a feature proposal :) I'm sure it is an easy thing to add.

calculuschild · 2021-07-12T19:39:34Z

You are correct, we do not currently log the string positions of the tokens.

You may be able to get something to work with the walkTokens feature by tracking the sum of the token "raw" lengths and adding a property to each token with the current total. Things get more complex once you get into sub-tokens, but it should be possible.
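A minimal sketch of the running-total idea, assuming it is applied to a flat list of top-level tokens. The helper name addOffsets and the properties startIndex and endIndex are made up for illustration and are not part of marked; the token list is a hand-written stand-in for lexer output.

```javascript
// Hypothetical helper: annotate top-level tokens with character offsets by
// accumulating the length of each token's `raw` string.
function addOffsets(tokens) {
  let offset = 0;
  for (const token of tokens) {
    token.startIndex = offset;      // index of the token's first character
    offset += token.raw.length;
    token.endIndex = offset - 1;    // inclusive index of its last character
  }
  return tokens;
}

// Stand-in for lexer output; only `raw` matters for this sketch.
const tokens = addOffsets([
  { type: 'heading', raw: '# Title\n\n' },
  { type: 'paragraph', raw: 'Some text.' }
]);
console.log(tokens.map(t => [t.type, t.startIndex, t.endIndex]));
```

This only stays accurate if the raw strings of the listed tokens add up to the whole source. Any token the lexer consumes without emitting would make the later offsets drift, and sub-tokens need their own offset relative to the parent.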

UziTech · 2021-07-12T22:30:45Z

I'm sure it is an easy thing to add

@nidoro We always appreciate PRs 😁👍

nidoro · 2021-07-13T14:34:29Z

I just started using the library, and my knowledge of its inner workings is too little to make a pull request. But I did some changes that seem to be working. I'll explain what I did and would really appreciate your feedback to make sure I'm doing things correctly. I did some testing and things are working 99% of the time, but I'm still missing something.

I've modified the lexer so that it returns the position of each token in the source. Each token returned by the lex(...) function now has two new members: start: {line, column, index} and end: {line, column, index}. For my use case, I only need the line and column, but I went ahead and included the index in case other users need it. Also, the interval [start, end] is inclusive, meaning the end is part of the token. My changes can be summarized in four steps:

Three Lexer functions have been modified to accept an at parameter, which indicates where we are in the source file (at: {line, column, index}).

blockTokens(src, tokens, top, at)
inline(tokens, at)
inlineTokens(src, tokens, at, inLink, inRawBlock)

The lex(...) function now looks like this:

function lex(src) {
      src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');
      
      let at = {line: 0, column: 0, index: 0};
      this.blockTokens(src, this.tokens, true, at); 
      at = {line: 0, column: 0, index: 0}; 
      this.inline(this.tokens, at);
      return this.tokens;
}

I've implemented three helper functions in the Lexer:

function copyAt(at) {
      return {line: at.line, column: at.column, index: at.index};
}
 
// Advances the 'at' iterator by 'count' characters.
function advance(src, at, count) {
      for (let i = 0; i < count; ++i) {
        let c = src[i];
        if (c == '\n') {
          ++at.line;
          at.column = 0;
        } else {
          ++at.column;
        }
        ++at.index;
      }
}

// Eats the token that starts 'src', meaning it sets the token
// start and end positions, advances the 'at' iterator to skip
// the token and returns the remaining string.
function eatToken(src, token, at) {
      token.start = this.copyAt(at);
      this.advance(src, at, token.raw.length-1);
      token.end   = this.copyAt(at);
      this.advance(src[token.raw.length-1], at, 1);
      return src.substring(token.raw.length);
}
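
To see what the inclusive [start, end] interval looks like in practice, here is a small standalone run of the same three helpers, rewritten as free functions so they can be tried outside the Lexer. The token object is a hand-written stand-in with only a raw field, not real lexer output.

```javascript
// Free-function versions of the helpers above, for experimentation only.
function copyAt(at) {
  return { line: at.line, column: at.column, index: at.index };
}

function advance(src, at, count) {
  for (let i = 0; i < count; ++i) {
    if (src[i] === '\n') {
      ++at.line;
      at.column = 0;
    } else {
      ++at.column;
    }
    ++at.index;
  }
}

function eatToken(src, token, at) {
  token.start = copyAt(at);
  advance(src, at, token.raw.length - 1);     // stop on the token's last character
  token.end = copyAt(at);
  advance(src[token.raw.length - 1], at, 1);  // step past that last character
  return src.substring(token.raw.length);
}

// Hypothetical token: only `raw` is needed for position tracking.
const at = { line: 0, column: 0, index: 0 };
const heading = { type: 'heading', raw: '# a\n' };
const rest = eatToken('# a\nb', heading, at);
console.log(heading.start, heading.end, at, rest);
```

Here the token ends on its trailing newline (line 0, column 3), and at moves on to line 1, column 0. Note that a token with an empty raw string would make the second advance call throw, since src[-1] is undefined.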

Finally, now it is just a matter of searching and replacing some function calls. The occurrences of src = src.substring(token.raw.length); have been replaced by src = this.eatToken(src, token, at). And the calls to the functions blockTokens(...), inline(...) and inlineTokens(...) now include the parameter at.

I think the at parameter passed in these function calls sometimes needs to be a copy rather than a reference, but I'm not sure when. You can see where I passed a copy rather than a reference below. I've made the changes directly to /lib/marked.js, which I figured was the quickest and dirtiest way for me to test it.

Click to see Lexer changes (I've only pasted the "Block Lexer" section of the file, which contains all the changes)

  /**
   * Block Lexer
   */


  var Lexer_1 = /*#__PURE__*/function () {
    function Lexer(options) {
      this.tokens = [];
      this.tokens.links = Object.create(null);
      this.options = options || defaults$3;
      this.options.tokenizer = this.options.tokenizer || new Tokenizer$1();
      this.tokenizer = this.options.tokenizer;
      this.tokenizer.options = this.options;
      var rules = {
        block: block.normal,
        inline: inline.normal
      };

      if (this.options.pedantic) {
        rules.block = block.pedantic;
        rules.inline = inline.pedantic;
      } else if (this.options.gfm) {
        rules.block = block.gfm;

        if (this.options.breaks) {
          rules.inline = inline.breaks;
        } else {
          rules.inline = inline.gfm;
        }
      }

      this.tokenizer.rules = rules;
    }
    /**
     * Expose Rules
     */


    /**
     * Static Lex Method
     */
    Lexer.lex = function lex(src, options) {
      var lexer = new Lexer(options);
      return lexer.lex(src);
    }
    /**
     * Static Lex Inline Method
     */
    ;

    Lexer.lexInline = function lexInline(src, options) {
      var lexer = new Lexer(options);
      return lexer.inlineTokens(src);
    }
    /**
     * Preprocessing
     */
    ;

    var _proto = Lexer.prototype;

    _proto.lex = function lex(src) {
      src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');
      
      let at = {line: 0, column: 0, index: 0};
      
      this.blockTokens(src, this.tokens, true, at);
      
      at = {line: 0, column: 0, index: 0};
      
      this.inline(this.tokens, at);
      return this.tokens;
    }
    /**
     * Lexing
     */
    ;
    
    _proto.copyAt = function copyAt(at) {
      return {line: at.line, column: at.column, index: at.index};
    }
    
    _proto.advance = function advance(src, at, count) {
      for (let i = 0; i < count; ++i) {
        let c = src[i];
        if (c == '\n') {
          ++at.line;
          at.column = 0;
        } else {
          ++at.column;
        }
        ++at.index;
      }
    }
    
    _proto.eatToken = function eatToken(src, token, at) {
      token.start = this.copyAt(at);
      this.advance(src, at, token.raw.length-1);
      token.end   = this.copyAt(at);
      this.advance(src[token.raw.length-1], at, 1);
      return src.substring(token.raw.length);
    }

    _proto.blockTokens = function blockTokens(src, tokens, top, at) {
      var _this = this;

      if (tokens === void 0) {
        tokens = [];
      }

      if (top === void 0) {
        top = true;
      }
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (this.options.pedantic) {
        src = src.replace(/^ +$/gm, '');
      }

      var token, i, l, lastToken, cutSrc, lastParagraphClipped;

      while (src) {
        if (this.options.extensions && this.options.extensions.block && this.options.extensions.block.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this, src, tokens)) {
            src = _this.eatToken(src, token, at); // `this` is not the Lexer inside this callback
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // newline


        if (token = this.tokenizer.space(src)) {
          src = this.eatToken(src, token, at);

          if (token.type) {
            tokens.push(token);
          }

          continue;
        } // code


        if (token = this.tokenizer.code(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1]; // An indented code block cannot interrupt a paragraph.

          if (lastToken && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // fences


        if (token = this.tokenizer.fences(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // heading


        if (token = this.tokenizer.heading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // table no leading pipe (gfm)


        if (token = this.tokenizer.nptable(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // hr


        if (token = this.tokenizer.hr(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // blockquote


        if (token = this.tokenizer.blockquote(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.blockTokens(token.text, [], top, this.copyAt(at));
          tokens.push(token);
          continue;
        } // list


        if (token = this.tokenizer.list(src)) {
          src = this.eatToken(src, token, at);
          l = token.items.length;

          for (i = 0; i < l; i++) {
            token.items[i].tokens = this.blockTokens(token.items[i].text, [], false, this.copyAt(at));
          }

          tokens.push(token);
          continue;
        } // html


        if (token = this.tokenizer.html(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // def


        if (top && (token = this.tokenizer.def(src))) {
          src = this.eatToken(src, token, at);

          if (!this.tokens.links[token.tag]) {
            this.tokens.links[token.tag] = {
              href: token.href,
              title: token.title
            };
          }

          continue;
        } // table (gfm)


        if (token = this.tokenizer.table(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // lheading


        if (token = this.tokenizer.lheading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // top-level paragraph
        // prevent paragraph consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startBlock) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this.options.extensions.startBlock.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (top && (token = this.tokenizer.paragraph(cutSrc))) {
          lastToken = tokens[tokens.length - 1];

          if (lastParagraphClipped && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          lastParagraphClipped = cutSrc.length !== src.length;
          src = this.eatToken(src, token, at);
          continue;
        } // text


        if (token = this.tokenizer.text(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _proto.inline = function inline(tokens, at) {
      var i, j, k, l2, row, token;
      var l = tokens.length;

      for (i = 0; i < l; i++) {
        token = tokens[i];

        switch (token.type) {
          case 'paragraph':
          case 'text':
          case 'heading':
            {
              token.tokens = [];
              this.inlineTokens(token.text, token.tokens, this.copyAt(token.start)); // copy includes index, so it does not become NaN
              break;
            }

          case 'table':
            {
              token.tokens = {
                header: [],
                cells: []
              }; // header

              l2 = token.header.length;

              for (j = 0; j < l2; j++) {
                token.tokens.header[j] = [];
                this.inlineTokens(token.header[j], token.tokens.header[j], at);
              } // cells


              l2 = token.cells.length;

              for (j = 0; j < l2; j++) {
                row = token.cells[j];
                token.tokens.cells[j] = [];

                for (k = 0; k < row.length; k++) {
                  token.tokens.cells[j][k] = [];
                  this.inlineTokens(row[k], token.tokens.cells[j][k], at);
                }
              }

              break;
            }

          case 'blockquote':
            {
              this.inline(token.tokens, at);
              break;
            }

          case 'list':
            {
              l2 = token.items.length;

              for (j = 0; j < l2; j++) {
                this.inline(token.items[j].tokens, at);
              }

              break;
            }
        }
      }

      return tokens;
    }
    /**
     * Lexing/Compiling
     */
    ;

    _proto.inlineTokens = function inlineTokens(src, tokens, at, inLink, inRawBlock) {
      var _this2 = this;
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (tokens === void 0) {
        tokens = [];
      }

      if (inLink === void 0) {
        inLink = false;
      }

      if (inRawBlock === void 0) {
        inRawBlock = false;
      }

      var token, lastToken, cutSrc; // String with links masked to avoid interference with em and strong

      var maskedSrc = src;
      var match;
      var keepPrevChar, prevChar; // Mask out reflinks

      if (this.tokens.links) {
        var links = Object.keys(this.tokens.links);

        if (links.length > 0) {
          while ((match = this.tokenizer.rules.inline.reflinkSearch.exec(maskedSrc)) != null) {
            if (links.includes(match[0].slice(match[0].lastIndexOf('[') + 1, -1))) {
              maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.reflinkSearch.lastIndex);
            }
          }
        }
      } // Mask out other blocks


      while ((match = this.tokenizer.rules.inline.blockSkip.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.blockSkip.lastIndex);
      } // Mask out escaped em & strong delimiters


      while ((match = this.tokenizer.rules.inline.escapedEmSt.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '++' + maskedSrc.slice(this.tokenizer.rules.inline.escapedEmSt.lastIndex);
      }

      while (src) {
        if (!keepPrevChar) {
          prevChar = '';
        }

        keepPrevChar = false; // extensions

        if (this.options.extensions && this.options.extensions.inline && this.options.extensions.inline.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this2, src, tokens)) {
            src = _this2.eatToken(src, token, at); // `this` is not the Lexer inside this callback
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // escape


        if (token = this.tokenizer.escape(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // tag


        if (token = this.tokenizer.tag(src, inLink, inRawBlock)) {
          src = this.eatToken(src, token, at);
          inLink = token.inLink;
          inRawBlock = token.inRawBlock;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // link


        if (token = this.tokenizer.link(src)) {
          src = this.eatToken(src, token, at);

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), true, inRawBlock);
          }

          tokens.push(token);
          continue;
        } // reflink, nolink


        if (token = this.tokenizer.reflink(src, this.tokens.links)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), true, inRawBlock);
            tokens.push(token);
          } else if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // em & strong


        if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // code


        if (token = this.tokenizer.codespan(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // br


        if (token = this.tokenizer.br(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // del (gfm)


        if (token = this.tokenizer.del(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // autolink


        if (token = this.tokenizer.autolink(src, mangle)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // url (gfm)


        if (!inLink && (token = this.tokenizer.url(src, mangle))) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // text
        // prevent inlineText consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startInline) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this2.options.extensions.startInline.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (token = this.tokenizer.inlineText(cutSrc, inRawBlock, smartypants)) {
          src = this.eatToken(src, token, at);

          if (token.raw.slice(-1) !== '_') {
            // Track prevChar before string of ____ started
            prevChar = token.raw.slice(-1);
          }

          keepPrevChar = true;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _createClass(Lexer, null, [{
      key: "rules",
      get: function get() {
        return {
          block: block,
          inline: inline
        };
      }
    }]);

    return Lexer;
  }();

Like I said, everything seems to be working 99% of the time, but I've noticed an incorrect result for the following markdown source, and I suspect there are other cases that would generate incorrect results:

> quote
> > > quote
# test

paragraph

I appreciate any help on this. Thank you for the library!
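
One possible cause, stated here as an assumption rather than something confirmed in this thread: the nested call receives token.text, which no longer contains the "> " markers, so positions counted inside that text fall short of the real source columns. The sketch below records how wide the stripped marker was on each line and adds it back. The regex and the names markerWidths and toSourcePosition are illustrative, and only one nesting level is handled.

```javascript
// Assumption: the blockquote text is the raw text with one leading "> "
// marker removed from every line.
const MARKER = /^ *> ?/;

// For each line of `raw`, record how many characters the marker occupied.
function markerWidths(raw) {
  return raw.split('\n').map(line => {
    const m = MARKER.exec(line);
    return m ? m[0].length : 0;
  });
}

// Translate a (line, column) position inside the stripped text back to a
// position in the source, given where the blockquote itself starts.
function toSourcePosition(pos, widths, blockStart) {
  const firstLineShift = pos.line === 0 ? blockStart.column : 0;
  return {
    line: blockStart.line + pos.line,
    column: pos.column + widths[pos.line] + firstLineShift
  };
}

const raw = '> quote\n> > > quote';
const widths = markerWidths(raw);
// Line 1, column 0 of the stripped text is the second '>' on that line.
console.log(toSourcePosition({ line: 1, column: 0 }, widths, { line: 0, column: 0 }));
```

For deeper nesting, the same translation would have to be applied once per level, from the innermost blockquote outwards.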

UziTech · 2021-07-13T22:31:59Z

I think this is going to be much harder (nearly impossible) because of the line src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');. If the user uses tabs, we won't be able to tell if four spaces are supposed to be one character or four.
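
One conceivable workaround, sketched under the assumption that the preprocessing step could be replaced by a custom one: expand the tabs manually and keep a table that maps every index of the expanded string back to the original. The function expandTabs and the originalIndex array are hypothetical and not part of marked.

```javascript
// Expand tabs to four spaces while remembering, for every character of the
// expanded string, the index of the original character it came from.
function expandTabs(src) {
  let text = '';
  const originalIndex = [];
  for (let i = 0; i < src.length; i++) {
    const piece = src[i] === '\t' ? '    ' : src[i];
    for (let j = 0; j < piece.length; j++) {
      originalIndex.push(i);
    }
    text += piece;
  }
  return { text, originalIndex };
}

const { text, originalIndex } = expandTabs('a\tb');
// `text` is 'a', four spaces, 'b'; the 'b' at index 5 came from index 2.
console.log(text.length, originalIndex[5]);
```

The \r\n to \n normalization shifts indices in the same way and would need an equivalent table. The cost is one extra array entry per character of input.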

UziTech · 2021-07-13T22:36:37Z

Also, this looks like it is going to slow marked down a lot, checking every character for \n.
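
For comparison, a sketch of a different approach than the per-token character scan: find every line start once, then turn an index into a line and column with a binary search. The names buildLineStarts and positionAt are illustrative only, and whether this is actually faster inside marked has not been measured here.

```javascript
// Collect the index at which every line begins (one pass over the source).
function buildLineStarts(src) {
  const starts = [0];
  for (let i = 0; i < src.length; i++) {
    if (src[i] === '\n') starts.push(i + 1);
  }
  return starts;
}

// Binary search for the last line start that is <= index.
function positionAt(starts, index) {
  let lo = 0;
  let hi = starts.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (starts[mid] <= index) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return { line: lo, column: index - starts[lo], index };
}

const starts = buildLineStarts('ab\ncd\n\nef');
console.log(positionAt(starts, 4));
```

With this layout the lexer would only need to track a running index per token, and line and column could be computed on demand by whoever needs them.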

nidoro · 2021-07-14T01:49:57Z

You raise valid points, but depending on the use case, they may or may not be of great importance. For instance, in my use case, we only feed marked tab-free input. As for the performance hit, this is a trade-off I'm willing to make, and possibly other users are too. If it gets out of hand (which will only happen for large files), I can try running it asynchronously.

Anyway, I do understand that this is not a highly demanded feature, but I do think this is an improvement on the library. Maybe you can make it optional, so the default behavior stays the same, but you have a returnTokenPosition boolean option to activate this.

I've fixed some bugs in my previous code. It is still not perfect, but it is better. The only problem I'm still having (hopefully) is with nested styles, like emphasis inside lists. But I think it is just a matter of time to get it to 100%.

Click to show code

  /**
   * Block Lexer
   */


  var Lexer_1 = /*#__PURE__*/function () {
    function Lexer(options) {
      this.tokens = [];
      this.tokens.links = Object.create(null);
      this.options = options || defaults$3;
      this.options.tokenizer = this.options.tokenizer || new Tokenizer$1();
      this.tokenizer = this.options.tokenizer;
      this.tokenizer.options = this.options;
      var rules = {
        block: block.normal,
        inline: inline.normal
      };

      if (this.options.pedantic) {
        rules.block = block.pedantic;
        rules.inline = inline.pedantic;
      } else if (this.options.gfm) {
        rules.block = block.gfm;

        if (this.options.breaks) {
          rules.inline = inline.breaks;
        } else {
          rules.inline = inline.gfm;
        }
      }

      this.tokenizer.rules = rules;
    }
    /**
     * Expose Rules
     */


    /**
     * Static Lex Method
     */
    Lexer.lex = function lex(src, options) {
      var lexer = new Lexer(options);
      return lexer.lex(src);
    }
    /**
     * Static Lex Inline Method
     */
    ;

    Lexer.lexInline = function lexInline(src, options) {
      var lexer = new Lexer(options);
      return lexer.inlineTokens(src);
    }
    /**
     * Preprocessing
     */
    ;

    var _proto = Lexer.prototype;

    _proto.lex = function lex(src) {
      src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');
      
      let at = {line: 0, column: 0, index: 0};
      
      this.blockTokens(src, this.tokens, true, at);
      
      at = {line: 0, column: 0, index: 0};
      
      this.inline(this.tokens, at);
      return this.tokens;
    }
    /**
     * Lexing
     */
    ;
    
    _proto.copyAt = function copyAt(at) {
      return {line: at.line, column: at.column, index: at.index};
    }
    
    _proto.advance = function advance(src, at, count) {
      for (let i = 0; i < count; ++i) {
        let c = src[i];
        if (c == '\n') {
          ++at.line;
          at.column = 0;
        } else {
          ++at.column;
        }
        ++at.index;
      }
    }
    
    _proto.eatToken = function eatToken(src, token, at) {
      let textStartOffset = src.indexOf(token.text);
      token.textStart = this.copyAt(at);
      this.advance(src, token.textStart, textStartOffset);
      
      token.start = this.copyAt(at);
      this.advance(src, at, token.raw.length-1);
      token.end   = this.copyAt(at);
      this.advance(src[token.raw.length-1], at, 1);
      return src.substring(token.raw.length);
    }

    _proto.blockTokens = function blockTokens(src, tokens, top, at) {
      var _this = this;

      if (tokens === void 0) {
        tokens = [];
      }

      if (top === void 0) {
        top = true;
      }
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (this.options.pedantic) {
        src = src.replace(/^ +$/gm, '');
      }

      var token, i, l, lastToken, cutSrc, lastParagraphClipped;

      while (src) {
        if (this.options.extensions && this.options.extensions.block && this.options.extensions.block.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this, src, tokens)) {
            src = _this.eatToken(src, token, at); // `this` is not the Lexer inside this callback
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // newline


        if (token = this.tokenizer.space(src)) {
          src = this.eatToken(src, token, at);

          if (token.type) {
            tokens.push(token);
          }

          continue;
        } // code


        if (token = this.tokenizer.code(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1]; // An indented code block cannot interrupt a paragraph.

          if (lastToken && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // fences


        if (token = this.tokenizer.fences(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // heading


        if (token = this.tokenizer.heading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // table no leading pipe (gfm)


        if (token = this.tokenizer.nptable(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // hr


        if (token = this.tokenizer.hr(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // blockquote


        if (token = this.tokenizer.blockquote(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.blockTokens(token.text, [], top, this.copyAt(token.textStart));
          tokens.push(token);
          continue;
        } // list


        if (token = this.tokenizer.list(src)) {
          src = this.eatToken(src, token, at);
          l = token.items.length;

          for (i = 0; i < l; i++) {
            token.items[i].tokens = this.blockTokens(token.items[i].text, [], false, this.copyAt(token.textStart));
          }

          tokens.push(token);
          continue;
        } // html


        if (token = this.tokenizer.html(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // def


        if (top && (token = this.tokenizer.def(src))) {
          src = this.eatToken(src, token, at);

          if (!this.tokens.links[token.tag]) {
            this.tokens.links[token.tag] = {
              href: token.href,
              title: token.title
            };
          }

          continue;
        } // table (gfm)


        if (token = this.tokenizer.table(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // lheading


        if (token = this.tokenizer.lheading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // top-level paragraph
        // prevent paragraph consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startBlock) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this.options.extensions.startBlock.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (top && (token = this.tokenizer.paragraph(cutSrc))) {
          lastToken = tokens[tokens.length - 1];

          if (lastParagraphClipped && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          lastParagraphClipped = cutSrc.length !== src.length;
          src = this.eatToken(src, token, at);
          continue;
        } // text


        if (token = this.tokenizer.text(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _proto.inline = function inline(tokens, at) {
      var i, j, k, l2, row, token;
      var l = tokens.length;

      for (i = 0; i < l; i++) {
        token = tokens[i];

        switch (token.type) {
          case 'paragraph':
          case 'text':
          case 'heading':
            {
              token.tokens = [];
              this.inlineTokens(token.text, token.tokens, this.copyAt(token.textStart));
              break;
            }

          case 'table':
            {
              token.tokens = {
                header: [],
                cells: []
              }; // header

              l2 = token.header.length;

              for (j = 0; j < l2; j++) {
                token.tokens.header[j] = [];
                this.inlineTokens(token.header[j], token.tokens.header[j], at);
              } // cells


              l2 = token.cells.length;

              for (j = 0; j < l2; j++) {
                row = token.cells[j];
                token.tokens.cells[j] = [];

                for (k = 0; k < row.length; k++) {
                  token.tokens.cells[j][k] = [];
                  this.inlineTokens(row[k], token.tokens.cells[j][k], at);
                }
              }

              break;
            }

          case 'blockquote':
            {
              this.inline(token.tokens, at);
              break;
            }

          case 'list':
            {
              l2 = token.items.length;

              for (j = 0; j < l2; j++) {
                this.inline(token.items[j].tokens, at);
              }

              break;
            }
        }
      }

      return tokens;
    }
    /**
     * Lexing/Compiling
     */
    ;

    _proto.inlineTokens = function inlineTokens(src, tokens, at, inLink, inRawBlock) {
      var _this2 = this;
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (tokens === void 0) {
        tokens = [];
      }

      if (inLink === void 0) {
        inLink = false;
      }

      if (inRawBlock === void 0) {
        inRawBlock = false;
      }

      var token, lastToken, cutSrc; // String with links masked to avoid interference with em and strong

      var maskedSrc = src;
      var match;
      var keepPrevChar, prevChar; // Mask out reflinks

      if (this.tokens.links) {
        var links = Object.keys(this.tokens.links);

        if (links.length > 0) {
          while ((match = this.tokenizer.rules.inline.reflinkSearch.exec(maskedSrc)) != null) {
            if (links.includes(match[0].slice(match[0].lastIndexOf('[') + 1, -1))) {
              maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.reflinkSearch.lastIndex);
            }
          }
        }
      } // Mask out other blocks


      while ((match = this.tokenizer.rules.inline.blockSkip.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.blockSkip.lastIndex);
      } // Mask out escaped em & strong delimiters


      while ((match = this.tokenizer.rules.inline.escapedEmSt.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '++' + maskedSrc.slice(this.tokenizer.rules.inline.escapedEmSt.lastIndex);
      }

      while (src) {
        if (!keepPrevChar) {
          prevChar = '';
        }

        keepPrevChar = false; // extensions

        if (this.options.extensions && this.options.extensions.inline && this.options.extensions.inline.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this2, src, tokens)) {
            // `this` is not the lexer inside this callback, so use the captured reference
            src = _this2.eatToken(src, token, at);
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // escape


        if (token = this.tokenizer.escape(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // tag


        if (token = this.tokenizer.tag(src, inLink, inRawBlock)) {
          src = this.eatToken(src, token, at);
          inLink = token.inLink;
          inRawBlock = token.inRawBlock;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // link


        if (token = this.tokenizer.link(src)) {
          src = this.eatToken(src, token, at);

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), true, inRawBlock);
          }

          tokens.push(token);
          continue;
        } // reflink, nolink


        if (token = this.tokenizer.reflink(src, this.tokens.links)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), true, inRawBlock);
            tokens.push(token);
          } else if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // em & strong


        if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // code


        if (token = this.tokenizer.codespan(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // br


        if (token = this.tokenizer.br(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // del (gfm)


        if (token = this.tokenizer.del(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // autolink


        if (token = this.tokenizer.autolink(src, mangle)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // url (gfm)


        if (!inLink && (token = this.tokenizer.url(src, mangle))) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // text
        // prevent inlineText consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startInline) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this2.options.extensions.startInline.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (token = this.tokenizer.inlineText(cutSrc, inRawBlock, smartypants)) {
          src = this.eatToken(src, token, at);

          if (token.raw.slice(-1) !== '_') {
            // Track prevChar before string of ____ started
            prevChar = token.raw.slice(-1);
          }

          keepPrevChar = true;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _createClass(Lexer, null, [{
      key: "rules",
      get: function get() {
        return {
          block: block,
          inline: inline
        };
      }
    }]);

    return Lexer;
  }();

nidoro · 2021-07-14T02:03:58Z

I think this is going to be much harder (nearly impossible) because of the line src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');. If the user uses tabs we won't be able to tell if four spaces are supposed to be one character or four.

Another way to make it work is to let the user tell how many spaces a tab corresponds to.
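
One way around the ambiguity is to record where each normalized character came from, so that positions found in the lexed text can be mapped back to the raw input. The following is a minimal sketch, assuming the normalization consists of exactly the two replacements quoted above; `normalizeWithMap` and its `tabWidth` parameter are hypothetical names and are not part of marked.

```javascript
// Hypothetical helper, not part of marked: applies the same two replacements
// and records, for every character of the normalized string, the index it
// came from in the original input.
function normalizeWithMap(src, tabWidth = 4) {
  let text = "";
  const map = []; // map[i] = index in src that produced text[i]
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (ch === "\r") {
      // "\r\n" and a lone "\r" both become a single "\n"
      text += "\n";
      map.push(i);
      if (src[i + 1] === "\n") i++;
    } else if (ch === "\t") {
      // a tab becomes tabWidth spaces, all pointing back at the tab
      for (let k = 0; k < tabWidth; k++) {
        text += " ";
        map.push(i);
      }
    } else {
      text += ch;
      map.push(i);
    }
  }
  map.push(src.length); // lets an end-of-input offset be mapped too
  return { text, map };
}

const normalized = normalizeWithMap("a\r\n\tb");
console.log(JSON.stringify(normalized.text)); // "a\n    b"
console.log(normalized.map[6]);               // 4, the index of "b" in the original
```

With such a map, an offset computed against the normalized text can be translated to an offset in what the user actually typed, at the cost of one extra pass over the input.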

UziTech · 2021-07-14T02:07:03Z

marked does save the raw src in the token so I think like @calculuschild said it might be easiest to use walkTokens to add the line and column information. Then it can be an extension people can use if they want this information.

nidoro · 2021-07-14T02:19:18Z

Correct me if I'm wrong, but I think the raw member gets modified during the lexing process. I've seen lines in the source code like this: lastToken.raw += '\n' + token.raw;, which makes me think the raw member is not really equivalent to the user input. If I tried to calculate what line each token is at based on this raw string, I would get incorrect results.

calculuschild · 2021-07-14T02:25:52Z

I think the raw member gets modified during the lexing process.

This is simply merging two adjacent tokens. Occasionally we have to break a paragraph in half to check if a block of code or something else is beginning at that point. If it turns out that the second token is just the rest of a paragraph then we merge them back together. That's all that is happening; it should end up equivalent to the user input.

calculuschild · 2021-07-14T02:29:12Z

In fact this is another good reason to do this in walkTokens. Some of the tokens are not completely formed when they are first added to the array, but waiting until walkTokens would ensure that the raw values are properly merged and accurate.

nidoro · 2021-07-14T02:41:34Z

I see. I guess there is no advantage in changing the Lexer directly then. Thanks for the clarification. I'll probably still use the solution I'm working on, as it is nearly done, but if I ever feel the need of using the walkTokens solution I'll share it here. Feel free to close the issue.

Again, thank you for the library!

nidoro · 2021-07-21T01:50:22Z

Hello again,

I abandoned the idea of modifying the lexer. Now I'm trying to use the raw tokens returned by marked.lexer(...) to calculate the position of each token. I was assuming that the concatenation of the raw members of first level tokens would give me the original input. Unfortunately, that's not the case. The following call...

marked.lexer("> blockquote\n\nparagraph");

returns these tokens:

[
  { type: "blockquote", raw: "> blockquote\n", … }
  { type: "paragraph", raw: "paragraph", … }
]

If I tried to calculate the starting line of the paragraph (or any of the following tokens) using the previous token as a reference, the line would be 2 (if we count from 1). But in the original input, the paragraph is at line 3, like so:

1 | > blockquote
2 | 
3 | paragraph

I'm working around this behavior by adding a line when I encounter a blockquote, but I don't know if that's reliable. So I have two questions:

Is this a bug in marked? I think the raw member would be more intuitive if the original input could be reconstructed from it alone.
Is skipping a line after a blockquote reliable? And are there other cases in which the same issue will happen?

nidoro · 2021-07-21T02:17:41Z

Welp, there is definitely no way raw can be used to reconstruct the input. The following two calls:

marked.lexer("> blockquote\n# heading");
marked.lexer("> blockquote\n\n# heading");

return the exact same tokens:

[
  { type: "blockquote", raw: "> blockquote\n", text: "blockquote\n", … }
  { type: "heading", raw: "# heading", depth: 1, … }
]

So it is impossible to decide if the user entered one line or two lines after a blockquote.

Is there a fix or workaround for this?

calculuschild · 2021-07-21T04:51:24Z

raw should be the complete string that was consumed by the token. I can see how some beginning or ending newlines might be miscalculated in an off-by-one error somewhere that just happened to not affect the final HTML output.

So yes, it is a bug if the raw does not actually match the text that was consumed.
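
That expectation can be turned into a quick check. This is only a sketch: `findRawMismatch` is a hypothetical helper, not a marked API, and it assumes `src` is the already-normalized input (line endings and tabs replaced as the lexer does). The tokens in the usage lines are hand-written to mirror the output reported above rather than produced by calling marked.

```javascript
// Hypothetical check, not part of marked: do the top-level raw strings
// concatenate back to the source that was lexed?
function findRawMismatch(src, tokens) {
  let offset = 0;
  for (const token of tokens) {
    if (!src.startsWith(token.raw, offset)) {
      return { offset, type: token.type }; // first token whose raw diverges
    }
    offset += token.raw.length;
  }
  // Leftover input after the last token is also a mismatch.
  return offset === src.length ? null : { offset, type: null };
}

const reported = [
  { type: "blockquote", raw: "> blockquote\n" },
  { type: "paragraph", raw: "paragraph" },
];
console.log(findRawMismatch("> blockquote\n\nparagraph", reported));
// { offset: 13, type: 'paragraph' }
```

A `null` result means the raw values account for every character of the input, which is the precondition for any running-sum approach to offsets.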

UziTech · 2021-07-21T04:52:42Z

Is there a fix or workaround for this?

Not yet, but if you want to create a PR we would be very appreciative 👍

It looks like the space token doesn't always get saved to the token list (not sure why)

marked/src/Lexer.js

Lines 143 to 149 in e7b04a7

    
    if (token = this.tokenizer.space(src)) {
      src = src.substring(token.raw.length);
      if (token.type) {
        tokens.push(token);
      }
      continue;
    }

bartnv · 2021-09-02T11:25:41Z

I have a similar need for a project. It's a notes app that uses Marked to render, with some interactivity in the html (like checking a checkbox) having a direct result in the source text (adding the x between [ ]). It would be very helpful for this kind of thing for Marked to expose the position of the rendered element in the source text.

I'll have a look at the walkTokens approach sometime soon. I'm only interested in the offset for the token, but it should be easy to go from there to line/column values by counting the newlines up to the offset.

bartnv · 2021-10-05T22:13:38Z

So I've tried the walkTokens approach, but it went south pretty quickly. My approach was to keep a running sum of the length of the raw field in each token. It sort-of works for block level elements as long as the source text doesn't contain tabs or Windows line-endings, but even then I had to account for quirks in the parser (for instance the last newline of a paragraph is never shown in any raw field if the paragraph is immediately followed by another block-level element). For inline items it all falls down. You'd need to know their offset from their containing block element, but there's no clean way to figure out how much of the raw input was consumed by the block-level element. For instance an H1 heading can validly start with "# " or " # ". That influences the offset of its inline elements, but you'd need to re-parse the raw string to figure that out. I don't want to account for all current and future possibilities there, so I think this is a dead end. For reference, this is the walkTokens function I was testing with (WHICH GIVES INVALID RESULTS, you have been warned):

  let walkTokens = function(token) {
    if (!token.seen) {
      token.offset = app.tokenoffset;
      app.tokenoffset += token.raw.length;
      if (token.type == 'paragraph') app.tokenoffset += 1;
    }
    if (token.tokens) { // Mark inline elements as seen as to not double-count them
      for (let item of token.tokens) item.seen = true;
    }
  };

The app object is my global state. app.tokenoffset needs to be initialized to 0 before each render.

UziTech · 2022-01-06T15:37:59Z

Looks like raw should be fixed in v4.0.9 #2341

derekhu · 2022-02-20T02:06:54Z

`walkTokens` full example for token position

following @calculuschild #2134 (comment)

You may be able to get something to work with the walkTokens feature by tracking the sum of the token "raw" lengths and adding a property to each token with the current total. Things would get more complex once you start getting into sub-tokens though but it should be possible.

example:

    let walkTokens = (token) => {
        let subs = token.tokens || token.items;
        if(subs){
            let start = (token._start || 0);
            let subpos = 0;
            subs.forEach(sub => {
                let substart = token.raw.indexOf(sub.raw, subpos);  
                let sublen = sub.raw.length;
                sub._start = substart + start;
                sub._end = sub._start + sublen;
                subpos = substart + sublen;
            });
        }
    }

Note: simply summing the "raw" lengths may not work well. For example, in "- [ ] Task1" the 'text' token "Task1" does not include the length of "- [ ] ".
So before summing the raw lengths, search (indexOf) for each sub-token within its parent's raw at each level.

Full test case using mocha:

let {marked} = require("marked");
const assert = require("assert");

// ...

describe("marked walkTokens", () => {
    let walkTokens = (token) => {
        let subs = token.tokens || token.items;
        if(subs){
            let start = (token._start || 0);
            let subpos = 0;
            subs.forEach(sub => {
                let substart = token.raw.indexOf(sub.raw, subpos);  
                let sublen = sub.raw.length;
                sub._start = substart + start;
                sub._end = sub._start + sublen;
                subpos = substart + sublen;
            });
        }
    }

    let testWalk = (md) => {
        let tokens = marked.lexer(md);
        let vroot = [{
            raw: md,
            tokens,
        }];
        marked.walkTokens(vroot, walkTokens);
        return tokens;
    }


    it(`case`, function(){
        let md = 
`## Title
- [ ] Task1 [#25](https://url)
  - [ ] Task2
- Item3
- Item4

## OK
![image](https://url)
`;

        let tokens = testWalk(md);


        let tokenTask = null;
        marked.walkTokens(tokens, (token)=>{
            if(token.type == "list_item" &&  
               token.text == "Task2"){
                tokenTask = token;
            }
        });

        if(tokenTask){
            let {_start, _end} = tokenTask;

            let before = md.substring(0, _start);
            let after = md.substring(_end);

            let content = before + tokenTask.raw + after;

            assert(content == md);
        }

    });
});

Tokens output: tokens with `_start` and `_end` position in source markdown string

Tokens JSON detail:

[
  {
    "raw": "## Title\n- [ ] Task1 [#25](https://url)\n  - [ ] Task2\n- Item3\n- Item4\n\n## OK\n![image](https://url)\n",
    "tokens": [
      {
        "type": "heading",
        "raw": "## Title\n",
        "depth": 2,
        "text": "Title",
        "tokens": [
          {
            "type": "text",
            "raw": "Title",
            "text": "Title",
            "_start": 3,
            "_end": 8
          }
        ],
        "_start": 0,
        "_end": 9
      },
      {
        "type": "list",
        "raw": "- [ ] Task1 [#25](https://url)\n  - [ ] Task2\n- Item3\n- Item4",
        "ordered": false,
        "start": "",
        "loose": false,
        "items": [
          {
            "type": "list_item",
            "raw": "- [ ] Task1 [#25](https://url)\n  - [ ] Task2\n",
            "task": true,
            "checked": false,
            "loose": false,
            "text": "Task1 [#25](https://url)\n- [ ] Task2",
            "tokens": [
              {
                "type": "text",
                "raw": "Task1 [#25](https://url)\n",
                "text": "Task1 [#25](https://url)",
                "tokens": [
                  {
                    "type": "text",
                    "raw": "Task1 ",
                    "text": "Task1 ",
                    "_start": 15,
                    "_end": 21
                  },
                  {
                    "type": "link",
                    "raw": "[#25](https://url)",
                    "href": "https://url",
                    "title": null,
                    "text": "#25",
                    "tokens": [
                      {
                        "type": "text",
                        "raw": "#25",
                        "text": "#25",
                        "_start": 22,
                        "_end": 25
                      }
                    ],
                    "_start": 21,
                    "_end": 39
                  }
                ],
                "_start": 15,
                "_end": 40
              },
              {
                "type": "list",
                "raw": "- [ ] Task2",
                "ordered": false,
                "start": "",
                "loose": false,
                "items": [
                  {
                    "type": "list_item",
                    "raw": "- [ ] Task2",
                    "task": true,
                    "checked": false,
                    "loose": false,
                    "text": "Task2",
                    "tokens": [
                      {
                        "type": "text",
                        "raw": "Task2",
                        "text": "Task2",
                        "tokens": [
                          {
                            "type": "text",
                            "raw": "Task2",
                            "text": "Task2",
                            "_start": 48,
                            "_end": 53
                          }
                        ],
                        "_start": 48,
                        "_end": 53
                      }
                    ],
                    "_start": 42,
                    "_end": 53
                  }
                ],
                "_start": 42,
                "_end": 53
              }
            ],
            "_start": 9,
            "_end": 54
          },
          {
            "type": "list_item",
            "raw": "- Item3\n",
            "task": false,
            "checked": false,
            "loose": false,
            "text": "Item3",
            "tokens": [
              {
                "type": "text",
                "raw": "Item3",
                "text": "Item3",
                "tokens": [
                  {
                    "type": "text",
                    "raw": "Item3",
                    "text": "Item3",
                    "_start": 56,
                    "_end": 61
                  }
                ],
                "_start": 56,
                "_end": 61
              }
            ],
            "_start": 54,
            "_end": 62
          },
          {
            "type": "list_item",
            "raw": "- Item4",
            "task": false,
            "checked": false,
            "loose": false,
            "text": "Item4",
            "tokens": [
              {
                "type": "text",
                "raw": "Item4",
                "text": "Item4",
                "tokens": [
                  {
                    "type": "text",
                    "raw": "Item4",
                    "text": "Item4",
                    "_start": 64,
                    "_end": 69
                  }
                ],
                "_start": 64,
                "_end": 69
              }
            ],
            "_start": 62,
            "_end": 69
          }
        ],
        "_start": 9,
        "_end": 69
      },
      {
        "type": "space",
        "raw": "\n\n",
        "_start": 69,
        "_end": 71
      },
      {
        "type": "heading",
        "raw": "## OK\n",
        "depth": 2,
        "text": "OK",
        "tokens": [
          {
            "type": "text",
            "raw": "OK",
            "text": "OK",
            "_start": 74,
            "_end": 76
          }
        ],
        "_start": 71,
        "_end": 77
      },
      {
        "type": "paragraph",
        "raw": "![image](https://url)\n",
        "text": "![image](https://url)",
        "tokens": [
          {
            "type": "image",
            "raw": "![image](https://url)",
            "href": "https://url",
            "title": null,
            "text": "image",
            "_start": 77,
            "_end": 98
          }
        ],
        "_start": 77,
        "_end": 99
      }
    ]
  }
]

derekhu · 2022-02-20T05:17:07Z

Now my question is: how can extra information from a token, such as the _start and _end shown above, be rendered into the HTML? The renderer doesn't pass the token to the rendering functions, so no extra token data can be rendered to the DOM.

For an interactive application, we can't get token info from the DOM element the user acts on, for example clicking a - [ ] task checkbox to toggle - [ ] to - [x].

UziTech · 2022-02-20T06:09:50Z

To get a checkbox change event you can just use JavaScript

document.querySelector("input[type=checkbox]").addEventListener("change", () => {...})

derekhu · 2022-02-20T06:17:40Z

To get a checkbox change event you can just use JavaScript
document.querySelector("input[type=checkbox]").addEventListener("change", () => {...})

The event is the simple part. The question is how to know which task item should be updated.

for example:

- [ ] Task1
  - [ ] task1.1
- [ ] Task1
- [ ] Task3

When the user clicks the checkbox of the second Task1, which text should we update and toggle? Should we search for the text from the <li>? For more complex scenes, we need hints from the DOM. If the token were passed to the renderer, I could render the token positions _start and _end onto the <li> in the DOM. Then the click handler could get the text at that position, toggle it, and write it back to the main content.
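
The last step of that idea, toggling by position instead of by text search, could look roughly like this. It is a sketch under assumptions: `toggleTaskAt` is a hypothetical helper, and `start`/`end` are taken to be the source offsets of a list_item, such as the _start/_end values from the walkTokens example earlier in this thread.

```javascript
// Hypothetical helper: flips the task checkbox at the beginning of
// md.slice(start, end) and returns the updated source string.
function toggleTaskAt(md, start, end) {
  const item = md.slice(start, end);
  // optional indent, a bullet or ordered marker, whitespace, then "[ ]" or "[x]"
  const match = /^(\s*(?:[-*+]|\d+[.)])\s+\[)([ xX])\]/.exec(item);
  if (!match) return md; // not a task item, leave the source untouched
  const flipped = match[2] === " " ? "x" : " ";
  const boxIndex = start + match[1].length;
  return md.slice(0, boxIndex) + flipped + md.slice(boxIndex + 1);
}

const taskSource = "- [ ] Task1\n  - [ ] task1.1\n- [ ] Task1\n- [ ] Task3\n";
// The second "Task1" item occupies offsets 28 to 40 in taskSource.
console.log(toggleTaskAt(taskSource, 28, 40));
```

Because the edit is addressed by offset, the two identical "Task1" items are no longer ambiguous.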

UziTech · 2022-02-20T07:05:26Z

marked is only meant for converting markdown to html. For anything else you will need other tools.

derekhu · 2022-02-20T07:16:06Z

marked is only meant for converting markdown to html. For anything else you will need other tools.

Well, I think marked.js is great. More interactive things can be done through extensions.

For the title of this issue, "Position of a token in the source string", I have done it through marked.js's great extension mechanism.
Using walkTokens and raw gives a good structure of token and tokens. This is the greatest thing compared to other libraries.

However, we are only talking about "converting markdown to html". The html generated from tokens could be more extensible.

So, just like @bartnv mentioned, he is building a notes app that uses Marked to render. Interactive features can be done in an extended way.

I am doing a similar thing, and markedjs is great for me. If it could provide more extensibility between the token tree and html rendering, it would be better.

Detail:

https://marked.js.org/using_pro#block-level-renderer-methods

Thanks for reply

yuis-ice · 2022-02-22T10:53:32Z

I'm looking for this feature too. I think there is certainly some demand for it, for example passing the parsed markdown tokens (with line and character info) to a text editor such as VSCode so one can use it to develop an extension. Afaik there is no library that is fully capable of doing it.

I will try to take a look at the code and develop the feature when I have some free time. But I'd love to know if you have a fork where you completed the patching but cannot simply open a pull request against the repository.

9001 · 2022-02-22T11:05:49Z

I have been maintaining a patch for line numbers: https://github.com/9001/copyparty/blob/hovudstraum/scripts/deps-docker/marked-ln.patch
however, it has many problems:

there is an off-by-one at the start of every ~~table~~ list
i don't think it will be possible to add higher accuracy than line numbers
very hacky :p

UziTech · 2022-02-22T19:18:22Z

It seems like to get this working we need to have something like token type (block or inline) in the walkTokens function.

bartnv · 2022-02-25T19:50:59Z

I've revisited this with version 4.0.12. The off-by-one errors in the token.raw are indeed gone, thanks for that UziTech. The only thing I would need from core markedjs is for each block level token in walkTokens to have the offset to the first child (inline) token within its raw string. So for a paragraph that would be the number of spaces and '#' characters at its start. For a list_item that would be leading spaces, '*' or '-' characters and a possible checkbox. Etc.

I've looked briefly at how this could be accomplished. It would require setting the 'd' flag on the block level regexes to get the 'indices' property on the result that specifies the offsets of the submatches within the match. The tokenizer can then add this offset to the token object. If you'd be willing to consider this then I can prepare a PR.
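
As a rough illustration of what the 'd' flag provides, the sketch below uses a simplified heading pattern. The regex is illustrative only and is not the rule marked uses; `headingTextOffset` is a hypothetical name. The flag is part of ES2022 and needs a runtime that supports it (for example Node 16 or later).

```javascript
// Illustrative regex only; this is NOT marked's heading rule.
// The "d" flag adds match.indices, a [start, end) pair per capture group.
const headingWithIndices = /^ {0,3}(#{1,6})[ \t]+([^\n]*)/d;

function headingTextOffset(raw) {
  const match = headingWithIndices.exec(raw);
  if (!match) return null;
  return match.indices[2][0]; // where the heading text begins inside raw
}

console.log(headingTextOffset("# Title"));    // 2
console.log(headingTextOffset(" ## Title\n")); // 4
```

A tokenizer could store such an offset on the block token, so that inline children can be positioned without re-parsing the raw string.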

UziTech · 2022-02-25T21:28:35Z

@bartnv that would be great if you could create a PR. The one thing I would want to watch out for is bringing down the speed of marked. You can run npm run bench to run marked against commonmark.js and markdown-it. That will run the CommonMark specs against each 1000 times. Currently marked is a bit behind the others because the spec contains mostly edge cases.

louwers · 2022-05-27T21:04:07Z

I have been maintaining a patch for line numbers: https://github.com/9001/copyparty/blob/hovudstraum/scripts/deps-docker/marked-ln.patch

I have forked marked, applied the patch and published it to npm as @louwers/marked. In case anyone wants to play with it already, this may be a bit easier than applying a patch.

Looking forward to this being a standard feature of marked.

DLiblik · 2022-08-11T15:28:03Z

Late to the party, but we've been trying to use marked in a wiki editor but unfortunately, though it's a great library, the mangling done to the src prior to, and during, lexing makes re-adding position tracking a negative force on performance.

This is hard to undo at this point without a significant rewrite, but if done well, position tracking should be a positive (not negative) force on performance. Removing tabs for example is not necessary if the logic for identifying tokens can assimilate tabs as well.

This would open marked up to many use cases that cannot be addressed currently (syntax highlighting of the source, link replacement across other MD documents when a target document is moved, etc.). Basically right now marked is trapped in a one-way translative role until this gets addressed. We'd PR it, but it's such a big "touch" to the code that it seems like it actually needs an in-project epic to migrate there.

Is that a possibility?

DLiblik · 2022-08-11T15:58:01Z

...or maybe promote it as a markdown renderer (its clear focus area: ideal for rendering into other language spaces) instead of as a general parser (for us it took time to find out it's not really usable as a parser except as a step towards rendering - a "true" parser suggests focus on the source document, not the output - we assumed it could do both and now are faced with switching to another lib).

Might save others like us from investing time with marked for a purpose it declares but ultimately isn't designed to address. It's clearly excellent for rendering, but definitely not for source document analysis of any depth. This is meant as positive feedback - better to bill it as a car, not a pickup truck, and avoid surprises.

UziTech · 2022-08-11T18:34:03Z

We'd PR it

Any way to improve marked is ok with me. 😁👍

I'm curious, at a high level, what would be needed to make marked work the way you explain?

maybe promote it as a markdown renderer

If you would like to create a PR that updates docs/readme I would be ok with that as well. I do, however, disagree that marked is a markdown renderer. Marked does not render markdown, it parses markdown to render it as HTML.

We do parse the markdown into tokens. Just because it doesn't have the information you want doesn't make it any less of a parser, but perhaps we could do a better job explaining what information is available.

dead-claudia · 2024-09-02T12:23:34Z

Ran into #3440 while attempting to expand on #2134 (comment). We've got a lot of markdown, and the tabs issue I'm pretty sure is the only thing standing in my way from having a working line counter for CI reporting.

Once tab positions are resolved, the walkTokens could use that to include source start offsets. Source offsets can then be converted fairly easily into line and column numbers outside Marked:

// There's more efficient ways, but this can work in a pinch for Windows and *nix.
const getLocation = (contents, offset) => {
	let source = contents.slice(0, offset)
	let line = 1
	let next = -1
	let prev = -1

	while ((next = source.indexOf("\n", prev + 1)) >= 0) {
		line++
		prev = next
	}

	// `startOffset` is not defined in this function; the parameter is `offset`
	return {offset, line, column: offset - prev}
}

const getSpanLocations = (contents, token, startOffset) => {
	const endOffset = startOffset + token.raw.length
	return {
		start: getLocation(contents, startOffset),
		end: getLocation(contents, endOffset),
	}
}
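
For many lookups against the same document, one of the more efficient ways mentioned in the comment above is to index the line starts once and binary-search per offset. This is a sketch only; `buildLineStarts` and `locate` are illustrative names and not part of marked. Lines and columns are 1-based, like getLocation above.

```javascript
// Index the offset at which every line begins (done once per document).
function buildLineStarts(contents) {
  const starts = [0];
  for (let i = 0; i < contents.length; i++) {
    if (contents[i] === "\n") starts.push(i + 1);
  }
  return starts;
}

// Binary search for the last line start that is <= offset.
function locate(lineStarts, offset) {
  let lo = 0;
  let hi = lineStarts.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (lineStarts[mid] <= offset) lo = mid;
    else hi = mid - 1;
  }
  return { offset, line: lo + 1, column: offset - lineStarts[lo] + 1 };
}

const lineStarts = buildLineStarts("ab\ncd\n\nef");
console.log(locate(lineStarts, 4)); // { offset: 4, line: 2, column: 2 }
```

Building the index is linear in the document length, and each lookup afterwards is logarithmic in the number of lines.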

calculuschild added the proposal label Jul 12, 2021

UziTech mentioned this issue Dec 30, 2021

The lexer appears to handle new lines incorrectly #2340

Closed

dead-claudia mentioned this issue Sep 2, 2024

Tabs are not preserved in raw output #3440

Closed
