Low-level tokenization mode for Glow #206

Open
fabiospampinato opened this issue Feb 17, 2024 · 7 comments

@fabiospampinato

I've been thinking about ways to remove Oniguruma from my bundles. It's needed for handling TextMate grammars, which are commonly used to syntax-highlight languages, and Glow seems like a very interesting way to do that and more.

I'm building a new experimental tiny code editor for the web, and I'd be interested in wiring it up with Glow. For that I'm not interested in emitting HTML; rather, I'd need the syntax highlighter to return a list of tokens, which would basically tell me what color to use at each range.

Is there any interest in adding a low-level tokenization function like that?

Something a bit lower-level than that, where one can explicitly ask the syntax highlighter for tokens line by line, so that the main thread is never potentially blocked for a long time, would be ideal, but for a lot of use cases something simpler should suffice.
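To illustrate, here's a minimal sketch of the shape I have in mind. `tokenizeLine` and `tokenize` are hypothetical names, not existing Glow APIs, and the tokenizer here is a trivial stub:

```javascript
// Hypothetical sketch of the requested API; tokenizeLine and tokenize
// are made-up names, not existing Glow functions.

// Stub per-line tokenizer: splits a line into word / non-word runs.
// A real implementation would classify tokens per language instead.
function tokenizeLine(line, lang) {
  const tokens = [];
  const re = /\w+|\s+|[^\w\s]+/g;
  let match;
  while ((match = re.exec(line))) {
    tokens.push({
      start: match.index,
      end: match.index + match[0].length,
      type: /^\w/.test(match[0]) ? 'word' : 'other',
    });
  }
  return tokens;
}

// Line-by-line driver: a generator lets the editor pull one line of
// tokens at a time and yield back to the event loop in between, so the
// main thread is never blocked for long on big files.
function* tokenize(code, lang) {
  for (const line of code.split('\n')) {
    yield tokenizeLine(line, lang);
  }
}
```

An editor could then consume this incrementally, e.g. a chunk of lines per animation frame.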

@tipiirai
Contributor

Sounds like a legit idea. I was planning to implement clearer parsing/tokenization and rendering phases because there is a need for more customized highlighting per language.

I'm sorry this answer took so long. My mind has been occupied with the upcoming design system, but I'm planning to make a round of updates to Glow and Nuekit internals before launching it.

Thanks

@tipiirai tipiirai self-assigned this Feb 21, 2024
@tipiirai
Contributor

@fabiospampinato there is a public parseRow method that, as of the most recent commit, also understands inline comments. It returns an array of tokens in the following format:

[
  { start: 0, end: 1, tag: "i", re: /[^\w \u2022]/g },
  { start: 11, end: 18, tag: "em", re: /'[^']*'|"[^"]*"/g, is_string: true },
  ...
]

Where start is the start index and end is the end index in the input string.

Hope this helps. This method only understands individual rows, so it has no clue about multi-line comments.

@fabiospampinato
Author

Nice, thanks 👍 Do the tokens cover the entire input string? For example, what should happen in that example between indexes 1 and 11?
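For reference, this is the kind of gap-filling pass I'd otherwise have to run myself on the consumer side. `fillGaps` is my own sketch, not a Glow API, and it assumes the tokens are sorted and non-overlapping:

```javascript
// Sketch of a consumer-side gap-filling pass (fillGaps is my own
// helper, not part of Glow): insert untagged tokens for any ranges the
// tokenizer leaves uncovered. Assumes tokens are sorted by start index
// and do not overlap.
function fillGaps(tokens, length) {
  const out = [];
  let pos = 0;
  for (const token of tokens) {
    if (token.start > pos) out.push({ start: pos, end: token.start, tag: null }); // uncovered gap
    out.push(token);
    pos = token.end;
  }
  if (pos < length) out.push({ start: pos, end: length, tag: null }); // trailing gap
  return out;
}
```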

@fabiospampinato
Author

@tipiirai the new function is not exported from the entrypoint, could you fix this?

@fabiospampinato
Author

@tipiirai the tokenization seems a bit wrong. With this code:

import {parseRow} from 'nue-glow/src/glow.js';

const code = "import shiki from 'shiki';";
const lang = "js";

const tokens = parseRow(code, lang);

I get the following tokens:

{start: 0, end: 6, tag: 'strong', re: /\b(null|true|false|undefined|import|from|async|aw…l|until|next|bool|ns|defn|puts|require|each)\b/gi}
{start: 13, end: 17, tag: 'strong', re: /\b(null|true|false|undefined|import|from|async|aw…l|until|next|bool|ns|defn|puts|require|each)\b/gi}
{start: 18, end: 25, tag: 'em', re: /'[^']*'|"[^"]*"/g, is_string: true}
{start: 18, end: 19, tag: 'i', re: /[^\w •]/g}
{start: 24, end: 25, tag: 'i', re: /[^\w •]/g}
{start: 25, end: 26, tag: 'i', re: /[^\w •]/g}

These are problematic: you can spot right away that three tokens each wrap a single character, but our input string ends with shiki';, so there's no reasonable scenario in which three length-1 tokens would appear at the end.

If I explicitly slice those ranges off from the input string I get this array:

['import', 'from', "'shiki'", "'", "'", ';']

So basically there are two tokens for the string's apostrophes that shouldn't exist 😢
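As a temporary workaround on my side I can drop tokens that are fully contained inside a preceding token. A sketch (`dropNested` is my own helper, not part of Glow; it assumes tokens are sorted by start index, as in the output above):

```javascript
// Workaround sketch (my own code, not part of Glow): drop any token
// fully contained inside the previously kept token, e.g. the stray 'i'
// tokens that fall inside the 'shiki' string token. Assumes tokens are
// sorted by start index.
function dropNested(tokens) {
  const out = [];
  for (const token of tokens) {
    const prev = out[out.length - 1];
    if (prev && token.start >= prev.start && token.end <= prev.end) continue; // nested: skip
    out.push(token);
  }
  return out;
}
```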

@fabiospampinato
Author

fabiospampinato commented Feb 28, 2024

I just released the "convenient" highlighter/tokenizer on top of Glow that I had in mind: https://twitter.com/fabiospampinato/status/1762965155841773879

Generally, FWIW, I really like this approach, and if more effort were put into refining the syntax highlighter, I think it could actually be pretty decent for a lot of use cases.

Some areas that IMO would be nice if they could be improved:

  1. Producing complete tokens that cover every input character.
  2. Not producing unnecessary tokens, like the ones mentioned in the message above.
  3. Improving support for languages nested inside other languages, like JS inside a <script> tag.
  4. Maybe special-casing more constructs, like also rendering things that look like unary/binary/ternary operators in the accent color.
  5. Refining keyword detection so that a word is not considered a keyword when it comes right after a `.`.
  6. Detecting backtick-delimited strings as strings too.
  7. Possibly refining syntax highlighting for lots of other little edge cases.
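For item 6, assuming the string regex shown earlier in this thread (/'[^']*'|"[^"]*"/g), adding a backtick alternative could be as simple as the sketch below. It deliberately ignores ${...} interpolation and escaped delimiters:

```javascript
// Sketch for item 6: extend the string regex reported by parseRow
// (/'[^']*'|"[^"]*"/g) with a backtick alternative. This deliberately
// ignores ${...} interpolation and escaped delimiters.
const STRING_RE = /'[^']*'|"[^"]*"|`[^`]*`/g;

const sample = 'const s = `hello`; const t = "world";';
const strings = sample.match(STRING_RE);
```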

IMO, with relatively few tweaks, it would come much closer to the quality that TextMate can achieve, in a lot more cases.


Example comparison I got, with Glow on the left and TextMate on the right:

[Screenshot: side-by-side highlighting comparison, Glow on the left, TextMate on the right]

Code I used for the example:

import shiki from 'shiki';

// Some example code

shiki
  .getHighlighter({
    theme: 'nord',
    langs: ['js'],
  })
  .then(highlighter => {
    const code = highlighter.codeToHtml(`console.log('shiki');`, { lang: 'js' })
    document.getElementById('output').innerHTML = code
  });
