Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Request: Replace graphemer with Intl.Segmenter #17835

Closed
1 task done
ota-meshi opened this issue Dec 11, 2023 · 18 comments · Fixed by #18110
Closed
1 task done

Change Request: Replace graphemer with Intl.Segmenter #17835

ota-meshi opened this issue Dec 11, 2023 · 18 comments · Fixed by #18110
Labels
core Relates to ESLint's core APIs and features enhancement This change enhances an existing feature of ESLint

Comments

@ota-meshi
Copy link
Member

ota-meshi commented Dec 11, 2023

ESLint version

v8.55.0

What problem do you want to solve?

Related to #17725 (comment)

If we use the new Node.js we can use Intl.Segmenter instead of graphemer.
Intl.Segmenter can be used to reduce ESLint's dependency packages.

What do you think is the correct solution?

Replace graphemer with Intl.Segmenter.

before:

    if (!splitter) {
        splitter = new Graphemer();
    }

    return splitter.countGraphemes(value);

after:

    if (!segmenter) {
        segmenter = new Intl.Segmenter();
    }

    return [...segmenter.segment(value)].length;

Participation

  • I am willing to submit a pull request for this change.

Additional comments

I tested the compatibility of graphemer and Intl.Segmenter. The test case in the graphemer repository did not change the result even if I replaced it with Intl.Segmenter.
However, "अनुच्छेद" in the README has a different result.

graphemer: 'अ', 'नु', 'च्', 'छे', 'द'
Intl.Segmenter: 'अ', 'नु', 'च्छे', 'द'

(It is not known whether this is a bug or intentional.) See #17835 (comment)

The results will change, but I think it will be accepted by users as a minor breaking change if included in ESLint v9.


Repositories used for testing:
https://github.com/ota-meshi/graphemer/tree/compat-intl-segmenter
Commit:
ota-meshi/graphemer@de301e6

@ota-meshi ota-meshi added enhancement This change enhances an existing feature of ESLint core Relates to ESLint's core APIs and features labels Dec 11, 2023
@ota-meshi
Copy link
Member Author

ota-meshi commented Dec 11, 2023

graphemer: 'अ', 'नु', 'च्', 'छे', 'द'
Intl.Segmenter: 'अ', 'नु', 'च्छे', 'द'

Probably related to orling/grapheme-splitter#26

@ota-meshi
Copy link
Member Author

graphemer seems to be unicode v15.0 compliant and uses GraphemeBreakTest v15.0 for testing.
However, if I replace this with GraphemeBreakTest v15.1, graphemer will fail the test. The test also passes in GraphemeBreakTest v15.1 when using Intl.Segmenter.

https://github.com/flmnt/graphemer/blob/master/tests/GraphemeBreakTest.txt
https://www.unicode.org/Public/15.0.0/ucd/auxiliary/GraphemeBreakTest.txt
https://www.unicode.org/Public/15.1.0/ucd/auxiliary/GraphemeBreakTest.txt

@mdjermanovic
Copy link
Member

 if (!segmenter) {
        segmenter = new Intl.Segmenter();
    }

Per MDN - Intl.Segmenter() constructor, it expects a locales argument. Per locales argument docs, if it's omitted, the implementation's default locale will be used.

Does that mean that we could have different behavior in different environments? For example, linting passes locally but then fails on CI because the default locale is different?

@nzakas
Copy link
Member

nzakas commented Jan 23, 2024

@ota-meshi are you still interesting in working on this? Are you still doing research?

@ota-meshi
Copy link
Member Author

I haven't looked into it much since then. Perhaps the language in which the problem seems to be a language that is unfamiliar to me so I am having trouble understanding it 😓

@nzakas
Copy link
Member

nzakas commented Jan 24, 2024

If that's the case, shall we close this?

@ota-meshi
Copy link
Member Author

ota-meshi commented Jan 24, 2024

What do you think is the best way to resolve the issue of incompatibility with Unicode v15.1?
Would you like to close this issue by reporting it on graphemer repository? In my opinion it is better to use Intl.Segmenter.

@kecrily
Copy link
Member

kecrily commented Jan 25, 2024

Can we specify en-US as the default language?

@ota-meshi
Copy link
Member Author

That idea sounds good to me! At least I don't think the behavior will change depending on the environment.

@fasttime
Copy link
Member

fasttime commented Feb 8, 2024

I like the idea of using Intl.Segmenter for simplicity but the results it produces will inevitably change when a new version of Node.js switches to a newer version of Unicode. The effect would be that the same rule on the same code could give different results on different versions of Node.js. I don't think this is terribly bad, I think it's actually better than depending on a huge semi-automatically generated program that eventually gets outdated (lib/rules/utils/patterns/letters.js is another one), but it's a complication to keep in mind. Regular expressions with character properties like \p{Mc} can also change their behavior with the version of Unicode, and we are already using them.

@nzakas
Copy link
Member

nzakas commented Feb 9, 2024

I think we're at the point where we either need someone to prototype this or we should close this issue. Any volunteers?

@fasttime
Copy link
Member

fasttime commented Feb 10, 2024

I went ahead and prepared a draft: #18110.

One problem that hasn't been discussed yet is that Intl.Segmenter is not implemented in Firefox (though it could be soon, the corresponding issue is marked as fixed: https://bugzilla.mozilla.org/show_bug.cgi?id=1423593). This means that if a rule using Intl.Segmenter were opened in the Playground in Firefox, it would crash. So if we decide to go for this change, we should at least wait until Intl.Segmenter is supported in all major browsers.

@nzakas
Copy link
Member

nzakas commented Feb 13, 2024

One problem that hasn't been discussed yet is that Intl.Segmenter is not implemented in Firefox

Ooh, great catch! I agree, we should definitely hold off until Firefox supports this.

Copy link

Oops! It looks like we lost track of this issue. What do we want to do here? This issue will auto-close in 7 days without an update.

@github-actions github-actions bot added the Stale label Mar 14, 2024
@nzakas
Copy link
Member

nzakas commented Mar 21, 2024

Blocked by Firefox implementing Intl.Segmenter, so not stale.

@github-actions github-actions bot removed the Stale label Mar 21, 2024
@fasttime
Copy link
Member

Intl.Segmenter is unflagged in Firefox 125, so we could consider using it instead of graphemer now. Anyway, there is also an alternative proposal to replace graphemer with a different package: #18359.

@nzakas
Copy link
Member

nzakas commented Apr 18, 2024

I'd definitely prefer using Intl.Segmenter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Relates to ESLint's core APIs and features enhancement This change enhances an existing feature of ESLint
Projects
Status: Complete
Development

Successfully merging a pull request may close this issue.

5 participants