Unicode Normalization #2512 (Draft)

karwa wants to merge 4 commits into main
Conversation

@karwa (Contributor) commented Jul 3, 2024

No description provided.

On Jul 8, 2024, @rjmccall added the labels “standard library addition” (Additive changes to the standard library) and “LSG” (Contains topics under the domain of the Language Steering Group).
@milseman (Member) commented
At this point, it probably makes sense to put the pitch up sooner rather than fully incorporate the feedback below. The below might make sense as an "Additional Design Considerations" section. Some of the API feedback we could incorporate right away.

Normal forms and data tables

The stdlib currently ships the data tables needed for the NFC and NFD normal forms, but not the NFKC and NFKD tables. NFKC and NFKD serve purposes other than checking canonical equivalence, so it may make sense to relegate those APIs to another library, such as swift-foundation, rather than the stdlib.

Stability over Unicode versions

The concept of stability and "any version of Unicode" might be too nuanced for these doc comments.

Broadly speaking, it doesn't make much sense to talk about Unicode prior to 2.0 outside of extremely niche, archeological use cases. The stdlib shouldn't really bother with any Unicode processing prior to 3.0, as those versions permitted overlong encodings in UTF-8. Canonical equivalence only really makes sense for Unicode 3.1 and later, and the modern notion of normalization stability is a 4.1-and-later guarantee.

API feedback

Stability seems a little more niche. It's relevant for domains that may have an invariant over content that is held across different processes running different versions of Swift or using different Unicode implementations.

It seems like some kind of isFullyAssigned or similar query, and possibly one which could take a Unicode version, might make more sense. Stability is a current-version-and-future guarantee, and I'm not sure an API that doesn't take a version would fit the invariant-held-across-processes use case.
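For illustration, here is a rough sketch of what such a version-taking query could look like, built on the stdlib's existing `Unicode.Scalar.Properties.age`; the name `isFullyAssigned(asOf:)` is hypothetical, not a pitched API:

```swift
extension StringProtocol {
    /// Hypothetical query: whether every scalar in this text was already
    /// assigned as of the given Unicode version.
    func isFullyAssigned(asOf version: Unicode.Version) -> Bool {
        unicodeScalars.allSatisfy { scalar in
            // `age` is nil for unassigned scalars.
            guard let age = scalar.properties.age else { return false }
            return (age.major, age.minor) <= (version.major, version.minor)
        }
    }
}

print("café".isFullyAssigned(asOf: (major: 9, minor: 0))) // true
```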

We could also add initializers that decode and normalize the content in one go; that is, a `String(normalizing:as:in:)` counterpart to `String(decoding:as:)`.
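As a rough sketch of the shape (Foundation's NFC mapping stands in for the pitched normalization, the `in:` parameter is dropped, and the encoding is fixed to UTF-8; none of this is the final API):

```swift
import Foundation

extension String {
    // Hypothetical decode-and-normalize initializer; the real pitch would
    // normalize during decoding rather than in a second pass.
    init<C: Collection>(
        normalizing codeUnits: C, as encoding: UTF8.Type
    ) where C.Element == UInt8 {
        self = String(decoding: codeUnits, as: UTF8.self)
            .precomposedStringWithCanonicalMapping // NFC
    }
}

let bytes: [UInt8] = Array("e\u{301}".utf8)  // decomposed "é"
let normalized = String(normalizing: bytes, as: UTF8.self)
print(normalized.unicodeScalars.count)       // 1 (precomposed U+00E9)
```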

I'm not entirely sold on the normalized API on Character. I suppose it makes some kind of algebraic sense, but it seems like you should normalize the String itself instead.

I think we can come out and say that the preferred form is NFC.

We pitch an extension on Sequence<Unicode.Scalar>, but this could instead be an init on NormalizedScalars. For now I think sequences of Unicode.Scalar make sense, but in the future we'd want some kind of API to support sequences of validly encoded UTF-8 bytes.
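Sketched eagerly via a `String` round-trip and Foundation's NFD mapping (the pitched API would stream lazily; the name `nfdScalars` is made up):

```swift
import Foundation

extension Sequence where Element == Unicode.Scalar {
    // Sketch only: eagerly materializes the normalized scalars.
    var nfdScalars: [Unicode.Scalar] {
        var s = ""
        s.unicodeScalars.append(contentsOf: self)
        return Array(s.decomposedStringWithCanonicalMapping.unicodeScalars)
    }
}

let scalars = "\u{E9}".unicodeScalars.nfdScalars
print(scalars.map { String($0.value, radix: 16) }) // ["65", "301"]
```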

For `NFDNormalizer`, what about an overload that takes a `Unicode.Scalar`-producing closure? That way, a caller doesn't need to construct an `IteratorProtocol` conformance across a compilation boundary.
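A toy illustration of that shape, with an identity pass-through standing in for a real normalizer purely to show the call site; all names here are hypothetical:

```swift
struct ToyNormalizer {
    // A real normalizer would buffer, decompose, and canonically
    // reorder here; this toy just forwards the input.
    mutating func resume(
        scalarProducer: () -> Unicode.Scalar?
    ) -> Unicode.Scalar? {
        scalarProducer()
    }
}

var source = "e\u{301}".unicodeScalars.makeIterator()
var normalizer = ToyNormalizer()
// The caller drives the normalizer with a plain closure instead of
// handing over an iterator.
while let scalar = normalizer.resume(scalarProducer: { source.next() }) {
    print(scalar.value) // 101 (U+0065), 769 (U+0301)
}
```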

In addition to `String.init(_: Sequence<Unicode.Scalar>)`, what do you think of a `Character.init?(_: Sequence<Unicode.Scalar>)` that returns nil if the content is more than one grapheme cluster in length?
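A minimal sketch of that failable initializer, assuming the guard-on-count semantics described above:

```swift
extension Character {
    init?<S: Sequence>(_ scalars: S) where S.Element == Unicode.Scalar {
        var string = ""
        string.unicodeScalars.append(contentsOf: scalars)
        // Fail unless the scalars form exactly one grapheme cluster.
        guard string.count == 1 else { return nil }
        self.init(string)
    }
}

print(Character("e\u{301}".unicodeScalars) as Any) // Optional("é")
print(Character("ab".unicodeScalars) as Any)       // nil
```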

Another future direction would be protocols that abstract the normalization itself so that other libraries can plug in their own data tables or provide their own normal forms.

On July 14, 2024, milseman and others added 2 commits, including “Added some prose, alternatives, closure-taking API, and future direct…”
@lorentey (Member) commented

> I'm not entirely sold on the normalized API on Character. I suppose it makes some kind of algebraic sense, but it seems like you should normalize the String itself instead.

I'm also wondering about this.

From what I can tell, Unicode does not make any promises about the interaction between grapheme breaking and normalization. In fact, it encourages implementations to tailor the grapheme breaking rules to their liking, which would probably render all such promises toothless. (Swift itself implements non-standard rules, although hopefully they'll go away when we upgrade to Unicode 15.1. Until we decide to add more custom rules, that is -- for example, to match behavior between the Swift stdlib and the macOS/iOS text display/processing frameworks.)

Is it guaranteed that the position/number of grapheme breaks will not be affected by normalization?

If not, we probably cannot provide normalization APIs on Character. (Or at least we cannot have any API that promises to return a single Character instance.)

@karwa (Contributor, Author) commented Jul 21, 2024

The motivation for adding normalization to Character is that:

  • People do use it for storing text, including in sets and dictionaries.
  • It has UTF8/UTF16/Scalar views, so normalization observably alters the data in the character.
  • It's the only normalisation segment we currently expose, and it has some meaning to the person reading the text.

Unfortunately, Characters are usually small strings, which don't have an isNFC bit, so you're unlikely to see performance gains unless you make use of the UTF8/UTF16/Scalar views. If we could add the isNFC bit to small strings (e.g. on Linux/Windows, or future Apple platforms, or even current Apple platforms if there is some very clever way to squeeze it in), that would help us realise some significant gains.


Canonical normalisation should be safe to add to Character even though grapheme-breaking rules are not stable. Since a Character is just a String with length 1, the question can be rephrased: will grapheme-breaking ever see a different number of characters in two canonically-equivalent strings?

And I think it's clear that, whichever rules we use (and even allowing for tailoring), that should never happen. If it ever did, it would indicate a bug in grapheme breaking.
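The invariant is easy to spot-check in today's stdlib, since `String` equality is already canonical equivalence:

```swift
let nfc = "\u{E9}"    // "é" as one precomposed scalar
let nfd = "e\u{301}"  // "e" + U+0301 COMBINING ACUTE ACCENT

// Swift String equality is canonical equivalence.
assert(nfc == nfd)
// Canonically equivalent strings see the same grapheme breaks...
assert(nfc.count == 1 && nfd.count == 1)
// ...even though their scalar counts differ.
assert(nfc.unicodeScalars.count == 1 && nfd.unicodeScalars.count == 2)
```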

@lorentey (Member) commented Jul 22, 2024

> Canonical normalisation should be safe to add to Character even though grapheme-breaking rules are not stable. Since a Character is just a String with length 1, the question can be rephrased: will grapheme-breaking ever see a different number of characters in two canonically-equivalent strings?
>
> And I think it's clear that, whichever rules we use (and even allowing for tailoring), that should never happen. If it ever did, it would indicate a bug in grapheme breaking.

I do agree that it would very much be desirable if we could simply assume that (canonical) normalization will not affect grapheme breaking boundaries. But this isn't about hopes and dreams -- can we rely on current and future versions of Unicode to ensure this, or not? If not, then the stdlib must not make such guarantees on its API level.

I could not find a place anywhere in the Unicode standard where this is called out as a feature of the current definitions (is it?), much less any indication that this property is covered by a Unicode stability policy.

Please prove me wrong -- it would be a huge relief if this was an invariant we could trust. I would much prefer it if equal String instances always had matching counts: I would not like to work through the mind-boggling consequences of finding a counterexample. (For what it's worth, compatibility normalization definitely does not preserve grapheme break counts.)

FWIW, as of 15.1, the baseline grapheme cluster boundary rules rely on properties such as Indic_Conjunct_Break (or Indic_Syllabic_Category) that don't even seem to be subject to any stability constraints.

@milseman (Member) commented

While I've always found it to be true in practice that grapheme break boundaries are also normalization segment boundaries, and that normalizing a grapheme cluster would not introduce new grapheme breaks, I don't think we can rely on this. We should probably remove the normalization API on Character. Since Character is (in ABI) a String of length one, users can normalize the underlying String and recheck the count.
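That workaround might look like the following, with Foundation's `precomposedStringWithCanonicalMapping` standing in for the pitched `normalized(.nfc)`:

```swift
import Foundation

// Normalize via the underlying String, then re-check that the result is
// still a single grapheme cluster before converting back to Character.
func nfcNormalized(_ character: Character) -> Character? {
    let normalized = String(character).precomposedStringWithCanonicalMapping
    return normalized.count == 1 ? normalized.first! : nil
}

print(nfcNormalized("e\u{301}") as Any) // Optional("é")
```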

@karwa (Contributor, Author) commented Jul 25, 2024

@lorentey - Citations, as requested 😇

UAX #29, Unicode Text Segmentation:

> 2 Conformance
>
> [...]
>
> To maintain canonical equivalence, all of the following specifications are defined on text normalized in form NFD, as defined in Unicode Standard Annex #15, “Unicode Normalization Forms”. Boundaries never occur within a combining character sequence or conjoining sequence, so the boundaries within non-NFD text can be derived from corresponding boundaries in the NFD form of that text. For convenience, the default rules have been written so that they can be applied directly to non-NFD text and yield equivalent results. (This may not be the case with tailored default rules.)
>
> 3 Grapheme Cluster Boundaries
>
> [...]
>
> A key feature of Unicode grapheme clusters (both legacy and extended) is that they remain unchanged across all canonically equivalent forms of the underlying text. Thus the boundaries remain unchanged whether the text is in NFC or NFD. Using a grapheme cluster as the fundamental unit of matching thus provides a very clear and easily explained basis for canonically equivalent matching. This is important for applications from searching to regular expressions.
>
> 6.1 Normalization
>
> The boundary specifications are stated in terms of text normalized according to Normalization Form NFD (...). In practice, normalization of the input is not required. To ensure that the same results are returned for canonically equivalent text (that is, the same boundary positions will be found, although those may be represented by different offsets), the grapheme cluster boundary specification has the following features:
>
> • There is never a break within a sequence of nonspacing marks.
> • There is never a break between a base character and subsequent nonspacing marks.
>
> The specification also avoids certain problems by explicitly assigning the Extend property value to certain characters, such as U+09BE ( া ) BENGALI VOWEL SIGN AA, to deal with particular compositions.
>
> The other default boundary specifications never break within grapheme clusters, and they always use a consistent property value for each grapheme cluster as a whole.

I think we can rely on this - it seems the specifications go out of their way to ensure it holds. As it says, "using a grapheme cluster as the fundamental unit of matching thus provides a very clear and easily explained basis for canonically equivalent matching" -- this is only possible if boundaries are the same for canonically-equivalent strings.

I appreciate that it is less common/useful than normalising a String, but the conversion `String(myChar).normalized(.nfc).first!` is annoying and unintuitive, so if we can offer this, it would be nice to.

That said, if we're still unconvinced, I don't mind deferring it.

@milseman (Member) commented

TIL. That sounds like the guarantee we need. @lorentey, what do you think?
