Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat UTF-16 surrogate pairs as single characters for string min/maxLength #88

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

hudson-ai
Copy link
Collaborator

Makes the following test pass:

  {
      "description": "minLength validation",
      "schema": {
          "$schema": "https://json-schema.org/draft/2020-12/schema",
          "minLength": 2
      },
      "tests": [
          {
              "description": "one grapheme is not long enough",
              "data": "\uD83D\uDCA9",
              "valid": false
          }
      ]
  }

@hudson-ai hudson-ai requested a review from mmoskal December 10, 2024 23:25
}

fn json_simple_string(&mut self) -> NodeRef {
self.lexeme(&format!("\"{}*\"", CHAR_REGEX))
self.lexeme("(?s:.*)", true)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this change is orthogonal to the underlying PR -- @mmoskal if using the CHAR_REGEX directly is marginally more performant, I can switch it back.

)))
Ok(self.lexeme(
&format!(
"(?s:.{{{},{}}})",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

derivre seems smart enough to match \uD83D\uDCA9 with ., so length is counted appropriately

@mmoskal
Copy link
Collaborator

mmoskal commented Dec 11, 2024

The JSON-quoted derivre strings do not allow \uxxxx escapes except for \x00xx. The assumption is that everything else will be emitted as UTF-8.

It's all the same as far as JSON goes (when you read JSON {"x":"ł"} or {"x":"\u0142"} you get exactly the same object in memory), but I don't know if we want to let the model output the general \uxxxx sequences in non-regex case. I would say not just to be consistent. If we do allow it, we would also need to check if the surrogate pairs with \xD8xx are output correctly (that is the high surrogate only ever occurs before low surrogate and vice versa).

Added comment here microsoft/derivre@6062cef

Some general notes (from you-know-who):

UTF-8 in JSON
• JSON strings must be encoded in UTF-8 by default.
• Any Unicode character can appear directly in a JSON string if the encoding supports it, such as "\u1234" for U+1234.
• Non-ASCII characters are often escaped for compatibility using Unicode escapes (\u).

Surrogate Pairs in JSON
• Surrogate pairs are used to represent characters outside the Basic Multilingual Plane (BMP) (U+10000 to U+10FFFF) in UTF-16.
• A surrogate pair consists of two code units:
• High surrogate: \uD800 to \uDBFF.
• Low surrogate: \uDC00 to \uDFFF.
• These pairs combine to encode a single Unicode code point.
• E.g., the code point U+1F600 (😀) is encoded as \uD83D\uDE00.

@mmoskal
Copy link
Collaborator

mmoskal commented Dec 11, 2024

Oh and BTW the test will pass if do json.dumps(..., ensure_ascii=False)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants