Treat UTF-16 surrogate pairs as single characters for string min/maxLength #88

hudson-ai · 2024-12-10T23:25:31Z

Makes the following test pass:

  {
      "description": "minLength validation",
      "schema": {
          "$schema": "https://json-schema.org/draft/2020-12/schema",
          "minLength": 2
      },
      "tests": [
          {
              "description": "one grapheme is not long enough",
              "data": "\uD83D\uDCA9",
              "valid": false
          }
      ]
  }
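For context, the test data is a single code point (U+1F4A9) written as a UTF-16 surrogate pair. A minimal Python sketch of the two ways of counting it, assuming code points are the unit minLength is meant to count; the helper names are illustrative and not part of this PR:

```python
import json

def code_point_length(s: str) -> int:
    # Python strings are sequences of code points, so len() counts them directly.
    return len(s)

def utf16_unit_length(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; characters outside the BMP take two units.
    return len(s.encode("utf-16-le")) // 2

data = json.loads('"\\uD83D\\uDCA9"')  # the "data" value from the test above
print(code_point_length(data))  # 1
print(utf16_unit_length(data))  # 2
```

Counted in code points the string has length 1, so it should fail minLength 2; counted in UTF-16 code units it would wrongly pass.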

…icode code points/graphemes

hudson-ai · 2024-12-10T23:27:19Z

parser/src/json/compiler.rs

    }

    fn json_simple_string(&mut self) -> NodeRef {
-        self.lexeme(&format!("\"{}*\"", CHAR_REGEX))
+        self.lexeme("(?s:.*)", true)


Note that this change is orthogonal to the underlying PR -- @mmoskal if using the CHAR_REGEX directly is marginally more performant, I can switch it back.

hudson-ai · 2024-12-10T23:29:07Z

parser/src/json/compiler.rs

-            )))
+            Ok(self.lexeme(
+                &format!(
+                    "(?s:.{{{},{}}})",


derivre seems smart enough to match \uD83D\uDCA9 with a single ., so the length is counted in code points rather than UTF-16 code units
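A rough illustration of the same idea, using Python's re module as a stand-in (this is not derivre, and its behavior is only assumed to be analogous here). The hypothetical length_pattern helper mirrors the shape of the format string in the diff:

```python
import re
from typing import Optional

def length_pattern(min_len: int, max_len: Optional[int]) -> str:
    # Same shape as the format string in the diff: (?s:.{min,max}).
    # An omitted maximum leaves the upper bound open: (?s:.{min,}).
    upper = "" if max_len is None else str(max_len)
    return f"(?s:.{{{min_len},{upper}}})"

def matches_length(s: str, min_len: int, max_len: Optional[int] = None) -> bool:
    # fullmatch requires the whole string to fit the quantified pattern.
    return re.fullmatch(length_pattern(min_len, max_len), s) is not None

emoji = "\U0001F4A9"  # one code point, two UTF-16 code units
print(length_pattern(2, 5))          # (?s:.{2,5})
print(matches_length(emoji, 2))      # False
print(matches_length(emoji * 2, 2))  # True
```

Because . consumes one code point per repetition, the quantifier bounds act directly as minLength and maxLength.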

mmoskal · 2024-12-11T10:50:36Z

The JSON-quoted derivre strings do not allow \uXXXX escapes, except for \u00XX. The assumption is that everything else will be emitted as UTF-8.

It's all the same as far as JSON goes (when you read JSON {"x":"ł"} or {"x":"\u0142"} you get exactly the same object in memory), but I don't know if we want to let the model output general \uXXXX sequences in the non-regex case. I would say not, just to be consistent. If we do allow it, we would also need to check that surrogate pairs (\uD8xx and so on) are output correctly, that is, that a high surrogate only ever occurs immediately before a low surrogate, and vice versa.
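A sketch of what such a pairing check could look like, written in Python over a sequence of UTF-16 code units. The function name is hypothetical and this is not how derivre or this parser implements anything:

```python
import json
from typing import Sequence

def surrogates_well_formed(units: Sequence[int]) -> bool:
    """Return True if every high surrogate is immediately followed by a low
    surrogate and no low surrogate appears on its own."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate without a preceding high surrogate.
            return False
        else:
            i += 1
    return True

# The escaped and literal spellings decode to the same object.
same = json.loads('{"x":"ł"}') == json.loads('{"x":"\\u0142"}')
print(same)                                      # True
print(surrogates_well_formed([0xD83D, 0xDCA9]))  # True
print(surrogates_well_formed([0xD83D]))          # False
```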

Added a comment here: microsoft/derivre@6062cef

Some general notes (from you-know-who):

UTF-8 in JSON
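• JSON text exchanged between systems must be encoded in UTF-8 (RFC 8259).
• Any Unicode character other than the quotation mark, the backslash, and the control characters (U+0000 to U+001F) can appear directly in a JSON string. Any character can also be written as an escape, such as "\u1234" for U+1234.
• Non-ASCII characters are often escaped for compatibility using Unicode escapes (\uXXXX).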

Surrogate Pairs in JSON
• Surrogate pairs are used to represent characters outside the Basic Multilingual Plane (BMP) (U+10000 to U+10FFFF) in UTF-16.
• A surrogate pair consists of two code units:
  • High surrogate: \uD800 to \uDBFF.
  • Low surrogate: \uDC00 to \uDFFF.
• These pairs combine to encode a single Unicode code point.
• E.g., the code point U+1F600 (😀) is encoded as \uD83D\uDE00.
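The arithmetic behind that example can be sketched in Python as follows. The function names are illustrative; the constants are the standard UTF-16 ones:

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000    # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    # Inverse of the above: recombine the two 10-bit halves.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(f"\\u{high:04X}\\u{low:04X}")  # \uD83D\uDE00
```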

mmoskal · 2024-12-11T10:53:57Z

Oh and BTW the test will pass if you do json.dumps(..., ensure_ascii=False)
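For reference, a small Python sketch of what that flag changes in the serialized text. It only illustrates the standard library behavior and makes no claim about how the test harness calls it:

```python
import json

s = "\U0001F4A9"

escaped = json.dumps(s)                  # default: ensure_ascii=True
raw = json.dumps(s, ensure_ascii=False)  # non-ASCII characters emitted as-is

print(escaped)  # "\ud83d\udca9"
print(raw)      # "💩"

# Both spellings decode back to the same one-code-point string.
roundtrip_ok = json.loads(escaped) == json.loads(raw) == s
print(roundtrip_ok)  # True
```

With the default, the character is written as a \u escape pair; with ensure_ascii=False it is written as a single character that is then encoded as UTF-8.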

hudson-ai added 3 commits December 10, 2024 15:06

rely on derivre to properly treat UTF-16 surrogate pairs as single un…

080d15c

…icode code points/graphemes

use all applicable args for caching

1b7e5ff

clean up lexeme caching a bit

0c78638

hudson-ai requested a review from mmoskal December 10, 2024 23:25
