Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove need for JSON validator for consts and enums #59

Merged
merged 2 commits into from
Nov 21, 2024

Conversation

hudson-ai
Copy link
Collaborator

Re-writes consts as other types, e.g. arrays with specific prefixItems, ints with minimum==maximum, etc. Enums are then written as anyOf(consts).

This lets us use our internal validation logic rather than relying on jsonschema.

Note: I also fixed the grammar-builder's handling of UnsatisfiableSchemaError in anyOf -- it now appropriately propagates only when there are no valid schemas.

@mmoskal
Copy link
Member

mmoskal commented Nov 21, 2024

Did you check that we get sensible grammars for enum: ["foo", "bar", "baz"] and enum: [1, 2, 3] and also const: 1 and const: "foo" ? in particular the regex for 1 should be /1/

I think we can live with select("foo", "bar", "baz") instead of lexeme("foo|bar|baz") for enum for now but would be good to fix at some point (compiler can detect all branches that are regexes and combine them)

@hudson-ai
Copy link
Collaborator Author

Did you check that we get sensible grammars for enum: ["foo", "bar", "baz"] and enum: [1, 2, 3] and also const: 1 and const: "foo" ? in particular the regex for 1 should be /1/

I think we can live with select("foo", "bar", "baz") instead of lexeme("foo|bar|baz") for enum for now but would be good to fix at some point (compiler can detect all branches that are regexes and combine them)

Currently just calling select for enums, so we get this for enum: ["foo", "bar", "baz"] (1,2,3 looks the same except they aren't treated as json_string lexemes)

[{'Join': {'sequence': [4],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'foo',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'bar',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'baz',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Select': {'among': [1, 2, 3],
      'max_tokens': None,
      'name': None,
      'capture_name': None}}]

@hudson-ai
Copy link
Collaborator Author

And consts will look the same but without the select. E.g. const: 1

[{'Join': {'sequence': [1],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': '(1)',
      'contextual': None,
      'temperature': None,
      'json_string': False,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}}]

@hudson-ai
Copy link
Collaborator Author

For more complex types, e.g. {"const": {"key": "value"}}, we are getting a full grammar out of it, even though we could in principle rewrite it as a regex...

[{'Join': {'sequence': [9],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Lexeme': {'rx': 'value',
      'contextual': None,
      'temperature': None,
      'json_string': True,
      'json_allowed_escapes': None,
      'json_raw': None,
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '"key"',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': ': ',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Join': {'sequence': [2, 3, 1],
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '{',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': ', ',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'String': {'literal': '}',
      'max_tokens': None,
      'name': None,
      'capture_name': None}},
    {'Join': {'sequence': [6, 5, 8],
      'max_tokens': None,
      'name': None,
      'capture_name': None}}]

@mmoskal
Copy link
Member

mmoskal commented Nov 21, 2024

wonderful! good to merge!

@hudson-ai hudson-ai merged commit f206ad9 into microsoft:main Nov 21, 2024
2 checks passed
@mmoskal
Copy link
Member

mmoskal commented Nov 21, 2024

I don't think there's much point rewriting general consts as regexes. The one I worry about is a very large enum with just strings in it (I think it's relatively common) - it's easy to do regex and it will avoid parser limits.

@hudson-ai
Copy link
Collaborator Author

I don't think there's much point rewriting general consts as regexes. The one I worry about is a very large enum with just strings in it (I think it's relatively common) - it's easy to do regex and it will avoid parser limits.

Totally. And I like the idea of detecting this kind of situation in the builder itself.

@mmoskal
Copy link
Member

mmoskal commented Nov 21, 2024

you mean GrammarBuilder? I think it's not always valid (though it would be valid for JSON)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants