Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default behavior of json generation is likely more verbose than users expect #1050

Open
hudson-ai opened this issue Oct 15, 2024 · 3 comments
Assignees

Comments

@hudson-ai
Copy link
Collaborator

hudson-ai commented Oct 15, 2024

We currently respect JSON Schema's semantics around the additionalProperties keyword; i.e. leaving it unset is interpreted as "any property not specified by the properties keyword (1) is allowed and (2) has no restrictions on its value (other than being valid JSON)".

These semantics are useful (see #887), and I think we should continue respecting them.

That being said, I also think that the majority of our users will expect the LLM to only produce the properties that were explicitly requested. Anything "extra" costs the user time and money, and it will likely be thrown away.

I recommend the following solution:

  1. Recommend that all users use the pydantic interface (passing a BaseModel subclass to the json generation function) UNLESS they want the extra fine-grained control that directly passing a JSON object provides.
  2. Write a custom BaseModel.model_json_schema implementation that sets additionalProperties to False and use it to convert the BaseModel to a JSON schema we'll generate against.

I feel far more comfortable being "opinionated" when our users use the high-level pydantic interface rather than the low-level JSON interface.

Additional notes:

  • currently users CAN get this behavior from pydantic if they add model_config = ConfigDict(extra="forbid") to their BaseModel
  • extra is allowed to take on the following values: "ignore" (default), "allow", "forbid"
  • when extra is "ignore", any additional properties will be discarded when validating an instance, while "allow" will propagate their values to the constructed instance
  • it should be safe to coerce "ignore" to "forbid" since we know anything extra is going to be discarded anyway
@hudson-ai hudson-ai self-assigned this Oct 15, 2024
@hudson-ai
Copy link
Collaborator Author

@JC1DA @Harsha-Nori ping for visibility

@Harsha-Nori
Copy link
Collaborator

I think of json as a bit of a "higher level" guidance function (e.g. in contrast with select), and therefore think it's OK to have reasonable but configurable defaults across both the high-level pydantic and slightly lower-level "JSON schema" interfaces.

We've chatted on this, but to continue the discussion openly, my gut feel is that some well-named kwarg on guidance.json that sets the "strict" or "ignore" style behavior is the right thing to do, and let users toggle it to match "true" json schema validation semantics maybe? I agree with your assessment that most people would be surprised to receive non-specified keys back from their constrained schema generation, so we should plan on making that opt-in. Not quite sure on the arg naming yet.

@hudson-ai
Copy link
Collaborator Author

@Harsha-Nori I was swayed to your side after talking, but thank you for documenting here! Currently thinking something like strict_properties, but still undecided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants