Default behavior of `json` generation is likely more verbose than users expect #1050

hudson-ai · 2024-10-15T00:38:53Z

We currently respect JSON Schema's semantics around the `additionalProperties` keyword; i.e. leaving it unset is interpreted as "any property not specified by the `properties` keyword (1) is allowed and (2) has no restrictions on its value (other than being valid JSON)".
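To make those semantics concrete, here is a minimal stdlib-only sketch. The helper names `undeclared_properties` and `extras_permitted` are hypothetical (they are not part of guidance or any JSON Schema library), and the sketch deliberately ignores `patternProperties` and other keywords that a real validator would consider.

```python
def undeclared_properties(schema: dict, instance: dict) -> list:
    """Return the keys of `instance` that the schema's `properties` keyword does not name."""
    declared = schema.get("properties", {})
    return [key for key in instance if key not in declared]


def extras_permitted(schema: dict) -> bool:
    """True unless the schema explicitly sets additionalProperties to false.

    Leaving the keyword unset behaves like `true`: extras are allowed.
    """
    return schema.get("additionalProperties", True) is not False


# Hypothetical schema: only "name" is declared, additionalProperties is unset.
person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

# A generated object with two properties nobody asked for.
generated = {"name": "Ada", "hobby": "chess", "notes": [1, 2, 3]}

extras = undeclared_properties(person_schema, generated)
allowed_by_default = extras_permitted(person_schema)

# The same schema with the keyword set explicitly rejects those extras.
strict_schema = {**person_schema, "additionalProperties": False}
allowed_when_strict = extras_permitted(strict_schema)
```

Under the unset default, `generated` is a valid instance even though two of its three properties were never requested, which is the verbosity this issue is about.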

These semantics are useful (see #887), and I think we should continue respecting them.

That being said, I also think that the majority of our users will expect the LLM to only produce the properties that were explicitly requested. Anything "extra" costs the user time and money, and it will likely be thrown away.

I recommend the following solution:

1. Recommend that all users use the pydantic interface (passing a `BaseModel` subclass to the `json` generation function) UNLESS they want the extra fine-grained control that directly passing a JSON object provides.
2. Write a custom `BaseModel.model_json_schema` implementation that sets `additionalProperties` to `False` and use it to convert the `BaseModel` to the JSON schema we'll generate against.
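One way to picture the second step is as a post-processing pass over the schema dict. The sketch below is an assumption, not the implementation proposed here: `forbid_extra_properties` is a hypothetical helper, it works on a plain dict such as the one `model_json_schema()` returns, and it only walks a handful of common keywords.

```python
import copy


def forbid_extra_properties(schema: dict) -> dict:
    """Return a copy of `schema` where object schemas that leave
    additionalProperties unset get additionalProperties=False."""
    closed = copy.deepcopy(schema)  # never mutate the caller's schema
    _close_objects(closed)
    return closed


def _close_objects(node) -> None:
    if not isinstance(node, dict):
        return  # booleans, lists, None: nothing to close
    if node.get("type") == "object" or "properties" in node:
        # setdefault keeps any explicit value the user already chose
        node.setdefault("additionalProperties", False)
    # Keywords whose value maps names to subschemas
    for keyword in ("properties", "$defs", "definitions"):
        for sub in node.get(keyword, {}).values():
            _close_objects(sub)
    # Keywords whose value is a list of subschemas
    for keyword in ("anyOf", "oneOf", "prefixItems"):
        for sub in node.get(keyword, []):
            _close_objects(sub)
    # Keywords whose value is a single subschema (or a boolean)
    for keyword in ("items", "additionalProperties"):
        _close_objects(node.get(keyword))


# Hypothetical input schema with a nested object and an explicit opt-in.
example = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "tags": {
            "type": "array",
            "items": {"type": "object", "properties": {"label": {"type": "string"}}},
        },
    },
    "$defs": {
        "Address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "additionalProperties": True,
        }
    },
}

closed_example = forbid_extra_properties(example)
```

Using `setdefault` means a schema that explicitly opts in to extra properties keeps that choice; only the unset case is tightened.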

I feel far more comfortable being "opinionated" when our users use the high-level pydantic interface rather than the low-level JSON interface.

Additional notes:

- Currently users CAN get this behavior from pydantic if they add `model_config = ConfigDict(extra="forbid")` to their `BaseModel`.
- `extra` is allowed to take on the following values: `"ignore"` (default), `"allow"`, `"forbid"`.
- When `extra` is `"ignore"`, any additional properties will be discarded when validating an instance, while `"allow"` will propagate their values to the constructed instance.
- It should be safe to coerce `"ignore"` to `"forbid"`, since we know anything extra is going to be discarded anyway.
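The difference between `"ignore"` and `"forbid"` can be seen directly in pydantic v2. This is a small sketch; the model names `LooseUser` and `StrictUser` and the payload are made up for illustration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LooseUser(BaseModel):
    # No model_config, so extra falls back to its default, "ignore"
    name: str


class StrictUser(BaseModel):
    model_config = ConfigDict(extra="forbid")
    name: str


payload = {"name": "Ada", "nickname": "countess"}

# "ignore": validation succeeds and the unknown key is silently dropped
loose_dump = LooseUser.model_validate(payload).model_dump()

# "forbid": the unknown key is a validation error
try:
    StrictUser.model_validate(payload)
    strict_rejected = False
except ValidationError:
    strict_rejected = True

# Only the "forbid" model carries additionalProperties=False in its schema
loose_schema = LooseUser.model_json_schema()
strict_schema = StrictUser.model_json_schema()
```

Since the `"ignore"` model throws the extra key away at validation time, generating it in the first place only spends tokens, which is the argument for treating `"ignore"` like `"forbid"` during generation.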


hudson-ai · 2024-10-15T00:40:30Z

@JC1DA @Harsha-Nori ping for visibility

Harsha-Nori · 2024-10-18T20:52:46Z

I think of json as a bit of a "higher level" guidance function (e.g. in contrast with select), and therefore think it's OK to have reasonable but configurable defaults across both the high-level pydantic and slightly lower-level "JSON schema" interfaces.

We've chatted about this, but to continue the discussion openly: my gut feel is that a well-named kwarg on guidance.json that sets the "strict" or "ignore" style behavior is the right thing to do, with users able to toggle it to match "true" JSON Schema validation semantics. I agree with your assessment that most people would be surprised to receive non-specified keys back from their constrained schema generation, so we should plan on making those extra keys opt-in. Not quite sure on the arg naming yet.

hudson-ai · 2024-10-18T21:16:26Z

@Harsha-Nori I was swayed to your side after talking, but thank you for documenting here! Currently thinking something like strict_properties, but still undecided.

hudson-ai self-assigned this Oct 15, 2024

hudson-ai mentioned this issue Oct 29, 2024

[JSON] Add strict_properties kwarg to guidance.json to make JSON output terser by default #1068

Closed
