-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved JSON Schema validation for datasets #10543
base: develop
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
} | ||
} | ||
private static void throwValidationException(String key, List<String> argList) { | ||
throw new ValidationException(BundleUtil.getStringFromBundle("schema.validation.exception." + key, argList)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I am trying to figure out why an error message is appearing I often search the code with the full bundle key implied by the error message. This kind of split, hardcoding only part of the key instead of the whole key makes that difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key is dynamic. It could be a generic error message regardless of the specific field in question or it could be specific to a field.
ie. schema.validation.exception.value.missing is specific to the 'value' field where schema.validation.exception.dataset.required.missing is for any required field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I get why you are doing it this way - it's just that sometimes when I find myself debugging a production issue I will start with an error message or UI prompt and then find that verbiage in the bundle. Then I will search the code with the entire bundle key to see where in the code the message is coming from. When you split it like that the search will come up empty and I'll have to guess how much of the key to search on. Also some keys might end the same way. We can leave it like this, I'll just have to remind myself that there are times when I'll need to search with a partial bundle key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with this. I'm much more likely to search for the full schema.validation.exception.value.missing
than the partial value.missing
when hunting down a bug.
Raman used to do the same thing, used to not use the full key.
I can live with it for this PR but maybe we should write it up in our coding style page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work here, just a style suggestion and a possible doc typo
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the IT tests and I think this PR adds a lot of value. I'm looking forward to real users providing feedback on it.
However, while testing, I found a couple things I think we should fix before merging.
Also, I'm a bit confused. Shouldn't the JSON Schema output (which we also capture at doc/sphinx-guides/source/_static/api/dataset-schema.json for the default case) get longer and more nuanced as we better define our rules? I thought the logic would go into the sample JSON Schema file itself. Maybe I'm missing something fundamental.
And we already have validation when datasets are created. Is it flawed? (Probably.) Should we replace it with the new logic in this pull request?
I'll keep playing a bit in the few minutes I have before I go on vacation but I'm going to kick this back to "in progress" so @stevenwinship can take a look. No objections if someone merges this (even as-is) when I'm gone!
static String rawSchema() { | ||
return """ | ||
{ | ||
"$schema": "http://json-schema.org/draft-04/schema#", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same content as doc/sphinx-guides/source/_static/api/dataset-schema.json
Can we read in that file rather than duplicating the content here in a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed to read from the existing file
@@ -3025,7 +3025,18 @@ api.errors.invalidApiToken=Invalid API token. | |||
api.ldninbox.citation.alert={0},<br><br> The {1} has just been notified that the {2}, <a href=''{3}''>{3}</a>, cites "<a href=''{4}/dataset.xhtml?persistentId={5}''>{6}</a>" in this repository. | |||
api.ldninbox.citation.subject={0}: A Dataset Citation has been reported! | |||
|
|||
#Schema Validation | |||
schema.validation.exception.value.missing=Invalid data for key:{0} typeName:{1}. 'value' missing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I like how the errors look. From the IT test run:
- "Controlled vocabulary for key:fields typeName:language value:'badlang' is not valid."
- "Invalid data for key:fields typeName:title. Fields with 'multiple' set to true must be a list."
- "Invalid data for key:fields typeName:language. Found value as list but 'multiple' is set to false."
- "Compound value datasetContactName must match typeName of the object. Found datasetContactNme"
- "Invalid data for key:datasetContact typeName:datasetContactNotAllowed. Only datasetContactName, datasetContactAffiliation, datasetContactEmail allowed."
However, the double space shown in all these errors bothers me a bit (after with the following error:
):
"message": "The Dataset JSON provided failed validation with the following error: Controlled...
"message": "The Dataset JSON provided failed validation with the following error: Invalid data...
Can we reduce it down to a single space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the extra space
} | ||
} | ||
private static void throwValidationException(String key, List<String> argList) { | ||
throw new ValidationException(BundleUtil.getStringFromBundle("schema.validation.exception." + key, argList)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with this. I'm much more likely to search for the full schema.validation.exception.value.missing
than the partial value.missing
when hunting down a bug.
Raman used to do the same thing, used to not use the full key.
I can live with it for this PR but maybe we should write it up in our coding style page.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
📦 Pushed preview images as
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name. |
Overview of the Feature Request
In version 6.1 we added a json schema for validating dataset json upload files. The first release only verifies the presence of valid json formatting as well as required elements and required fields (customized by collection). Enhancements which have been requested or should be contemplated.
What kind of user is the feature intended for?
(Example users roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)
Users with edit dataset permission who wish to validate their dataset json prior to upload.
What inspired the request?
responses to the first release of the dataset schema which were made too late to be included in 6.1
What existing behavior do you want changed?
the get dataset schema api and validator
Which issue(s) this PR closes: #10169
Closes #10169
Special notes for your reviewer:
Suggestions on how to test this:
Does this PR introduce a user interface change? If mockups are available, please link/include them here: No
Is there a release notes update needed for this change?: Yes. Included
Additional documentation: native-api.rst
Preview at https://dataverse-guide--10543.org.readthedocs.build/en/10543/api/native-api.html#validate-dataset-json-file-for-a-collection