Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved JSON Schema validation for datasets #10543

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

stevenwinship
Copy link
Contributor

@stevenwinship stevenwinship commented May 7, 2024

Overview of the Feature Request
In version 6.1 we added a json schema for validating dataset json upload files. The first release only verifies the presence of valid json formatting as well as required elements and required fields (customized by collection). Enhancements which have been requested or should be contemplated.

  • Controlled vocabulary support - get valid responses and verify that it/they are present (validate that there may be only one response or multiple responses as specified by the datasetfieldtype)
  • field type checking - validate that the values provided conform to the type specified by the datasetfieldtype)
  • enhanced error messages - the current validation library does not provide helpful error messages for every exception. See if there are other ways to provide useful information to the user.

What kind of user is the feature intended for?
(Example users roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)
Users with edit dataset permission who wish to validate their dataset json prior to upload.

What inspired the request?
responses to the first release of the dataset schema which were made too late to be included in 6.1

What existing behavior do you want changed?
the get dataset schema api and validator
Which issue(s) this PR closes: #10169

Closes #10169

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: Yes. Included

Additional documentation: native-api.rst

Preview at https://dataverse-guide--10543.org.readthedocs.build/en/10543/api/native-api.html#validate-dataset-json-file-for-a-collection

@stevenwinship stevenwinship self-assigned this May 7, 2024
@stevenwinship stevenwinship added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label May 7, 2024
@coveralls
Copy link

coveralls commented May 7, 2024

Coverage Status

coverage: 20.627% (+0.05%) from 20.574%
when pulling 6e41658 on 10169-JSON-schema-validation
into 5bf6b6d on develop.

This comment has been minimized.

@cmbz cmbz added Size: 50 A percentage of a sprint. 35 hours. and removed Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels May 8, 2024

This comment has been minimized.

This comment has been minimized.

1 similar comment

This comment has been minimized.

@stevenwinship stevenwinship removed their assignment May 10, 2024
@sekmiller sekmiller self-assigned this May 29, 2024
}
}
private static void throwValidationException(String key, List<String> argList) {
throw new ValidationException(BundleUtil.getStringFromBundle("schema.validation.exception." + key, argList));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I am trying to figure out why an error message is appearing I often search the code with the full bundle key implied by the error message. This kind of split, hardcoding only part of the key instead of the whole key makes that difficult.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key is dynamic. It could be a generic error message regardless of the specific field in question or it could be specific to a field.
ie. schema.validation.exception.value.missing is specific to the 'value' field where schema.validation.exception.dataset.required.missing is for any required field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I get why you are doing it this way - it's just that sometimes when I find myself debugging a production issue I will start with an error message or UI prompt and then find that verbiage in the bundle. Then I will search the code with the entire bundle key to see where in the code the message is coming from. When you split it like that the search will come up empty and I'll have to guess how much of the key to search on. Also some keys might end the same way. We can leave it like this, I'll just have to remind myself that there are times when I'll need to search with a partial bundle key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with this. I'm much more likely to search for the full schema.validation.exception.value.missing than the partial value.missing when hunting down a bug.

Raman used to do the same thing, used to not use the full key.

I can live with it for this PR but maybe we should write it up in our coding style page.

Copy link
Contributor

@sekmiller sekmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here, just a style suggestion and a possible doc typo

This comment has been minimized.

Copy link
Contributor

@sekmiller sekmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@sekmiller sekmiller removed their assignment Jun 7, 2024
@pdurbin pdurbin self-assigned this Jun 10, 2024

This comment has been minimized.

@pdurbin pdurbin changed the title Json dataset validation Improved JSON Schema validation for datasets Jun 10, 2024

This comment has been minimized.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the IT tests and I think this PR adds a lot of value. I'm looking forward to real users providing feedback on it.

However, while testing, I found a couple things I think we should fix before merging.

Also, I'm a bit confused. Shouldn't the JSON Schema output (which we also capture at doc/sphinx-guides/source/_static/api/dataset-schema.json for the default case) get longer and more nuanced as we better define our rules? I thought the logic would go into the sample JSON Schema file itself. Maybe I'm missing something fundamental.

And we already have validation when datasets are created. Is it flawed? (Probably.) Should we replace it with the new logic in this pull request?

I'll keep playing a bit in the few minutes I have before I go on vacation but I'm going to kick this back to "in progress" so @stevenwinship can take a look. No objections if someone merges this (even as-is) when I'm gone!

Comment on lines 32 to 35
static String rawSchema() {
return """
{
"$schema": "http://json-schema.org/draft-04/schema#",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same content as doc/sphinx-guides/source/_static/api/dataset-schema.json

Can we read in that file rather than duplicating the content here in a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed to read from the existing file

@@ -3025,7 +3025,18 @@ api.errors.invalidApiToken=Invalid API token.
api.ldninbox.citation.alert={0},<br><br> The {1} has just been notified that the {2}, <a href=''{3}''>{3}</a>, cites "<a href=''{4}/dataset.xhtml?persistentId={5}''>{6}</a>" in this repository.
api.ldninbox.citation.subject={0}: A Dataset Citation has been reported!

#Schema Validation
schema.validation.exception.value.missing=Invalid data for key:{0} typeName:{1}. 'value' missing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I like how the errors look. From the IT test run:

  • "Controlled vocabulary for key:fields typeName:language value:'badlang' is not valid."
  • "Invalid data for key:fields typeName:title. Fields with 'multiple' set to true must be a list."
  • "Invalid data for key:fields typeName:language. Found value as list but 'multiple' is set to false."
  • "Compound value datasetContactName must match typeName of the object. Found datasetContactNme"
  • "Invalid data for key:datasetContact typeName:datasetContactNotAllowed. Only datasetContactName, datasetContactAffiliation, datasetContactEmail allowed."

However, the double space shown in all these errors bothers me a bit (after with the following error:):

"message": "The Dataset JSON provided failed validation with the following error:  Controlled...
"message": "The Dataset JSON provided failed validation with the following error:  Invalid data...

Can we reduce it down to a single space?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the extra space

}
}
private static void throwValidationException(String key, List<String> argList) {
throw new ValidationException(BundleUtil.getStringFromBundle("schema.validation.exception." + key, argList));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with this. I'm much more likely to search for the full schema.validation.exception.value.missing than the partial value.missing when hunting down a bug.

Raman used to do the same thing, used to not use the full key.

I can live with it for this PR but maybe we should write it up in our coding style page.

@pdurbin pdurbin assigned stevenwinship and unassigned pdurbin Jun 14, 2024

This comment has been minimized.

This comment has been minimized.

@stevenwinship stevenwinship removed their assignment Jun 17, 2024
Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10169-JSON-schema-validation
ghcr.io/gdcc/configbaker:10169-JSON-schema-validation

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 50 A percentage of a sprint. 35 hours.
Projects
Status: Ready for QA ⏩
Development

Successfully merging this pull request may close these issues.

Feature Request/Idea: JSON Schema Enhancements
5 participants