Improved JSON Schema validation for datasets #10543

stevenwinship · 2024-05-07T17:43:42Z

Overview of the Feature Request
In version 6.1 we added a json schema for validating dataset json upload files. The first release only verifies the presence of valid json formatting as well as required elements and required fields (customized by collection). Enhancements which have been requested or should be contemplated.

Controlled vocabulary support - get valid responses and verify that it/they are present (validate that there may be only one response or multiple responses as specified by the datasetfieldtype)
field type checking - validate that the values provided conform to the type specified by the datasetfieldtype)
enhanced error messages - the current validation library does not provide helpful error messages for every exception. See if there are other ways to provide useful information to the user.

What kind of user is the feature intended for?
(Example users roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)
Users with edit dataset permission who wish to validate their dataset json prior to upload.

What inspired the request?
responses to the first release of the dataset schema which were made too late to be included in 6.1

What existing behavior do you want changed?
the get dataset schema api and validator
Which issue(s) this PR closes: #10169

Closes #10169

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: Yes. Included

Additional documentation: native-api.rst

Preview at https://dataverse-guide--10543.org.readthedocs.build/en/10543/api/native-api.html#validate-dataset-json-file-for-a-collection

coveralls · 2024-05-07T17:49:28Z

coverage: 20.627% (+0.05%) from 20.574%
when pulling 6e41658 on 10169-JSON-schema-validation
into 5bf6b6d on develop.

sekmiller · 2024-05-31T14:30:02Z

src/main/java/edu/harvard/iq/dataverse/validation/JSONDataValidation.java

+ }
+ }
+ private static void throwValidationException(String key, List<String> argList) {
+ throw new ValidationException(BundleUtil.getStringFromBundle("schema.validation.exception." + key, argList));


When I am trying to figure out why an error message is appearing I often search the code with the full bundle key implied by the error message. This kind of split, hardcoding only part of the key instead of the whole key makes that difficult.

The key is dynamic. It could be a generic error message regardless of the specific field in question or it could be specific to a field.
ie. schema.validation.exception.value.missing is specific to the 'value' field where schema.validation.exception.dataset.required.missing is for any required field.

No, I get why you are doing it this way - it's just that sometimes when I find myself debugging a production issue I will start with an error message or UI prompt and then find that verbiage in the bundle. Then I will search the code with the entire bundle key to see where in the code the message is coming from. When you split it like that the search will come up empty and I'll have to guess how much of the key to search on. Also some keys might end the same way. We can leave it like this, I'll just have to remind myself that there are times when I'll need to search with a partial bundle key.

Yeah, I agree with this. I'm much more likely to search for the full schema.validation.exception.value.missing than the partial value.missing when hunting down a bug.

Raman used to do the same thing, used to not use the full key.

I can live with it for this PR but maybe we should write it up in our coding style page.

doc/release-notes/10169-JSON-schema-validation.md

sekmiller

Great work here, just a style suggestion and a possible doc typo

sekmiller

Looks good.

pdurbin

I ran the IT tests and I think this PR adds a lot of value. I'm looking forward to real users providing feedback on it.

However, while testing, I found a couple things I think we should fix before merging.

Also, I'm a bit confused. Shouldn't the JSON Schema output (which we also capture at doc/sphinx-guides/source/_static/api/dataset-schema.json for the default case) get longer and more nuanced as we better define our rules? I thought the logic would go into the sample JSON Schema file itself. Maybe I'm missing something fundamental.

And we already have validation when datasets are created. Is it flawed? (Probably.) Should we replace it with the new logic in this pull request?

I'll keep playing a bit in the few minutes I have before I go on vacation but I'm going to kick this back to "in progress" so @stevenwinship can take a look. No objections if someone merges this (even as-is) when I'm gone!

pdurbin · 2024-06-14T19:50:51Z

src/test/java/edu/harvard/iq/dataverse/validation/JSONDataValidationTest.java

+ static String rawSchema() {
+ return """
+ {
+ "$schema": "http://json-schema.org/draft-04/schema#",


This is the same content as doc/sphinx-guides/source/_static/api/dataset-schema.json

Can we read in that file rather than duplicating the content here in a test?

fixed to read from the existing file

pdurbin · 2024-06-14T20:25:44Z

src/main/java/propertyFiles/Bundle.properties

@@ -3025,7 +3025,18 @@ api.errors.invalidApiToken=Invalid API token.
 api.ldninbox.citation.alert={0},<br><br> The {1} has just been notified that the {2}, <a href=''{3}''>{3}</a>, cites "<a href=''{4}/dataset.xhtml?persistentId={5}''>{6}</a>" in this repository.
 api.ldninbox.citation.subject={0}: A Dataset Citation has been reported!

+#Schema Validation
+schema.validation.exception.value.missing=Invalid data for key:{0} typeName:{1}. 'value' missing.


Overall, I like how the errors look. From the IT test run:

"Controlled vocabulary for key:fields typeName:language value:'badlang' is not valid."

"Invalid data for key:fields typeName:title. Fields with 'multiple' set to true must be a list."

"Invalid data for key:fields typeName:language. Found value as list but 'multiple' is set to false."

"Compound value datasetContactName must match typeName of the object. Found datasetContactNme"

"Invalid data for key:datasetContact typeName:datasetContactNotAllowed. Only datasetContactName, datasetContactAffiliation, datasetContactEmail allowed."

However, the double space shown in all these errors bothers me a bit (after with the following error:):

"message": "The Dataset JSON provided failed validation with the following error: Controlled... "message": "The Dataset JSON provided failed validation with the following error: Invalid data...

Can we reduce it down to a single space?

removed the extra space

pdurbin · 2024-06-14T20:32:27Z

src/main/java/edu/harvard/iq/dataverse/validation/JSONDataValidation.java

+ }
+ }
+ private static void throwValidationException(String key, List<String> argList) {
+ throw new ValidationException(BundleUtil.getStringFromBundle("schema.validation.exception." + key, argList));


Yeah, I agree with this. I'm much more likely to search for the full schema.validation.exception.value.missing than the partial value.missing when hunting down a bug.

Raman used to do the same thing, used to not use the full key.

I can live with it for this PR but maybe we should write it up in our coding style page.

github-actions · 2024-06-17T15:13:50Z

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10169-JSON-schema-validation

ghcr.io/gdcc/configbaker:10169-JSON-schema-validation

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Json dataset validation

3c1ad63

stevenwinship self-assigned this May 7, 2024

stevenwinship added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label May 7, 2024

This comment has been minimized.

Sign in to view

cmbz added Size: 50 A percentage of a sprint. 35 hours. and removed Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels May 8, 2024

rework

33d6b56

This comment has been minimized.

Sign in to view

stevenwinship added 2 commits May 10, 2024 10:04

doc changes

ba8b01b

remove unused class

fe297ba

This comment has been minimized.

Sign in to view

stevenwinship removed their assignment May 10, 2024

sekmiller self-assigned this May 29, 2024

sekmiller reviewed May 31, 2024

View reviewed changes

doc/release-notes/10169-JSON-schema-validation.md Outdated Show resolved Hide resolved

sekmiller requested changes May 31, 2024

View reviewed changes

review comment mods

d40b2e5

This comment has been minimized.

Sign in to view

stevenwinship requested a review from sekmiller June 4, 2024 19:26

sekmiller approved these changes Jun 7, 2024

View reviewed changes

sekmiller removed their assignment Jun 7, 2024

pdurbin self-assigned this Jun 10, 2024

Merge branch 'develop' into 10169-JSON-schema-validation

8bdd4c1

This comment has been minimized.

Sign in to view

pdurbin changed the title ~~Json dataset validation~~ Improved JSON Schema validation for datasets Jun 10, 2024

doc and release note tweaks #10169

76d6ee4

This comment has been minimized.

Sign in to view

pdurbin reviewed Jun 14, 2024

View reviewed changes

pdurbin assigned stevenwinship and unassigned pdurbin Jun 14, 2024

remove extra space in error message per review comment

fc067f8

This comment has been minimized.

Sign in to view

remove hard coded datasetschema.json and load from file

a523093

This comment has been minimized.

Sign in to view

remove hard coded datasetschema.json and load from file

6e41658

stevenwinship removed their assignment Jun 17, 2024

pdurbin mentioned this pull request Jul 2, 2024

bklog: Deliverable - As a system integrator, I would appreciate a JSON Schema for validating my dataset JSON before uploading via API IQSS/dataverse-pm#26

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved JSON Schema validation for datasets #10543

Improved JSON Schema validation for datasets #10543

stevenwinship commented May 7, 2024 •

edited by pdurbin

Loading

coveralls commented May 7, 2024 •

edited

Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

sekmiller May 31, 2024

stevenwinship Jun 3, 2024

sekmiller Jun 7, 2024

pdurbin Jun 14, 2024

sekmiller left a comment

This comment has been minimized.

sekmiller left a comment

This comment has been minimized.

This comment has been minimized.

pdurbin left a comment •

edited

Loading

pdurbin Jun 14, 2024

stevenwinship Jun 17, 2024

pdurbin Jun 14, 2024

stevenwinship Jun 17, 2024

pdurbin Jun 14, 2024

This comment has been minimized.

This comment has been minimized.

github-actions bot commented Jun 17, 2024

Improved JSON Schema validation for datasets #10543

Are you sure you want to change the base?

Improved JSON Schema validation for datasets #10543

Conversation

stevenwinship commented May 7, 2024 • edited by pdurbin Loading

coveralls commented May 7, 2024 • edited Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sekmiller left a comment

Choose a reason for hiding this comment

This comment has been minimized.

sekmiller left a comment

Choose a reason for hiding this comment

This comment has been minimized.

This comment has been minimized.

pdurbin left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

This comment has been minimized.

This comment has been minimized.

github-actions bot commented Jun 17, 2024

stevenwinship commented May 7, 2024 •

edited by pdurbin

Loading

coveralls commented May 7, 2024 •

edited

Loading

pdurbin left a comment •

edited

Loading