Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegEx Validation Using Java Patterns (perf) #176

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

olnrao
Copy link

@olnrao olnrao commented Feb 15, 2016

While using this library we found RhinoHelper takes way too much time for pattern validations using Visual VM CPU Profiling.

Here is an observation:

  • JSON Schema: Roughly 30 fields, 10 fields have one or the other validation, 5 of them have patterns (listed below)
  • Input Data: An array of 10K messages
  • Time Taken:
    -- With out pattern annotation: ~6 seconds
    -- With Rhino Helper: ~30 seconds
    -- With Java Patterns: ~7 seconds

Patterns used:
"pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-",
"pattern" : "^[+-][0-1][0-9]:[0-5][0-9]$",
"pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-",
"pattern": "[^\s]{1}",
"pattern": "[^\s]{1}",

While using this library we found RhinoHelper takes way too much time for pattern validations using Visual VM CPU Profiling.

Here is an observation:
- JSON Schema: Roughly 30 fields, 10 fields have one or the other validation, 5 of them have patterns (listed below)
- Input Data: An array of 10K messages
- Time Taken: 
-- With out pattern annotation: ~6 seconds
-- With Rhino Helper: ~30 seconds
-- With Java Patterns: ~7 seconds

Patterns used:
      "pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-",
      "pattern" : "^[+-][0-1][0-9]:[0-5][0-9]$",
      "pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-",
      "pattern": "[^\\s]{1}",
      "pattern": "[^\\s]{1}",
@huggsboson
Copy link
Member

Hi, thanks for the pull request. Sadly the json-schema spec explicitly dictates that the patterns should be ecma-262 style regexes. Changing to just use java's pattern matcher will more than likely introduce some incompatibilities. Maybe if we can dependency inject the pattern matcher you can have your own "non-standard" but fast checker. Or if there is a native version of java regexp matcher that more closely mirrors javascript rules.

@olnrao
Copy link
Author

olnrao commented Feb 17, 2016

Thanks for looking into this and a quick response, huggsboson.

I was not aware of ECMA standard alignment. So, scratch my patch :).

Coming to alternatives, I would like to have following changes (which I
would love to contribute):

  • For date-time, support min and max values. For example, birth_date field
    is a date-time and I want to have min-value as Jan 1, 1900
  • For timezone, can we represent w/ a more specific data type than just
    string?

If both of the above changes are possible (and does not conflict with ECMA
Standard), then I would like to work on those changes.

Two of the patterns we use are actually date patterns :)

Thanks,
Laxmi

On Wed, Feb 17, 2016 at 3:07 AM, huggsboson [email protected]
wrote:

Hi, thanks for the pull request. Sadly the json-schema spec explicitly
dictates that the patterns should be ecma-262 style regexes. Changing to
just use java's pattern matcher will more than likely introduce some
incompatibilities. Maybe if we can dependency inject the pattern matcher
you can have your own "non-standard" but fast checker. Or if there is a
native version of java regexp matcher that more closely mirrors javascript
rules.


Reply to this email directly or view it on GitHub
#176 (comment)
.

Laxmi

@huggsboson
Copy link
Member

I'm trying to think... this feels like something where a native date-time type would do a better job than trying to extend regular expressions. I'm less familiar with what we have already built-in with regards to dates and times, but I believe there is something. (Maybe see if the spec has a reference to dates?)

@azzmosphere
Copy link

Is this still and issue? I have noticed that the issue has not been commented on since March 2016 and that the latest version of the validator allows for the "format":"date-time" keywords which implements the RFC 3339, section 5.6 interpretation of date time. I am therefore assuming this issue to be no longer relevant.

I tested this with the following code:

{ "$schema" : "http://json-schema.org/draft-04/schema#", "format":"date-time" } ... "2017-01-01T10:10:10.111Z"
and it seemed good, and all though I did not check the speed, judging by @huggsboson comment believe this definitely proves there is something already build-in with regards to dates and times.

@sampath-methuku
Copy link

This Sampath an engineer who worked with @olnrao, and observed the performance issue with regex pattern matchings. I will run the same test data against new code in a couple of days and will let you know whether performance has really improved or not.

public PatternValidator(final JsonNode digest)
{
_patterns = new HashMap<String, Pattern>();
super("pattern");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super should be the first statement in the constructor, otherwise, compilation fail with

call to super must be first statement in constructor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@sampath-methuku
Copy link

@olnrao there is compilation issue please fix that,

I have fixed that compilation locally and tested with following.

  1. schema with 3 date patterns, 2 text field patterns ```
    "pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-",
    "pattern" : "^[+-][0-1][0-9]:[0-5][0-9]$",
    "pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-",
    "pattern": "[^\s]{1,3000}",
    "pattern": "[^\s]{1,200}",

JSON data file had 1000 messages, iterated the same JSON input validation with schema 1000 times.
done the above the exercise with  existing latest version of json-schema-validator i.e 2.2.6 
 and with the fix regex-validation-perf.

we have observed following results.

CPU profiling is done using visual tool

Total Time taken by com.github.fge.jsonschema.keyword.validator.common.PatternValidator.validate() 

with 2.2.6                                  -->  295,955ms
with regex-validation-perf fix  -->      8,000ms

which is a very good improvement in performance,   Attaching the results 

![cpu-profile-with-regex-validation-perf-fix](https://cloud.githubusercontent.com/assets/15689073/23769761/724b61a8-0536-11e7-955c-fc5f46ba2f73.png)

![cpu-profile-with-2 2 6-version](https://cloud.githubusercontent.com/assets/15689073/23769762/724c89ac-0536-11e7-8ecf-2722f4a58b64.png)




[snapshorts.zip](https://github.com/daveclayton/json-schema-validator/files/832246/snapshorts.zip)

@sampath-methuku
Copy link

The actual issue for which this pull request was raised is,

for regular expression in the schema, pattern matching performance is very poor for this library since it is not storing the compiled pattern of regular expression , for very JSON message validation
code was using regMatch(regex ,text), which is less performance compared to pre-compiled patterns
@olnrao gave performance fix by adding code to compile and store the complied Pattern for regex in a local cache(hashMap) and reusing of this compiled Pattern for all message pattern matching for improving the performance.

we have date-time format fields in our scheme but we want min & max values for this fields.
ex: we want to accept filed
"publishedAt" with date-time values and should be after Unix epoch i.e [1970-01-01 00:00:00],
"format":"date-time" "pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-"
"birthDate" with date-time values and should not be before 1900-01-01 00:00:00
"format":"date-time" "pattern": "^(19[0-9][0-9]|[2-9][0-9][0-9][0-9])-"
"name" text filed should accept only english characters (at least 1 character)
"pattern": "^[a-zA-Z]+$"

the date-time format in the schema can be applied on string type, but there is no native date-time type in json schema.

so basically along with date-time format support in JSON schema we need, min date and max date support in JSON shema,

schema
{ "$schema": "http://json-schema.org/draft-04/schema#", "type": "object", "properties": { "publishedAt": { "type": "string", "format": "date-time", "pattern": "^(19[7-9][0-9]|[2-9][0-9][0-9][0-9])-" }, "birthDate": { "type": "string", "format": "date-time", "pattern": "^(19[0-9][0-9]|[2-9][0-9][0-9][0-9])-" }, "name": { "type": "string", "pattern": "^[a-zA-Z]+$" } } }
mesage
{ "publishedAt": "2010-07-29T17:47:46.000Z", "birthDate": "1920-07-29T17:47:46.000Z", "name": "Raj" }

So I strongly support this fix and recommend fix should go to master branch. Please let me know any thing I have missed here or mis-interpreted.

@azzmosphere
Copy link

@sampath-methuku aside from the change you mentioned above that fixes the compilation error, "
super("pattern");" I am all for the fix. Compiled regular expressions are much better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants