Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistancy issues in JSON schema #75

Open
xXyepXx opened this issue Jan 20, 2024 · 14 comments
Open

Consistancy issues in JSON schema #75

xXyepXx opened this issue Jan 20, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@xXyepXx
Copy link

xXyepXx commented Jan 20, 2024

Describe the bug
The definition of the schema in schema.json specifies that some fields may be one of several types.

For instance, the field implies can either be a string or an array (of strings):

"implies": {
  "oneOf": [
    {
      "type": "array",
      "items": {
        "$ref": "#/definitions/non-empty-non-blank-string"
      }
    },
    {
      "$ref": "#/definitions/non-empty-non-blank-string"
    }
  ]
},

This makes it really hard to parse the JSON object in languages like Go where you need to define the types statically:

type Techno struct {
    Categories  []int    `json:"cats"`
    Implies     []string `json:"implies"`  // This may just be a string!
}

Trying to deserialize a JSON object into an instance of this struct will return an error if the JSON input is using a string type instead of an array of strings:

json: cannot unmarshal string into Go struct field Techno.implies of type []string

Expected behavior
Have a single type for each field.

@enthec-opensource
Copy link
Member

This is an issue we have encountered aswell
the original project is not ours, we just started maintaining adding technologies and fixing json objects, always trying not to break old implementations of wappalyzer.
we could deal with this by just replacing all "[]string OR string" by just "[]string" as webappanalyzer implementations parse string or []string anyways.
Im going to leave this issue open and link commits related to this.
is not as good practice and makes the code less readable, but you can json unmarshal to map[string]interface{} and assert the interface{} (obj.([]string)) or just by using "reflect".

@enthec-opensource enthec-opensource added the enhancement New feature or request label Jan 29, 2024
@kingthorin
Copy link
Contributor

kingthorin commented Feb 8, 2024

I was the one that added the schema to the original project, it was added 'after the fact'. Then was modified further after that. Since it was added after the data files existed (not before), it had to conform to the data present after years of different people contributing with no guidance or format restrictions.

I agree if it was more strongly typed that would really help.

On the off chance that it might help someone here's the parser we use within ZAPs technology detection add-on: https://github.com/zaproxy/zap-extensions/blob/main/addOns/wappalyzer/src/main/java/org/zaproxy/zap/extension/wappalyzer/WappalyzerJsonParser.java
(That code has also evolved over time and been worked on by maybe a dozen people so if it seems horrid I won't be surprised 😉)

@mlec1
Copy link
Contributor

mlec1 commented Jul 16, 2024

I made some contribution in order to unify and simplify the schema.
I hope it fits your needs. If you see other fields that could be improved, just raise your idea

@kingthorin
Copy link
Contributor

Thanks. Seems fine on my end. Nice work!

@enthec-opensource
Copy link
Member

unsure about what to do with the dom selectors, there are too many variations

string
"dom": "form[name='formLogin'][action='login.aspx' i][id='formLogin']"

string array:
"dom": ["form[name='formLogin'][action='login.aspx' i][id='formLogin']", ...]

object
"dom": {
"img[src*='provenexpert']": {
"attributes": {
"src": "images\.provenexpert\.\w+"
}
}
}

I'd say transform string and string array to the object structure, like "dom": {"form[name='formLogin'][action='login.aspx' i][id='formLogin']": ""}, seems more flexible.

However this change might break some implementations so i thought on creating a new one called "selector" and deprecating the "dom" one.

@kingthorin
Copy link
Contributor

Well assuming people are already using the objects then moving strings and arrays into objects "shouldn't" break anything ... Though I guess there might be some just using the strings and string array elements 🤷‍♂️

@enthec-opensource
Copy link
Member

Yeah the thing is i'm not sure what is the people using right now, my guess would be json object aswell, I checked some libs on github and people are not even using the query selectors

@kingthorin
Copy link
Contributor

I can say that ZAP uses both but I think our implementation south continue to work.

The easy, consistent, and quick'ish thing I guess is to make all the single strings into single element arrays as was done elsewhere. Then atleast it's down to two types vs. three.

@enthec-opensource
Copy link
Member

I guess, still not consistent you can't properly map it on a strongly typed language -- but I'm not sure if you could anyways, since it seems to have unknown depth, so...

@enthec-opensource
Copy link
Member

there appears to be a limited amout of keys inside the selector

  • attributes: with an object inside ex: "dom": {"attributes": {"href": "pattern", "src": "pattern"}}
  • properties: with an object inside ex: "dom": {"properties": {"_reactRootContainer": "pattern"}}
  • text: with a string inside ex: "dom": {"text": "pattern"}
  • exists: with an empty string inside ex: "dom": {"exists": ""}

any of the "pattern" or empty string "" can contain the \;confidence and version thing inside

Of course many if them seem to be very broken so im going to work on them now.

But still you can't map it because even tho attributes, text or exists seem to be mandatory, the key that contains it is dynamic, as well as the keys inside the attributes obj...

a better way of mapping those objects would be like

{
  "dom": [
    {
      "selector": "link[href*='/wp-content/plugins/sfwd-lms/']",
      "attributes": [
        {
          "name": "href",
          "value": "whatever"
        },
        {
          "name": "src",
          "value": ""
        }
      ],
      "text": "Powered By ...(\\d+)\\;version:\\1"
    }
  ]
}

so its easier to scale aswell like this

{
  "dom": [
    {
      "selector": "link[href*='/wp-content/plugins/123/']",
      "attributes": [
        {
          "name": "href",
          "value": "whatever"
        },
      ]
    },
    {
      "selector": "link[href*='/wp-content/plugins/456/']",
      "text": "Powered By ...(\\d+)\\;version:\\1"
    }
  ]
}

also I'm not sure what the exists is for, since if you dont want it, just dont put it.

@enthec-opensource
Copy link
Member

enthec-opensource commented Jul 25, 2024

anyways, knowing this structure at least we can properly validate everything, not sure how many people use this objects as they are supposed to since there is no documentation of it.

Also many of the objects are broken no idea how this has been working in prod, I've seen at least 5 or 6 randomly structured objects.

@kingthorin
Copy link
Contributor

Remember they aren't regex they're element selectors.

Our project is using them without issue.

@enthec-opensource
Copy link
Member

nono, the selectors are correct, i already validated them all and its included in the github workflow already, but i meant that some objects are not configured following the structure required, for example:
image

im not sure about the {"exists": ""} thing

@kingthorin
Copy link
Contributor

Oh okay, now I see what you mean

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants