Anti-issue: YAML::PP parses JSON that all the other perl JSON modules can't! #1

Open
warewolf opened this issue Mar 27, 2018 · 9 comments

Comments

@warewolf

So yeah, this is an anti-issue - I discovered recently that JSON is "a subset of YAML 1.2"; and then discovered YAML::PP. In short: Thank you. YAML::PP doesn't bomb on JSON that is produced with ham-fisted UTF-8 encoding.

It appears that one company in particular that distributes a data feed has somehow "switched on" interpreting all data ingested as UTF-8, even when it wasn't UTF-8 encoded. Imagine interpreting the header of a ZIP file as Unicode. The result is corrupted garbage, and it isn't standards compliant.
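A minimal sketch of the failure mode described above, assuming the feed decodes arbitrary bytes as UTF-8 and substitutes U+FFFD for whatever does not decode. The sample bytes are hypothetical and were not taken from the actual feed.

```python
# Hypothetical non-UTF-8 input bytes (any legacy encoding or binary data
# would behave similarly); not taken from the real data feed.
raw = b"\xd6\xd0\xb9\xfa"

# Decoding as UTF-8 with errors="replace" substitutes U+FFFD for each
# byte sequence that is not valid UTF-8.
mangled = raw.decode("utf-8", errors="replace")

# Re-encoding the mangled text does not give the original bytes back,
# so the corruption cannot be undone downstream.
round_trip = mangled.encode("utf-8")

print(ascii(mangled))
print(round_trip == raw)
```

Once the replacement characters are in the text, the original bytes are gone, which is why the consumer of the feed cannot repair the values.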

Example:
{"Subject": "CN=\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u0531/OU=\ufffd\ufffd\u01b4\ufffd/OU=\u027d\ufffd\ufffd\ufffd\ude64\ufffd\ufffd\u0467/O=sdlg" }

Nothing else in Perl land seems to be able to parse the above JSON document. YAML::PP does, as of v0.005.

My request: Please let this continue to be the case. If you do end up adding validation of unicode character sequences, give folks an option to turn it off.

@perlpunk
Owner

Heh, that's funny.
My plan actually is to fix that because, as you said yourself, it's invalid (so the JSON modules saying "missing high surrogate character in surrogate pair" are right).
But I see from your example that allowing validation to be turned off can actually be helpful.

I'll leave this open until I have added validation and a corresponding configuration option.
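For readers who want to see which escape triggers the "missing high surrogate" error, here is a rough sketch in Python that scans raw JSON text for `\uXXXX` escapes and reports surrogates without a partner. The helper name `unpaired_surrogates` is made up for this illustration. It is not how YAML::PP or any of the JSON modules implement their checks.

```python
import re

# Matches one \uXXXX escape. Limitation of this sketch: it does not
# distinguish an escaped backslash followed by "u" from a real escape.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unpaired_surrogates(json_text):
    # Return (offset, escape) for every surrogate escape without a partner.
    matches = list(ESCAPE.finditer(json_text))
    found = []
    i = 0
    while i < len(matches):
        m = matches[i]
        code = int(m.group(1), 16)
        if 0xD800 <= code <= 0xDBFF:
            # A high surrogate must be followed immediately by a low one.
            nxt = matches[i + 1] if i + 1 < len(matches) else None
            if (nxt is not None and nxt.start() == m.end()
                    and 0xDC00 <= int(nxt.group(1), 16) <= 0xDFFF):
                i += 2  # well-formed pair, skip both halves
                continue
            found.append((m.start(), m.group(0)))
        elif 0xDC00 <= code <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            found.append((m.start(), m.group(0)))
        i += 1
    return found

# The example document from this issue, kept as raw text.
sample = r'{"Subject": "CN=\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u0531/OU=\ufffd\ufffd\u01b4\ufffd/OU=\u027d\ufffd\ufffd\ufffd\ude64\ufffd\ufffd\u0467/O=sdlg" }'
hits = unpaired_surrogates(sample)
print(hits)
```

In the example document the only offender is `\ude64`, a low surrogate preceded by `\ufffd` instead of a high surrogate.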

@warewolf
Author

Thank you! And yes, it's been a frustrating experience; the folks who generate the data feed don't seem to think it's their problem to solve. ☹️

perlpunk added a commit that referenced this issue Mar 5, 2019
On some systems inf and nan seem to be broken ('0'):
uname='Win32 strawberryperl 5.12.2.0 #1 Fri Nov  5 05:17:27 2010 i386'

t/32.cyclic-refs.t dies, add some debugging
@choroba

choroba commented Apr 2, 2019

Just a comment: jq seems to handle this input without complaints.

@warewolf
Author

warewolf commented Apr 2, 2019

@choroba what version of JQ? Last I checked, JQ was still throwing errors on this sort of badly formed JSON.

@choroba

choroba commented Apr 2, 2019

jq-1.5. It seems 1.6 should be around, too, so maybe it's different.

@warewolf
Author

warewolf commented Apr 2, 2019

Okay, so yes, jq does parse the above example snippet of butchered UTF-8 JSON, but I've got worse examples from the same data feed that jq barfs on. Either way, YAML::PP is still the only way I can reliably parse this kind of JSON in Perl, and I'm super happy that it still works.

@pali

pali commented Jan 23, 2020

@warewolf That JSON is invalid due to an unpaired half of a surrogate pair. How would you like to handle and decode invalid JSON? Such a string has no representation in UTF-8, so you cannot load and decode it. I see two options: 1) skip every non-parsable byte in the input, or 2) replace non-parsable tokens in the JSON string with the Unicode replacement character. But both options change the input, so when processing it in Perl you would have something different.

I understand the analytical reasons for trying to process as much data as possible, but when the input contains invalid data, you need to specify how to handle it in a non-reversible way.
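The second option can be sketched as follows. This is an illustration in Python, not an existing feature of YAML::PP or Cpanel::JSON::XS: it relies on Python's `json.loads` passing unpaired surrogate escapes through as lone code points, then replaces them after parsing. The `scrub` helper and the shortened sample document are made up for the example.

```python
import json

REPLACEMENT = "\ufffd"

def scrub(value):
    # Recursively replace lone surrogate code points with U+FFFD.
    # json.loads has already combined well-formed pairs, so any code
    # point still in the surrogate range here is unpaired.
    if isinstance(value, str):
        return "".join(
            REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch
            for ch in value
        )
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {scrub(key): scrub(item) for key, item in value.items()}
    return value

# Shortened, hypothetical variant of the document from this issue.
doc = json.loads(r'{"Subject": "O=\ude64sdlg", "ok": 1}')
clean = scrub(doc)

# After scrubbing, every string can be encoded as UTF-8 again.
encoded = clean["Subject"].encode("utf-8")
print(ascii(clean))
```

As noted above, the substitution is lossy: the scrubbed value no longer matches the input, but the rest of the structure stays usable.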

@warewolf
Author

@pali well, because of how the JSON is already mangled (non-UTF-8 data interpreted as UTF-8, which ends up completely fubar, as in "this can't be represented in UTF-8"), I honestly don't expect this to be reversible to something consistent. For my use, the actual string values that are corrupted are irrelevant; the rest of the JSON structure I'm parsing does have value, so for me the important part is not bailing on parsing the entire JSON object.

Sadly I can't fix the origin data because it's from a commercial data feed, and apparently Python will gladly serialize to invalid JSON?
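The Python behavior can be reproduced with the standard `json` module. Whether the feed actually uses that module is a guess; the sketch only shows that a Python string holding a lone surrogate serializes without complaint, even though the same string cannot be encoded as UTF-8.

```python
import json

# A Python str may hold a lone surrogate code point.
lone = "\ude64"

# With the default ensure_ascii=True, json.dumps writes it out as a
# \uXXXX escape and raises no error.
as_json = json.dumps({"Subject": lone})
print(as_json)

# The same string has no UTF-8 representation.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```

Python's own `json.loads` reads that output back into the same string, so a pure-Python pipeline would not notice the problem.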

@pali

pali commented Jan 23, 2020

I can imagine that the maintainer of Cpanel::JSON::XS could accept an optional feature to also process invalid JSON strings and replace invalid characters with the Unicode replacement character. So if you really have use cases (which it seems you do), open an issue/feature request for Cpanel::JSON::XS.
