I'm using serde_json in an application which requires serialized JSON for the same value to have a consistent hash. This can be accomplished by using a JSON canonicalization scheme, such as RFC 8785.
Issue #309 requested the ability to opt in to canonical JSON serialization, but was closed as outside the scope of this library.
I think this library is very close to producing canonical JSON, and where it doesn't, there are easy workarounds available.
I think it would be valuable to document this and, if possible, guarantee it, so that users can rely on this partial compatibility. This would be for serialization only, since adding checks that deserialized JSON is canonical would be a big lift. Serialization-only compatibility is still very valuable: an application that only ever hashes JSON it produced itself with serde_json, and that follows the workarounds, can rely on that JSON being canonical.
To summarize RFC 8785:
1. Whitespace: No inter-token whitespace is allowed.
2. Literals: null, true, and false are always serialized as null, true, and false.
3. Strings: All characters which have a dedicated escape (e.g. \n) are serialized with that dedicated escape. Other control characters in the range U+0000 through U+001F are serialized as \uHHHH with lowercase hex digits, and all other characters are serialized as-is.
4. Numbers: Here is where the spec gets crazy. It defers to ECMA-262 for number serialization. However, all of that complexity concerns floating-point numbers; integers are serialized as plain decimal digits, with no fraction or exponent.
5. Object properties: Object properties must be sorted. Unfortunately, the spec requires that property names be compared as arrays of UTF-16 code units, not as UTF-8 bytes.
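To make the string rule concrete, here is a minimal sketch of it in plain Rust. It is not serde_json's implementation, and the function name `canonical_json_string` is made up for illustration; it only shows what the rule asks of a serializer.

```rust
use std::fmt::Write;

/// Hypothetical helper: serialize `s` as a JSON string literal using the
/// escaping rules summarized above. A sketch, not serde_json's code.
fn canonical_json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            // Characters with a dedicated two-character escape.
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\u{0008}' => out.push_str("\\b"),
            '\t' => out.push_str("\\t"),
            '\n' => out.push_str("\\n"),
            '\u{000C}' => out.push_str("\\f"),
            '\r' => out.push_str("\\r"),
            // Remaining control characters: \uHHHH with lowercase hex.
            c if (c as u32) < 0x20 => {
                write!(out, "\\u{:04x}", c as u32).unwrap();
            }
            // Everything else, including non-ASCII, is emitted as-is.
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

A test suite for the string guarantee could compare the serializer's output against a reference like this one for every control character and a sample of non-ASCII input.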
This yields a small number of workarounds that a current user of serde_json can use to produce canonical JSON, one per rule above:
1. Don't use pretty-printing.
2. No workaround needed; literals are already serialized as their canonical representation.
3. No workaround needed; strings are already serialized as their canonical representation.
4. Don't use floats. Integers are already serialized as their canonical representation.
5. When serializing a Value, don't use preserve_order, and don't use object property names with code points outside the ASCII range (0-127), which may sort differently as UTF-8 than as UTF-16. When serializing with the derive macro, manually sort struct fields.
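The reason for the ASCII restriction on property names can be shown with the standard library alone. The sketch below compares names by UTF-16 code units, as RFC 8785 requires; the helper names `cmp_utf16` and `sort_keys_utf16` are hypothetical and not part of serde_json.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: compare two property names by UTF-16 code units.
fn cmp_utf16(a: &str, b: &str) -> Ordering {
    a.encode_utf16().cmp(b.encode_utf16())
}

/// Hypothetical helper: sort property names into UTF-16 code unit order.
fn sort_keys_utf16(keys: &mut [String]) {
    keys.sort_by(|a, b| cmp_utf16(a, b));
}
```

Rust's default string ordering compares UTF-8 bytes. For ASCII-only names the two orders agree, but they can disagree once a name contains a character outside the Basic Multilingual Plane: `"\u{FB33}"` sorts before `"\u{1F600}"` as UTF-8, and after it as UTF-16, because the latter is encoded as a surrogate pair starting with 0xD83D.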
I think these workarounds are actually pretty easy to follow, and being able to rely on partial RFC 8785 compatibility would be valuable.
So my proposal, as a first step, is to document and guarantee the places where this library is already RFC 8785 compatible, in particular:
- No inter-token whitespace is added when not pretty printing.
- Serialized literals are guaranteed to be canonical.
- Serialized strings are guaranteed to be canonical.
- Serialized integers are guaranteed to be canonical.
This would be super valuable, at least to me, since the above guarantees would make it easy for me to produce canonical JSON.
In addition, tests should be added to ensure that these things are actually true and stay true. In particular, adding tests for arbitrary precision integers, which I'm not sure are canonical, and tests for all the string edge cases would be nice.
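As a rough idea of what an integer test could assert, here is a dependency-free sketch of a check on the serialized text. The function `is_canonical_integer` and the constant `MAX_SAFE_INTEGER` are hypothetical names. The range limit is my own conservative assumption: since RFC 8785 defers to ECMA-262 number serialization, the check only accepts magnitudes up to 2^53 - 1, which every double-based implementation represents exactly. Some larger integers are also canonical, so this check rejects more than strictly necessary.

```rust
/// Assumed conservative bound: 2^53 - 1, the largest integer magnitude
/// below which every integer is exactly representable as an IEEE 754 double.
const MAX_SAFE_INTEGER: i128 = 9_007_199_254_740_991;

/// Hypothetical test helper: does `text` look like a canonical integer?
fn is_canonical_integer(text: &str) -> bool {
    let digits = text.strip_prefix('-').unwrap_or(text);
    // Digits only: rules out '+', fractions, exponents, and empty input.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    // No leading zeros, and no negative zero.
    if (digits.len() > 1 && digits.starts_with('0')) || text == "-0" {
        return false;
    }
    // Conservative range check; values too large for i128 fail to parse.
    match digits.parse::<i128>() {
        Ok(magnitude) => magnitude <= MAX_SAFE_INTEGER,
        Err(_) => false,
    }
}
```

A real test would feed this the output of `serde_json::to_string` for boundary values of each integer type, and for arbitrary precision integers, which is where I'd expect the interesting failures to show up.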