Document and guarantee partial RFC 8785 compatibility for serialization #1197

casey · 2024-10-11T20:18:28Z

I'm using serde_json in an application which requires serialized JSON for the same value to have a consistent hash. This can be accomplished by using a JSON canonicalization scheme, such as RFC 8785.

Issue #309 requested the ability to opt in to canonical JSON serialization, but was closed as outside the scope of this library.

I think this library is very close to producing canonical JSON, and where it doesn't, there are easy workarounds available.

I think it would be valuable to document this and, if possible, guarantee it, so that users can rely on this partial compatibility. This would be for serialization only, since adding checks that deserialized JSON is canonical would be a big lift. Serialization-only compatibility is still very valuable: an application that only hashes JSON it produced itself with serde_json, and that follows the workarounds, can rely on that JSON being canonical.

To summarize RFC 8785:

No inter-token whitespace allowed.
Literals: null, true, and false are always serialized as null, true, and false.
Strings: Characters which have a dedicated escape (e.g. \n) are serialized with that dedicated escape. The remaining control characters in the range U+0000 through U+001F are serialized as \uHHHH with lowercase hex digits, and all other characters are serialized as-is.
Numbers: Here is where the spec gets crazy. It defers to ECMA-262 for number serialization. However, all that complexity is for floating point numbers; integers are serialized in the normal way.
Object properties: Object properties must be sorted. Unfortunately, the spec requires that property names be sorted as arrays of UTF-16 code units.
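
To make the string rule above concrete, here is a minimal std-only Rust sketch of RFC 8785 string escaping. It is not serde_json's implementation, and the function name `canonical_string` is hypothetical. It could serve as a reference to compare a serializer's output against in tests.

```rust
/// Serialize `s` as a JSON string literal using the RFC 8785 string rules.
/// Illustrative sketch only; `canonical_string` is a hypothetical helper,
/// not part of serde_json.
fn canonical_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            // Quote and backslash always use their two-character escapes.
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            // Control characters with a dedicated escape.
            '\u{0008}' => out.push_str("\\b"),
            '\t' => out.push_str("\\t"),
            '\n' => out.push_str("\\n"),
            '\u{000C}' => out.push_str("\\f"),
            '\r' => out.push_str("\\r"),
            // Remaining control characters: \uXXXX with lowercase hex digits.
            c if c <= '\u{001F}' => out.push_str(&format!("\\u{:04x}", c as u32)),
            // Everything else is emitted as-is, including non-ASCII characters.
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

For example, `canonical_string("a\nb")` yields the six characters `"a\nb"`, and a U+0001 control character becomes `\u0001`.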

This yields a small number of workarounds that a current user of serde_json can use to produce canonical JSON:

Don't use pretty-printing.
No workaround needed, literals are already serialized as their canonical representation.
No workaround needed, strings are already serialized as their canonical representation.
Don't use floats. Integers are already serialized as their canonical representation.
When serializing a Value, don't use preserve_order, and don't use object properties with code points outside 0-127, which may sort differently as UTF-8 bytes than as UTF-16 code units. When serializing with the derive macro, manually sort struct fields.
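
The key-ordering caveat can be shown with a short std-only Rust sketch. The helper names `cmp_utf16`, `sort_keys_canonical`, and `keys_are_ascii` are hypothetical and not part of serde_json. The two orderings only disagree when a character above U+FFFF (encoded as a surrogate pair in UTF-16) is compared with one in the range U+E000 through U+FFFF, so restricting keys to ASCII is a conservative rule that is easy to check.

```rust
use std::cmp::Ordering;

/// Compare two property names as sequences of UTF-16 code units,
/// which is the ordering RFC 8785 requires.
fn cmp_utf16(a: &str, b: &str) -> Ordering {
    a.encode_utf16().cmp(b.encode_utf16())
}

/// Sort property names in RFC 8785 order (hypothetical helper).
fn sort_keys_canonical(keys: &mut [&str]) {
    keys.sort_by(|a, b| cmp_utf16(a, b));
}

/// True if every key is pure ASCII, in which case byte order and
/// UTF-16 code unit order are guaranteed to agree.
fn keys_are_ascii(keys: &[&str]) -> bool {
    keys.iter().all(|k| k.is_ascii())
}
```

With the keys `"\u{FB33}"` and `"\u{1F600}"`, Rust's default `str` comparison (byte order) puts U+FB33 first, while `cmp_utf16` puts U+1F600 first because its leading surrogate 0xD83D is smaller than 0xFB33.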

I think these workarounds are actually pretty easy to follow, and being able to rely on partial RFC 8785 compatibility would be valuable.

So, I think my proposal would be, as a first step, to document and guarantee those places where this library is RFC 8785 compatible, in particular:

No inter-token whitespace is added when not pretty printing.
Serialized literals are guaranteed to be canonical.
Serialized strings are guaranteed to be canonical.
Serialized integers are guaranteed to be canonical.

This would be super valuable, at least to me, since the above guarantees would make it easy for me to produce canonical JSON.

In addition, tests should be added to ensure that these things are actually true and stay true. In particular, adding tests for arbitrary precision integers, which I'm not sure are canonical, and tests for all the string edge cases would be nice.
