JSON #6

01mf02 · 2022-08-08T10:09:32Z

01mf02
Aug 8, 2022
Maintainer

I understand that one motivation behind the new Dedukti standard is the exchange of theories across tools such as Dedukti, Lambdapi, etc.
I would therefore like to propose a seemingly radical idea: use JSON to encode theories!

JSON is an industry standard for data representation. There is a vast ecosystem of tools around it that helps with processing / parsing / generation. Moreover, it can also still be read by human beings.

I hereby propose a preliminary encoding of lambda-Pi terms in JSON:

Constants without a module prefix are encoded as strings, such as "eta". This allows us to drop syntax à la {|...|} to escape special characters.
Constants with a module prefix are encoded as maps with a single element "c", namely {"c": ["theory_u", "prop"]}. This allows us to extend the module system later to nested modules, if desired.
Variables are encoded as de Bruijn indices (unsigned integers).
Applications are encoded as lists, that is, something like eta (arrow x y) in DK2 syntax becomes ["eta", ["arrow", 1, 0]]. (The empty list could be used to encode Type / TYPE.)
Variable bindings (as they occur for example in lambda- and Pi-abstractions) are encoded as maps, where a variable of name x and of type t is encoded as: {"v": x, "t": t}. Both keys v and t can be omitted for lambda-abstractions, and for Pi-abstractions, only v can be omitted.
A lambda-abstraction with variable bindings b1, ..., bn to a term t is encoded as {"l": [b1, ..., bn], "t": t}.
Pi-abstractions are encoded like lambda-abstractions, only with the key "P" instead of "l".

Example: The DK term

n: nat -> l: list n -> x: A -> list (succ n)

becomes

{"P": [{"v": "n", "t": "nat"}, {"v": "l", "t": ["list", 0]}, {"v": "x", "t": "A"}], "t": ["list", ["succ", 2]]}

We could also omit the variable names, yielding:

{"P": [{"t": "nat"}, {"t": ["list", 0]}, {"t": "A"}], "t": ["list", ["succ", 2]]}

While this looks a bit scary, running jq . on it quickly gave me a formatted and syntax-highlighted version of this term.

Pros of the JSON encoding:

Fast & simple parsing: I was able to write a parser/writer for the shown JSON terms using the serde framework in about three hours (without ever having done something like this before). The resulting parser (without any effort spent on optimisation) was already as fast as my hand-rolled, optimised parser developed over weeks. Given that parsing performance is currently a bottleneck in Dedukti, this could speed up Dedukti as a whole (OCaml bindings).
Tooling: There exist many tools to process JSON, such as the aforementioned jq. This makes it very easy to process the data; for example, to obtain the list of all symbols occurring in a theory, it suffices to run: jq '.. | select(type == "string")'
Size: In the example above, the JSON version is larger than the equivalent DK. However, in a preliminary experiment where I converted the terms of the Isabelle/HOL dataset to JSON, the JSON output was roughly the same size as the Dedukti input.
More efficient encoding possible: JSON files can be easily converted to a more efficient binary format such as MessagePack (OCaml bindings). In a preliminary experiment, this yielded significantly smaller output than DK.
No scoping required: Storing variables as de Bruijn indices can be considered a pro or a con. This representation is definitely more efficient; however, readability suffers.
Backward compatibility: Because there are no keywords in JSON, we avoid the situation in Dedukti where adding new keywords (such as "private") breaks older theories that used "private" as an identifier somewhere.
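For readers without jq at hand, the string-collecting query from the tooling point can be approximated in a few lines of Python. This is only a sketch; the function name `strings` is made up for illustration. Note that, like the jq filter, it returns every JSON string value, so with the encoding above it would also pick up binder names stored under "v" and the components of module-prefixed constants, not just constant names.

```python
import json

def strings(value):
    """Yield every string value nested anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from strings(item)
    elif isinstance(value, dict):
        for item in value.values():     # values only, keys are skipped
            yield from strings(item)
    # numbers (de Bruijn indices), booleans and null carry no strings

term = json.loads(
    '{"P": [{"v": "n", "t": "nat"}, {"v": "l", "t": ["list", 0]},'
    ' {"v": "x", "t": "A"}], "t": ["list", ["succ", 2]]}'
)
print(sorted(set(strings(term))))
```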

Cons of JSON:

Readability: I wholeheartedly admit that reading JSON terms is not as easy as reading DK2 terms. However, if we consider this format mostly as an exchange format for automatically generated proofs, then this point might not matter that much.

The encoding I showed above could be generalised to commands such as def, thm, rewrite rules and so on.

My vision would be the generation and exchange of Dedukti theories happening in a format as described above, while writing theories manually could continue to be performed in the established syntax of Dedukti or Lambdapi.

I am looking forward to your opinions on this.

01mf02 · 2022-08-08T14:17:57Z

01mf02
Aug 8, 2022
Maintainer Author

P.S.: To give you an idea of the theory sizes for Isabelle/HOL:

Format	Size	BZ2'ed
DK	2582MB	20.7MB
JSON (terms only)	2647MB	26.4MB
MessagePack (terms only)	2276MB	22.9MB

Note: Terms make up about 99% of the theory size.
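To read the table in relative terms, the following sketch expresses each size as a fraction of the DK size. The numbers are copied from the table; the names `SIZES` and `relative_to_dk` are illustrative only. Keep in mind that the JSON and MessagePack rows cover terms only, so the comparison is approximate.

```python
# Sizes in MB, copied from the table: (uncompressed, BZ2'ed).
SIZES = {
    "DK": (2582, 20.7),
    "JSON": (2647, 26.4),
    "MessagePack": (2276, 22.9),
}

def relative_to_dk(fmt):
    """Return (uncompressed, BZ2'ed) size of `fmt` as a fraction of the DK size."""
    raw, bz2 = SIZES[fmt]
    dk_raw, dk_bz2 = SIZES["DK"]
    return raw / dk_raw, bz2 / dk_bz2

for fmt in SIZES:
    raw_ratio, bz2_ratio = relative_to_dk(fmt)
    print(f"{fmt}: {raw_ratio:.1%} of DK uncompressed, {bz2_ratio:.1%} of DK after BZ2")
```

By this arithmetic, uncompressed JSON is a few percent larger than DK and MessagePack is roughly 12% smaller, while after BZ2 compression both come out larger than DK.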

0 replies

dwarfmaster · 2022-08-08T14:27:33Z

dwarfmaster
Aug 8, 2022
Maintainer

With an appropriate pretty-printer, it might actually be a good idea, but I can't help thinking that the reduced readability would make debugging translators more complicated.

1 reply

01mf02 Aug 10, 2022
Maintainer Author

What would you think about the readability of terms as S-expressions?

Something along the lines of:

(:lam x #(y bool) (eq x y)) for x => y : bool => eq x y
(:Pi nat #(x nat) #(y bool) (x y)) for nat -> x : nat -> y : bool -> x y
(mod1 mod2 . asdf) for mod1.mod2.asdf (if Dedukti ever gets nested modules ^^)

These would also allow for simple parsers (and probably for some degree of automatic processing), while permitting a vastly more compact representation.
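As a rough illustration of how small a reader for such S-expressions could be, here is a Python sketch. It makes its own assumptions: `#(` is treated as opening a typed binder, atoms are any run of characters other than whitespace and parentheses, and the names `TOKEN` and `parse` are invented for this example. It performs no validation of the term structure and raises `IndexError` on unbalanced input.

```python
import re

# "#(" opens a binder, "(" and ")" delimit lists, anything else is an atom.
TOKEN = re.compile(r"#\(|[()]|[^\s()]+")

def parse(text):
    """Parse one S-expression into nested Python lists.

    Plain lists become Python lists; `#( ... )` becomes ("bind", [...]).
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in ("(", "#("):
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1                      # consume the closing ")"
            return ("bind", items) if tok == "#(" else items
        return tok                        # atom

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing input after the first expression")
    return result

print(parse("(:lam x #(y bool) (eq x y))"))
print(parse("(:Pi nat #(x nat) #(y bool) (x y))"))
```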

francoisthire · 2022-08-08T17:59:54Z

francoisthire
Aug 8, 2022
Maintainer

I don't think we need a single format for both interoperability and debugging. One can use two formats, in the same way that we already have dk and dko, in some sense.

Claudio Sacerdoti Cohen proposed a similar idea for Coq using XML. Since JSON is in some sense the new XML, it could make sense.

Overall, it could make sense to use JSON. But I kind of disagree with the current list of pros and cons:

> I was able to write a parser/writer for the shown JSON terms using the serde framework in about three hours (without ever having done something like this before).

I think Menhir users would write a parser in the same amount of time. Writing an efficient parser with Menhir is, AFAIK, not a real issue.

> The resulting parser (without any effort spent on optimisation) was already as fast as my hand-rolled, optimised parser developed over weeks.

I would prefer to compare with Menhir in that case. You are using a framework, the same way Menhir can be seen as a framework.

> Tooling: There exist many tools to process JSON, such as the aforementioned jq. This makes it very easy to process the data; for example, to obtain the list of all symbols occurring in a theory, it suffices to run: jq '.. | select(type == "string")'

That's a HUGE pro. Most common programming languages already have such tooling.

> Size: In the example above, the JSON version is larger than the equivalent DK. However, in a preliminary experiment where I converted the terms of the Isabelle/HOL dataset to JSON, the JSON output was roughly the same size as the Dedukti input.

Since it is about the same, I would say it is neither a pro nor a con.

> No scoping required: Storing variables as de Bruijn indices can be considered a pro or con. This representation is definitely more efficient, however, readability suffers.

I don't understand why it is a pro here. This depends on your internal representation of terms. I don't understand how this point differs from a Menhir grammar, for example.

> Constants without a module prefix are encoded as strings, such as "eta". This allows us to drop syntax à la {|...|} to escape special characters.

Are you sure about this point?

> Readability: I wholeheartedly admit that reading JSON terms is not as easy as reading DK2 terms. However, if we consider this format mostly as an exchange format for automatically generated proofs, then this point might not matter that much.

I think it does matter. However, it also means we can have an external tool like dkprettify which takes JSON input and shows readable output. This could even be integrated as a plugin inside an editor.

Which leads to another con: it might require more tooling on our side, especially to support text editors.

7 replies

francoisthire Aug 13, 2022
Maintainer

> It depends on your definition of "efficient". If you mean "time-efficient", then I have to disagree with you.
> My research shows that on many large proof corpora, parsing with Dedukti's current Menhir parser takes significant amounts of time.

My usage of Dedukti shows that time-wise, it is most of the time the reduction engine which is a roadblock. If we are looking for the best performance when writing a parser, taking readability into account is weird. In that case, I would rather consider two formats; the point is that it is easy to go from one format to the other.

> Yes --- excluding pathological cases, such as invalid Unicode or such.

That's my point. For example, what about identifiers containing "?

> What do you think about the S-expression syntax I proposed?

I don't think S-expressions are really readable. They are better than JSON but still harder to understand than a syntax with infix operators. But it is not a strong opinion either. I would need to see larger terms using this S-expression syntax.

Another point. We could optimize your representation by removing objects with field "v" and "t" for products and just having an array instead, no?

01mf02 Aug 18, 2022
Maintainer Author

> My usage of Dedukti shows that time-wise, it is most of the time the reduction engine which is a roadblock.

It probably depends on the dataset you are looking at.
Among all three datasets I have analysed, the parser of Dedukti took between 25% and 48% of the total runtime.
On the Isabelle/HOL dataset, we can see that checking & parsing can be done even faster than just parsing alone in Dedukti.

(DK is Dedukti, KO is Kontroli, DK∩p is just Dedukti's parser, and KO∩p is just Kontroli's parser)

> If we are looking for the best performance when writing a parser, taking readability into account is weird. In that case, I would rather consider two formats; the point is that it is easy to go from one format to the other.

And if we could find a format that is both readable and fast to parse?

01mf02 Aug 18, 2022
Maintainer Author

> That's my point. For example, what about identifiers containing "?

These could be easily handled by escaping them, with \".
(By the way, the same argument could be made with today's Dedukti mechanism of escaping: What do you do when you have an identifier containing "|}"?)
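As a small illustration of the escaping argument, the following Python sketch round-trips a few awkward identifiers through the standard `json` module. The sample identifiers and the helper name `roundtrip` are made up for this example.

```python
import json

def roundtrip(identifier):
    """Encode an identifier as a JSON string and check it decodes unchanged."""
    encoded = json.dumps(identifier)
    assert json.loads(encoded) == identifier
    return encoded

# Identifiers containing a double quote, Dedukti's closing delimiter, and a backslash.
for name in ['say"hi', "a|}b", "back\\slash"]:
    print(name, "->", roundtrip(name))
```

The double quote comes out as \" and the backslash as \\, while |} needs no escaping at all in JSON.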

01mf02 Aug 18, 2022
Maintainer Author

> We could optimize your representation by removing objects with field "v" and "t" for products and just having an array instead, no?

Possible, yes, but this makes it harder to say: in this dependent product, we do not give a variable name, and in this lambda-abstraction, we do not give a type. Or combinations of these.

francoisthire Aug 18, 2022
Maintainer

> These could be easily handled by escaping them, with \".
> (By the way, the same argument could be made with today's Dedukti mechanism of escaping: What do you do when you have an identifier containing "|}"?)

We have implemented the easy version. But the point would be to allow "{xxx|" so that the closing delimiter is "|xxx}". This way you never need an escaping mechanism. Note also the choice of "{|" so that this trick never needs to be used.

> Possible, yes, but this makes it harder to say: in this dependent product, we do not give a variable name, and in this lambda-abstraction, we do not give a type. Or combinations of these.

If you want a format for both interoperability and readability, I think your Lisp proposal is going to get the worst of both worlds: 1) readability is worse than with the current syntax; 2) it adds overhead to parsing.

Are we looking for an interoperability format which is both "readable" and "fast" to parse? This sounds a bit weird to me because we know from experience that those two goals pull in opposite directions.

From what I understand, the interoperability format does not need to be readable. What we need is to be able to obtain readable terms from this format, which is a different requirement, and the cost of that is computation.

In the same way that we never read terms generated by Coq, we never read the interoperability format. Likewise, we read OCaml code and never the generated assembly code.

However, for debugging purposes, it may be necessary to read it. But in that case, the reader would be someone who knows what they are doing. This is why I think JSON is a not-so-bad trade-off for that, assuming all the information is explicit.

Regarding your benchmark, you gave no units and did not describe what was type-checked, which makes it hard to validate your claim. I also tend to believe that your dataset is biased, because you have taken libraries that are known to involve little computation:

The Matita library exported by Ali Assaf almost never uses computation (hence the times you observe). One lemma requires a lot of computation in Dedukti because of a bad reduction strategy. This also explains why you only have a x3 factor.
The same remark applies to the two other libraries (the only computation involved should be beta-reduction).

The encoding in Dedukti requires some computation which is very easy and linear with respect to the size of the term. Hence, I am not surprised by those results.

I know there are libraries out there which rely a lot more on computation (mathcomp, for example), for which the parsing time may be an order of magnitude (or two) less than the computation time.

Overall, we can expect parsing time to be linear in the size of the input, which is not the case for computation. Nonetheless, some care needs to be taken with parsing. I much prefer thinking about the users of the standard.

My experience with Dedukti is that the current syntax is already hard to debug, so using JSON or lisp will make things worse. So instead of having one syntax, maybe having two is the best trade-off if we want to have one format which is efficient for parsing and another one which is easier to read.
