Characters #3

01mf02 · 2022-07-07T10:34:37Z

The "Syntax" section states: "The standard characters for the language are defined by Unicode standard annex 31".

I have tried to find out what this means, and I failed. Some intuition would be nice to provide here. Does this mean something like [a-zA-Z0-9]? Or does this really mean all of Unicode?

I would propose something like this: An identifier can be any Unicode character, except for a set of reserved characters (:, ., [, ...).

The text was updated successfully, but these errors were encountered:

gabrielhdt · 2022-07-07T11:44:35Z

I agree that this annex 31 is slightly obscure, but it consists in the set of unicode charcters that are suitable for programming (it excludes greek comma which looks like semi colon and other nasty glyphs). IINM, Rust and C++ use that set of characters.

01mf02 · 2022-07-07T12:01:58Z

Ok, so to get some understanding: Which of the following characters does annex 31 include?

(
😀
Ä
?
A
:
y̆ (GitHub renders this in a funky way, let's try again: y̆)

gabrielhdt · 2022-07-07T14:59:56Z

As a confirmed rustacean you may use this library https://github.com/dtolnay/unicode-ident :) My guess is that (1) is not included, (2) is included, (3) I don't know, (4) is included, (5) is included, (6) is not included (to be checked), (7) is included.

francoisthire · 2022-07-08T07:04:54Z

I understand the rationale for this, but for the moment, the target of this standard are programming languages such as Coq, Agda, HOL. Shouldn't we try to focus on those languages before starting to check whether we are compatible with Rust/C++ which AFAIK, are not concerned by this standard? What is the benefit of being compatible with Rust/C++ ?

gabrielhdt · 2022-07-08T07:21:20Z

I'm not saying we should be compatible with Rust and C++, I'm rather saying that some people (the ISO people) have thought hard about which unicode character should be used in a programming language, and apparently it's standard enough so that Rust and C++ use that standard, so why bother and choose another class of characters for Dedukti?

francoisthire · 2022-07-08T07:27:09Z

As mentioned here: #2 (comment) the point of a standard is also to be adopted. So I would suggest that we use the class of characters that will ease the adoption of the standard. If we are too restrictive for whatever reason, then we won't be adopted. But not only, if people, especially as developers find the standard too difficult, or too hard to understand, they might not want use it.

AFAIK, people concerned by this standard are mostly researchers, not software engineers which are paid to understand complicated stuff like Unicode standard annex 31.

I don't know whether we should adopt a new class, but I don't see the benefit of having the same class of characters as Rust or C++ for this standard. But I clearly see some issues with it:

Complexity to write a parser compatible with it
Understand which characters are allowed or not when a printer is written

gabrielhdt · 2022-07-08T07:46:25Z

Your two points make perfect sense to me. On the other hand, unicode is about allowing non english speakers to use the characters they like, so it's kind of a tradeoff implementation v "ethics"?

I propose to adopt a restricted class for now (say ascii), and we can later on extend that class without problem, when the annex becomes widespread enough. I think there is no downside to this approach, since we can extend the class of characters without breaking compatibility.

francoisthire · 2022-07-08T08:44:35Z

Indeed, extending a class is something easier to do than the opposite.

Something that can help the adoption is when a class of characters is well supported by the various tools. Maybe it is not the case today and it will be later on?

Your two points make perfect sense to me. On the other hand, unicode is about allowing non english speakers to use the characters they like, so it's kind of a tradeoff implementation v "ethics"?

This problem is already there everywhere. But I kind of disagree a priori. For performances reasons, there is an interest to not encode characters on x bytes because we have to include all the characters in the world. Instead, a subset needs to be chosen and currently we use mostly Latin alphabet. Ok, you can say it is a bit discriminatory, but we are both talking in english while both are mother tongue is french. This is a bit the same thing, we need to find a common set of letters that everyone can use, and for the moment Latin alphabet is quite well spread.

As you said, it is easy to extend if at some point the necessity arises.

francoisthire · 2022-07-08T11:09:10Z

(2) is included

Is it? https://www.reddit.com/r/rust/comments/vkjlxc/media_cute_compiler_easter_egg/

francoisthire · 2022-07-09T07:57:32Z

Given the complexity of the annex, I would suggest that:

By default, only a simple regexp allowing ASCII characters and some other characters is allowed
Unicode is allowed only using the special syntax {| |}

We forbid |} as a valid name using the special syntax.

01mf02 · 2022-07-15T08:03:39Z

I also think that it's better to start out more restrictively, that is, no full Unicode for now.

01mf02 · 2022-07-15T08:04:52Z

(2) is included

Is it? https://www.reddit.com/r/rust/comments/vkjlxc/media_cute_compiler_easter_egg/

Are you by any chance a regular reader of the Rust Reddit? :)

01mf02 · 2022-07-15T08:05:54Z

There is a semantical question on the nature of {|...|}: Should we treat {|abc|} equal to simply abc? Or are these two distinct?

If we treat {|...|} simply as an "escaping mechanism", as done in the module proposal of François, then I believe that indeed, {|abc|} should be equal to abc.

francoisthire · 2022-07-15T08:49:29Z

If we treat {|...|} simply as an "escaping mechanism", as done in Deducteam/Dedukti#191 (comment), then I believe that indeed, {|abc|} should be equal to abc.

I have the impression that there is a rule that if you have two different ways to write the same things, it is generally not a good idea. However, I agree that in #191, I implicitly says that if I write:

#REQUIRE A.{|B|} as {|B|}

Then the file is assumed to be named B while no commitment is made on all the futures references to identifiers coming from file A/B.dk. We could use either {|B|}.foo or B.foo. However, the second one requires to make {|B|} and B equal while the first one works if they are separated.

I think that my insight is that we should treat {|B|} and B as two separated constants. But maybe there are better arguments to make them equal?

gabrielhdt · 2022-07-15T10:04:35Z

I'd say that making {|B|} different from B puts the responsibility of handling properly escaping on the user, while the contrary puts the responsibility on the implementation. I have no clear idea which one is better, though.

gabrielhdt · 2022-07-15T10:06:56Z

Performance-wise, it seems that to identify {|B|} and B, B must be scanned to see if it can be coerced to a regular identifier.

francoisthire · 2022-07-15T10:12:20Z

I'd say that making {|B|} different from B puts the responsibility of handling properly escaping on the user, while the contrary puts the responsibility on the implementation. I have no clear idea which one is better, though.

A user printing into the standard needs to know when to put {| |} in any case (for example it needs to check whether the identifier will contain Unicode characters). I think for consistency, if the user writes a printer, he will do this way:

let sanitize name = if needs_to_be_sanitized name then Format.asprintf "{|%s|}" name else name

And use the function sanitize everywhere. So for the user printing stuff into the standard, I have the impression it would not change much whether those names are identified or not.

francoisthire · 2022-07-15T10:14:10Z

Performance-wise, it seems that to identify {|B|} and B, B must be scanned to see if it can be coerced to a regular identifier.

This depends on the internal representation of identifiers. If {|B|} and B are equated, then we could simply removed {| and |} in the internal representation, no?

gabrielhdt · 2022-07-15T10:24:23Z

In that case, wouldn't a parser/lexer first scan using the regex for {| ... |} and then look if they could be removed? If it is the case, it results into two scan of each {| ... |}. That said, these considerations may be nothing more than anticipated optimisation.

01mf02 · 2022-07-15T16:42:13Z

I can say that it is quite easy to equate {|B|} and B in a performant fashion in the parser.

Other question: should we be able to nest à la {|B {|C|} |}, as is currently possible for comments?

francoisthire · 2022-07-15T16:45:22Z

I can say that it is quite easy to equate {|B|} and B in a performant fashion in the parser.

Is there any argument to equate those two names? Because I am not sure we should.

should we be able to nest à la {|B {|C|} |}, as is currently possible for comments?

I would say no, except if there is a good reason for that.

francoisthire added the 2.0 label Oct 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Characters #3

Characters #3

01mf02 commented Jul 7, 2022

gabrielhdt commented Jul 7, 2022

01mf02 commented Jul 7, 2022 •

edited

Loading

gabrielhdt commented Jul 7, 2022

francoisthire commented Jul 8, 2022 •

edited

Loading

gabrielhdt commented Jul 8, 2022

francoisthire commented Jul 8, 2022 •

edited

Loading

gabrielhdt commented Jul 8, 2022

francoisthire commented Jul 8, 2022

francoisthire commented Jul 8, 2022

francoisthire commented Jul 9, 2022

01mf02 commented Jul 15, 2022

01mf02 commented Jul 15, 2022 •

edited

Loading

01mf02 commented Jul 15, 2022 •

edited

Loading

francoisthire commented Jul 15, 2022

gabrielhdt commented Jul 15, 2022

gabrielhdt commented Jul 15, 2022

francoisthire commented Jul 15, 2022

francoisthire commented Jul 15, 2022

gabrielhdt commented Jul 15, 2022

01mf02 commented Jul 15, 2022

francoisthire commented Jul 15, 2022

Characters #3

Characters #3

Comments

01mf02 commented Jul 7, 2022

gabrielhdt commented Jul 7, 2022

01mf02 commented Jul 7, 2022 • edited Loading

gabrielhdt commented Jul 7, 2022

francoisthire commented Jul 8, 2022 • edited Loading

gabrielhdt commented Jul 8, 2022

francoisthire commented Jul 8, 2022 • edited Loading

gabrielhdt commented Jul 8, 2022

francoisthire commented Jul 8, 2022

francoisthire commented Jul 8, 2022

francoisthire commented Jul 9, 2022

01mf02 commented Jul 15, 2022

01mf02 commented Jul 15, 2022 • edited Loading

01mf02 commented Jul 15, 2022 • edited Loading

francoisthire commented Jul 15, 2022

gabrielhdt commented Jul 15, 2022

gabrielhdt commented Jul 15, 2022

francoisthire commented Jul 15, 2022

francoisthire commented Jul 15, 2022

gabrielhdt commented Jul 15, 2022

01mf02 commented Jul 15, 2022

francoisthire commented Jul 15, 2022

01mf02 commented Jul 7, 2022 •

edited

Loading

francoisthire commented Jul 8, 2022 •

edited

Loading

francoisthire commented Jul 8, 2022 •

edited

Loading

01mf02 commented Jul 15, 2022 •

edited

Loading

01mf02 commented Jul 15, 2022 •

edited

Loading