Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Characters #3

Open
01mf02 opened this issue Jul 7, 2022 · 21 comments
Open

Characters #3

01mf02 opened this issue Jul 7, 2022 · 21 comments
Labels

Comments

@01mf02
Copy link
Collaborator

01mf02 commented Jul 7, 2022

The "Syntax" section states: "The standard characters for the language are defined by Unicode standard annex 31".

I have tried to find out what this means, and I failed. Some intuition would be nice to provide here. Does this mean something like [a-zA-Z0-9]? Or does this really mean all of Unicode?

I would propose something like this: An identifier can be any Unicode character, except for a set of reserved characters (:, ., [, ...).

@gabrielhdt
Copy link
Member

I agree that this annex 31 is slightly obscure, but it consists in the set of unicode charcters that are suitable for programming (it excludes greek comma which looks like semi colon and other nasty glyphs). IINM, Rust and C++ use that set of characters.

@01mf02
Copy link
Collaborator Author

01mf02 commented Jul 7, 2022

Ok, so to get some understanding: Which of the following characters does annex 31 include?

  1. (
  2. 😀
  3. Ä
  4. ?
  5. A
  6. :
  7. (GitHub renders this in a funky way, let's try again: y̆)

@gabrielhdt
Copy link
Member

As a confirmed rustacean you may use this library https://github.com/dtolnay/unicode-ident :) My guess is that (1) is not included, (2) is included, (3) I don't know, (4) is included, (5) is included, (6) is not included (to be checked), (7) is included.

@francoisthire
Copy link
Collaborator

francoisthire commented Jul 8, 2022

I understand the rationale for this, but for the moment, the target of this standard are programming languages such as Coq, Agda, HOL. Shouldn't we try to focus on those languages before starting to check whether we are compatible with Rust/C++ which AFAIK, are not concerned by this standard? What is the benefit of being compatible with Rust/C++ ?

@gabrielhdt
Copy link
Member

I'm not saying we should be compatible with Rust and C++, I'm rather saying that some people (the ISO people) have thought hard about which unicode character should be used in a programming language, and apparently it's standard enough so that Rust and C++ use that standard, so why bother and choose another class of characters for Dedukti?

@francoisthire
Copy link
Collaborator

francoisthire commented Jul 8, 2022

As mentioned here: #2 (comment) the point of a standard is also to be adopted. So I would suggest that we use the class of characters that will ease the adoption of the standard. If we are too restrictive for whatever reason, then we won't be adopted. But not only, if people, especially as developers find the standard too difficult, or too hard to understand, they might not want use it.

AFAIK, people concerned by this standard are mostly researchers, not software engineers which are paid to understand complicated stuff like Unicode standard annex 31.

I don't know whether we should adopt a new class, but I don't see the benefit of having the same class of characters as Rust or C++ for this standard. But I clearly see some issues with it:

  • Complexity to write a parser compatible with it
  • Understand which characters are allowed or not when a printer is written

@gabrielhdt
Copy link
Member

Your two points make perfect sense to me. On the other hand, unicode is about allowing non english speakers to use the characters they like, so it's kind of a tradeoff implementation v "ethics"?

I propose to adopt a restricted class for now (say ascii), and we can later on extend that class without problem, when the annex becomes widespread enough. I think there is no downside to this approach, since we can extend the class of characters without breaking compatibility.

@francoisthire
Copy link
Collaborator

Indeed, extending a class is something easier to do than the opposite.

Something that can help the adoption is when a class of characters is well supported by the various tools. Maybe it is not the case today and it will be later on?

Your two points make perfect sense to me. On the other hand, unicode is about allowing non english speakers to use the characters they like, so it's kind of a tradeoff implementation v "ethics"?

This problem is already there everywhere. But I kind of disagree a priori. For performances reasons, there is an interest to not encode characters on x bytes because we have to include all the characters in the world. Instead, a subset needs to be chosen and currently we use mostly Latin alphabet. Ok, you can say it is a bit discriminatory, but we are both talking in english while both are mother tongue is french. This is a bit the same thing, we need to find a common set of letters that everyone can use, and for the moment Latin alphabet is quite well spread.

As you said, it is easy to extend if at some point the necessity arises.

@francoisthire
Copy link
Collaborator

(2) is included

Is it? https://www.reddit.com/r/rust/comments/vkjlxc/media_cute_compiler_easter_egg/

@francoisthire
Copy link
Collaborator

Given the complexity of the annex, I would suggest that:

  • By default, only a simple regexp allowing ASCII characters and some other characters is allowed
  • Unicode is allowed only using the special syntax {| |}

We forbid |} as a valid name using the special syntax.

@01mf02
Copy link
Collaborator Author

01mf02 commented Jul 15, 2022

I also think that it's better to start out more restrictively, that is, no full Unicode for now.

@01mf02
Copy link
Collaborator Author

01mf02 commented Jul 15, 2022

(2) is included

Is it? https://www.reddit.com/r/rust/comments/vkjlxc/media_cute_compiler_easter_egg/

Are you by any chance a regular reader of the Rust Reddit? :)

@01mf02
Copy link
Collaborator Author

01mf02 commented Jul 15, 2022

There is a semantical question on the nature of {|...|}: Should we treat {|abc|} equal to simply abc? Or are these two distinct?

If we treat {|...|} simply as an "escaping mechanism", as done in the module proposal of François, then I believe that indeed, {|abc|} should be equal to abc.

@francoisthire
Copy link
Collaborator

If we treat {|...|} simply as an "escaping mechanism", as done in Deducteam/Dedukti#191 (comment), then I believe that indeed, {|abc|} should be equal to abc.

I have the impression that there is a rule that if you have two different ways to write the same things, it is generally not a good idea. However, I agree that in #191, I implicitly says that if I write:

#REQUIRE A.{|B|} as {|B|}

Then the file is assumed to be named B while no commitment is made on all the futures references to identifiers coming from file A/B.dk. We could use either {|B|}.foo or B.foo. However, the second one requires to make {|B|} and B equal while the first one works if they are separated.

I think that my insight is that we should treat {|B|} and B as two separated constants. But maybe there are better arguments to make them equal?

@gabrielhdt
Copy link
Member

I'd say that making {|B|} different from B puts the responsibility of handling properly escaping on the user, while the contrary puts the responsibility on the implementation. I have no clear idea which one is better, though.

@gabrielhdt
Copy link
Member

Performance-wise, it seems that to identify {|B|} and B, B must be scanned to see if it can be coerced to a regular identifier.

@francoisthire
Copy link
Collaborator

I'd say that making {|B|} different from B puts the responsibility of handling properly escaping on the user, while the contrary puts the responsibility on the implementation. I have no clear idea which one is better, though.

A user printing into the standard needs to know when to put {| |} in any case (for example it needs to check whether the identifier will contain Unicode characters). I think for consistency, if the user writes a printer, he will do this way:

let sanitize name = if needs_to_be_sanitized name then Format.asprintf "{|%s|}" name else name

And use the function sanitize everywhere. So for the user printing stuff into the standard, I have the impression it would not change much whether those names are identified or not.

@francoisthire
Copy link
Collaborator

Performance-wise, it seems that to identify {|B|} and B, B must be scanned to see if it can be coerced to a regular identifier.

This depends on the internal representation of identifiers. If {|B|} and B are equated, then we could simply removed {| and |} in the internal representation, no?

@gabrielhdt
Copy link
Member

In that case, wouldn't a parser/lexer first scan using the regex for {| ... |} and then look if they could be removed? If it is the case, it results into two scan of each {| ... |}. That said, these considerations may be nothing more than anticipated optimisation.

@01mf02
Copy link
Collaborator Author

01mf02 commented Jul 15, 2022

I can say that it is quite easy to equate {|B|} and B in a performant fashion in the parser.

Other question: should we be able to nest à la {|B {|C|} |}, as is currently possible for comments?

@francoisthire
Copy link
Collaborator

I can say that it is quite easy to equate {|B|} and B in a performant fashion in the parser.

Is there any argument to equate those two names? Because I am not sure we should.

should we be able to nest à la {|B {|C|} |}, as is currently possible for comments?

I would say no, except if there is a good reason for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants