-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Characters #3
Comments
I agree that this annex 31 is slightly obscure, but it consists in the set of unicode charcters that are suitable for programming (it excludes greek comma which looks like semi colon and other nasty glyphs). IINM, Rust and C++ use that set of characters. |
Ok, so to get some understanding: Which of the following characters does annex 31 include?
|
As a confirmed rustacean you may use this library https://github.com/dtolnay/unicode-ident :) My guess is that (1) is not included, (2) is included, (3) I don't know, (4) is included, (5) is included, (6) is not included (to be checked), (7) is included. |
I understand the rationale for this, but for the moment, the target of this standard are programming languages such as Coq, Agda, HOL. Shouldn't we try to focus on those languages before starting to check whether we are compatible with Rust/C++ which AFAIK, are not concerned by this standard? What is the benefit of being compatible with Rust/C++ ? |
I'm not saying we should be compatible with Rust and C++, I'm rather saying that some people (the ISO people) have thought hard about which unicode character should be used in a programming language, and apparently it's standard enough so that Rust and C++ use that standard, so why bother and choose another class of characters for Dedukti? |
As mentioned here: #2 (comment) the point of a standard is also to be adopted. So I would suggest that we use the class of characters that will ease the adoption of the standard. If we are too restrictive for whatever reason, then we won't be adopted. But not only, if people, especially as developers find the standard too difficult, or too hard to understand, they might not want use it. AFAIK, people concerned by this standard are mostly researchers, not software engineers which are paid to understand complicated stuff like Unicode standard annex 31. I don't know whether we should adopt a new class, but I don't see the benefit of having the same class of characters as Rust or C++ for this standard. But I clearly see some issues with it:
|
Your two points make perfect sense to me. On the other hand, unicode is about allowing non english speakers to use the characters they like, so it's kind of a tradeoff implementation v "ethics"? I propose to adopt a restricted class for now (say ascii), and we can later on extend that class without problem, when the annex becomes widespread enough. I think there is no downside to this approach, since we can extend the class of characters without breaking compatibility. |
Indeed, extending a class is something easier to do than the opposite. Something that can help the adoption is when a class of characters is well supported by the various tools. Maybe it is not the case today and it will be later on?
This problem is already there everywhere. But I kind of disagree a priori. For performances reasons, there is an interest to not encode characters on As you said, it is easy to extend if at some point the necessity arises. |
Is it? https://www.reddit.com/r/rust/comments/vkjlxc/media_cute_compiler_easter_egg/ |
Given the complexity of the annex, I would suggest that:
We forbid |
I also think that it's better to start out more restrictively, that is, no full Unicode for now. |
Are you by any chance a regular reader of the Rust Reddit? :) |
There is a semantical question on the nature of If we treat |
I have the impression that there is a rule that if you have two different ways to write the same things, it is generally not a good idea. However, I agree that in #191, I implicitly says that if I write:
Then the file is assumed to be named I think that my insight is that we should treat |
I'd say that making |
Performance-wise, it seems that to identify |
A user printing into the standard needs to know when to put
And use the function |
This depends on the internal representation of identifiers. If |
In that case, wouldn't a parser/lexer first scan using the regex for |
I can say that it is quite easy to equate Other question: should we be able to nest à la |
Is there any argument to equate those two names? Because I am not sure we should.
I would say no, except if there is a good reason for that. |
The "Syntax" section states: "The standard characters for the language are defined by Unicode standard annex 31".
I have tried to find out what this means, and I failed. Some intuition would be nice to provide here. Does this mean something like
[a-zA-Z0-9]
? Or does this really mean all of Unicode?I would propose something like this: An identifier can be any Unicode character, except for a set of reserved characters (
:
,.
,[
, ...).The text was updated successfully, but these errors were encountered: