treat Chinese punctuation marks as word separator #10853

Open
zhangzq opened this issue May 6, 2024 · 2 comments
Labels
enhancement (a request to enhance doxygen, not a bug)

Comments

@zhangzq

zhangzq commented May 6, 2024

Describe the bug
Chinese punctuation marks are not recognized as word separators.

def fun1(one, *two):
    pass

## 同 #fun1。
# 同 #fun1 。   
# same as #fun1.
def fun2(one, *, two):
    pass

The first #fun1 is not recognized. I can add a space before 。, but then a space also shows up before 。 in the generated output, which looks ugly.

Screenshots

(screenshot attached)

To Reproduce

As above, with the default Doxyfile.

Expected behavior

Treat #fun1。 like #fun1.

Not only 。, but all Chinese punctuation marks, like ，：（）？！“”, etc.
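For illustration, a minimal sketch (plain Python, not doxygen's actual parser) of what treating these marks as word separators would mean: if they are part of the separator set, a reference like #fun1 is still isolated even when 。 follows it directly.

import re

# Hypothetical separator set for this sketch only: whitespace, ASCII sentence
# punctuation, and the Chinese marks listed above.
CJK_PUNCT = "。，：（）？！“”"
WORD_SEP = re.compile(r"[\s.,:()?!" + re.escape(CJK_PUNCT) + r"]+")

print(WORD_SEP.split("同 #fun1。"))      # ['同', '#fun1', ''] -> #fun1 becomes its own word
print(WORD_SEP.split("same as #fun1."))  # ['same', 'as', '#fun1', '']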

Version
1.11.0 (696ee5ef4954fb8a098c028ff7df0ddc72dfe83d*)

@albert-github added the enhancement (a request to enhance doxygen, not a bug) label on May 6, 2024
@albert-github
Collaborator

This is quite a difficult but interesting problem, especially to implement for people who don't speak the languages with different punctuation rules.

Some thoughts / questions:

This means that in a lot of places the different punctuation systems would have to be known, as currently the entire range \x80-\xFF is usually included in e.g. the lexer rules (and in other places). As a small example:

ID        [$a-z_A-Z\x80-\xFF][$a-z_A-Z0-9\x80-\xFF]*

meaning that for an ID (e.g. a variable or function name, but not limited to that) the ASCII characters $, _, a-z, A-Z and all non-ASCII characters can be used as the first character, and for the following characters digits are allowed as well, besides the characters just mentioned.
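As a quick check (plain Python, independent of doxygen): every byte of the UTF-8 encoding of these Chinese punctuation marks lies in the \x80-\xFF range that the ID rule above accepts, which would explain why the lexer currently keeps them glued to the preceding word.

# Print the UTF-8 bytes of some Chinese punctuation marks; each byte is >= 0x80.
for ch in "。，、：（）？！":
    print(ch, [hex(b) for b in ch.encode("utf-8")])
# e.g. 。 -> ['0xe3', '0x80', '0x82']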

As cutting / pasting of non-ASCII characters is quite tricky:

  • Can you please attach a small, self-contained example (source + configuration file in a compressed tar or zip file!) that allows us to reproduce the problem? Please don't add external links as they might not be persistent (references to GitHub repositories are also considered non-persistent).

The page https://en.wikipedia.org/wiki/Chinese_punctuation probably already gives a good overview.

  • Can you confirm this?

On the mentioned Wikipedia page I see the enumeration comma ( 、 ). In the function trWriteList (translator_cn.h) I see that the "normal" comma , is used. Wouldn't it be better to use the enumeration comma there?
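Just to make the difference concrete (a sketch, not the actual trWriteList code in translator_cn.h):

items = ["ClassA", "ClassB", "ClassC"]
print(", ".join(items))   # current style with the normal comma: ClassA, ClassB, ClassC
print("、".join(items))   # with the enumeration comma instead:  ClassA、ClassB、ClassC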

In the function DefinitionImpl::_setBriefDescription (definition.cpp) we see that the function needsPunctuation() is used, and for Chinese the returned value is false, so no automatic (Chinese) full stop is added.

  • Should, for Chinese, a Chinese full stop also be added when the sentence does not end with one of the Chinese punctuation characters equivalent to the characters mentioned there?
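A minimal sketch of that idea (assumed logic, not the actual _setBriefDescription code), where the set of sentence-ending characters is simply extended with the Chinese equivalents:

# Assumed set of sentence-ending marks; the real list would come from the
# translator for the configured language.
SENTENCE_END = (".", "!", "?", "。", "！", "？")

def add_full_stop(brief: str, full_stop: str = "。") -> str:
    # Append the language's full stop only when the brief description does
    # not already end in a sentence-ending mark.
    brief = brief.rstrip()
    if brief and not brief.endswith(SENTENCE_END):
        brief += full_stop
    return brief

print(add_full_stop("同 fun1"))    # -> 同 fun1。
print(add_full_stop("同 fun1。"))  # unchanged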

@zhangzq
Author

zhangzq commented May 7, 2024

The wiki page is useful; the punctuation marks mentioned there should be treated as word stops. These characters can't be part of variable/function names.

I think variable/function/etc. names shouldn't contain non-ASCII characters. I know some languages support non-ASCII names, but nobody uses them in serious programming.
