More (useless) code point normalization #2

Open
Artoria2e5 opened this issue Sep 19, 2016 · 0 comments

Artoria2e5 commented Sep 19, 2016

  • Big5-HKSCS EUDA→PUA: https://github.com/stanfordnlp/CoreNLP/wiki/Chinese-Private-Use-Area-code-points (the script there is admittedly crappy). A rough sketch of the remapping follows the list.
  • Reverse the Source Separation Rule: treat the compatibility ideographs as if they were ordinary CJK extension blocks and apply regular Unicode normalization, which folds the compatibility code points into their unified counterparts (sketch after the list). Might be interesting for feeding the LM text from different CJK languages to obtain a blocky soup...
    • Idea: we can let the bot generate pseudo-pseudo-Chinese (偽偽中国語, a.k.a. zh@face_white) by feeding it Japanese text with the kana stripped out alongside normal Chinese (sketch after the list). This normalization may help the LM stir things up further by making character equivalence explicit, although some extra intervention to revert Han simplification in both scripts would help even more.
    • Somewhat similar to unifying zh-Hans and zh-Hant with OpenCC (sketch after the list), but more ambitious and playful.
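
For the Big5-HKSCS item, a minimal sketch of what the folding could look like, assuming the PUA↔standard pairs from the linked CoreNLP wiki have been dumped into a two-column `U+XXXX<TAB>U+YYYY` file (`pua_map.tsv` is a hypothetical name, not something the wiki ships):

```python
import csv

def load_pua_map(path="pua_map.tsv"):
    """Read 'U+XXXX<TAB>U+YYYY' pairs into a str.translate() table."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for pua, std in csv.reader(f, delimiter="\t"):
            # Map the private-use code point to its standard equivalent.
            table[int(pua[2:], 16)] = chr(int(std[2:], 16))
    return table

def fold_pua(text, table):
    return text.translate(table)
```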
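On the compatibility-ideograph point, the folding is essentially what Unicode normalization already does, since the CJK Compatibility Ideographs carry canonical decompositions back to their unified counterparts; a minimal sketch:

```python
import unicodedata

def fold_compat(text):
    # NFKC maps CJK Compatibility Ideographs back to the unified code
    # points (e.g. U+F900 -> U+8C48 豈), undoing the Source Separation
    # Rule; it also folds other compatibility characters such as
    # fullwidth forms, which may or may not be wanted here.
    return unicodedata.normalize("NFKC", text)

assert fold_compat("\uF900") == "\u8C48"
```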
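For the pseudo-pseudo-Chinese corpus, a rough sketch of the kana stripping (block ranges only; whether to also drop half-width katakana or Japanese punctuation is a judgement call):

```python
import re

# Hiragana, Katakana (including the prolonged sound mark) and the
# Katakana Phonetic Extensions blocks.
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF]")

def strip_kana(text):
    return KANA.sub("", text)

# e.g. strip_kana("偽の中国語です") -> "偽中国語"
```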
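The OpenCC half is the easy part; a sketch assuming the opencc-python-reimplemented package, where config names like 't2s' select the conversion direction:

```python
from opencc import OpenCC  # pip install opencc-python-reimplemented

t2s = OpenCC("t2s")  # Traditional -> Simplified
print(t2s.convert("漢字"))  # -> 汉字
```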