Add BWTC32Key-generated BOM+CJK test string #258

Open
stgiga opened this issue Aug 12, 2023 · 0 comments


stgiga commented Aug 12, 2023

I wrote a program called BWTC32Key (http://b3k.sourceforge.io) that compresses its input, encrypts it, and then encodes the result as Base32768: each Plane 0 (BMP) Unicode character stores 15 bits of data, which is 15/16, or about 94%, efficient in UTF-16. So that its .B3K files open as text documents in text editors, the first character of every file is U+FEFF.

The special part is where the 32768 data characters come from: U+3400-U+4CFF (6400 characters), U+4E00-U+9EFF (20736 characters), and U+AC00-U+C1FF (5632 characters). U+C200 serves as the equivalent of the equals sign in Base64 (except that only one is ever needed), U+4D00 is the second half of the header (U+FEFF is the first half), and U+4D01 is the terminator. U+3400-U+4D01 all lie in "CJK Unified Ideographs Extension A", U+4E00-U+9EFF is the first 20736 characters of the "CJK Unified Ideographs" block, and U+AC00-U+C200 is the first 5633 characters of the "Hangul Syllables" block.
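The alphabet described above can be sketched in a few lines of Python. This is only an illustration and not BWTC32Key's actual code: the three ranges and the special code points come from the description, but the order in which 15-bit values are assigned to the ranges, and the names `value_to_char` and `char_to_value`, are assumptions.

```python
# Data ranges as described in the post. The sequential assignment of
# 15-bit values across them is an ASSUMPTION made for this sketch.
DATA_RANGES = [
    (0x3400, 0x4CFF),  # CJK Unified Ideographs Extension A: 6400 code points
    (0x4E00, 0x9EFF),  # CJK Unified Ideographs: 20736 code points
    (0xAC00, 0xC1FF),  # Hangul Syllables: 5632 code points
]

# Special code points named in the post.
BOM = "\uFEFF"            # first half of the header
HEADER_SECOND = "\u4D00"  # second half of the header
TERMINATOR = "\u4D01"
PADDING = "\uC200"        # plays the role of '=' in Base64


def alphabet_size() -> int:
    """Total number of data characters across the three ranges."""
    return sum(hi - lo + 1 for lo, hi in DATA_RANGES)


def value_to_char(value: int) -> str:
    """Map a 15-bit value to a character (hypothetical ordering)."""
    if not 0 <= value < 32768:
        raise ValueError("value must fit in 15 bits")
    for lo, hi in DATA_RANGES:
        span = hi - lo + 1
        if value < span:
            return chr(lo + value)
        value -= span
    raise AssertionError("unreachable: ranges cover all 15-bit values")


def char_to_value(ch: str) -> int:
    """Inverse of value_to_char under the same assumed ordering."""
    cp = ord(ch)
    offset = 0
    for lo, hi in DATA_RANGES:
        if lo <= cp <= hi:
            return offset + (cp - lo)
        offset += hi - lo + 1
    raise ValueError(f"U+{cp:04X} is not a data character")


# 15 data bits per 16-bit UTF-16 code unit.
UTF16_EFFICIENCY = 15 / 16
```

The three range sizes add up to exactly 2^15, which is why every BMP character in the alphabet can carry 15 bits.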

So the program's output strings are Unicode strings containing many rare CJKV ideographs that are not in modern use (especially those from Extension A, and a significant chunk of the main CJK Unified Ideographs block), intermixed with Hangul, with no spaces or punctuation apart from U+FEFF at the beginning. I've tried using one as a password, but even without U+FEFF, no site I've found will accept it, and they also reject long passwords. I've ended up having to apply UTF-7 to the program's output, truncate the result to fit each site's length limit, and feed that to the site instead. Given that outcome, I think strings like these would be an excellent addition to the BLNS. Here is a 105-character string from the program, made from a short drum MIDI I wrote:
BLNS-entry.txt
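For reference, the UTF-7 workaround mentioned above could look roughly like this in Python, using the standard `utf-7` codec. The function name `to_ascii_password` and the `max_len` parameter are hypothetical, and real sites impose their own limits.

```python
def to_ascii_password(b3k_text: str, max_len: int) -> str:
    """Turn a Base32768 string into a truncated ASCII-only string.

    Hypothetical helper: drops a leading U+FEFF if present, encodes the
    rest as UTF-7, and cuts the result to at most max_len characters.
    """
    if b3k_text.startswith("\uFEFF"):
        b3k_text = b3k_text[1:]
    ascii_form = b3k_text.encode("utf-7").decode("ascii")
    return ascii_form[:max_len]
```

Note that truncation can cut through the middle of a UTF-7 base64 run, so a truncated result will not necessarily decode back to valid text. That does not matter when it is only used as an opaque ASCII password.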

Of course, you can also feed BWTC32Key a zero-byte file if you want a shorter string. That one is only 9 characters long (or 8 without U+FEFF).
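Code that consumes such a test string may want a quick way to check that it has the expected shape. Here is a rough sketch using Python's `re` module. It assumes the layout is an optional U+FEFF, then U+4D00, then any number of characters from the three data ranges or the U+C200 padding character, then U+4D01 as the last character. The exact position of the padding character is not stated above, so this check deliberately allows it anywhere in the body, and the name `looks_like_b3k` is made up for illustration.

```python
import re

# Assumed layout: [U+FEFF] U+4D00 <data or padding>* U+4D01
# The character class spans the three data ranges plus U+C200 (padding).
B3K_SHAPE = re.compile(
    "\uFEFF?\u4D00"
    "[\u3400-\u4CFF\u4E00-\u9EFF\uAC00-\uC200]*"
    "\u4D01"
)


def looks_like_b3k(text: str) -> bool:
    """Return True if text matches the assumed .B3K string layout."""
    return B3K_SHAPE.fullmatch(text) is not None
```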

My homemade fork of GNU Unifont (UnifontEX, available here: http://github.com/stgiga/UnifontEX) can display basically everything in the BLNS, which stock Unifont cannot.

I think you can gather from all of this that I am a Unicode geek. I hope more sites learn to handle Unicode properly, especially in password boxes, since it is getting easier to quickly crack 8-character ASCII passwords if you have the money for a fully loaded multi-RTX 4090 rig with an i9 or Threadripper. I have also run tests on these passwords other than brute force, and they hold up well. Hopefully adding these strings to the BLNS can help improve password-handling code.
