UTF-8
UTF-8 ==> #Unicode encoding.
#Pros:
# - ASCII-compatible
# - ASCII characters require only 1 byte
#Cons:
# - U+0800-U+ffff require 3 bytes (2 in UTF-16)
# - Slower to encode|decode than fixed-width encodings
# - Human cannot easily guess codepoint from bytes
#Unix (except Mac OS X) uses UTF-8
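#The size trade-off above can be checked with a minimal Python sketch.
#The sample characters and the helper name byte_lengths are illustrative
#choices, not part of the original notes.

```python
# Sample characters, one per UTF-8 length class (chosen for illustration).
SAMPLES = {
    "A": 0x41,              # ASCII
    "\u00e9": 0xE9,         # U+0080-U+07FF
    "\u20ac": 0x20AC,       # U+0800-U+FFFF
    "\U0001f600": 0x1F600,  # above U+FFFF
}


def byte_lengths(char):
    """Return (UTF-8 length, UTF-16 length) in bytes, without any BOM."""
    return len(char.encode("utf-8")), len(char.encode("utf-16-le"))


for char, codepoint in SAMPLES.items():
    assert ord(char) == codepoint
    utf8_len, utf16_len = byte_lengths(char)
    print(f"U+{codepoint:04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```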
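CESU-8 ==> #Unicode encoding mixing UTF-8 and UTF-16. Rarely used.
#Codepoints above U+ffff are split into two UTF-16 surrogates, each encoded as 3 bytes (6 bytes total instead of 4).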
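#A minimal sketch of how CESU-8 differs from UTF-8, assuming Python's
#"surrogatepass" error handler to write each surrogate as 3 bytes.
#cesu8_encode is a hypothetical helper, not a standard library function,
#and it does not accept input that already contains lone surrogates.

```python
def cesu8_encode(text):
    """Encode text as CESU-8 (sketch): UTF-8 up to U+FFFF, surrogate pairs above."""
    out = bytearray()
    for char in text:
        codepoint = ord(char)
        if codepoint < 0x10000:
            # Same bytes as UTF-8.
            out += char.encode("utf-8")
        else:
            # Split into a UTF-16 surrogate pair, then encode each half
            # with the 3-byte UTF-8 pattern.
            offset = codepoint - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


print(cesu8_encode("\U00010400").hex(" "))     # CESU-8: 6 bytes
print("\U00010400".encode("utf-8").hex(" "))   # UTF-8: 4 bytes
```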
ENDIANNESS ==> #Applies to the order of bytes inside a multi-byte code unit (e.g. UTF-16).
#Does not apply to UTF-8: its code units are single bytes, always stored in the same order.
BOM ==> #Byte Order Mark is therefore useless and discouraged.
#Can create issues with shebangs, etc.
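#A small Python sketch of the two points above: byte order changes UTF-16
#output but not UTF-8, and a leading BOM hides a shebang unless it is
#stripped. The sample text and variable names are illustrative.

```python
import codecs

text = "\u20ac"  # EURO SIGN, used as a sample

# UTF-16 code units are 2 bytes wide, so byte order changes the output.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
assert utf16_le == utf16_be[::-1]

# UTF-8 code units are single bytes: only one byte order exists.
utf8 = text.encode("utf-8")

# The UTF-8 BOM is U+FEFF encoded as EF BB BF.
with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\n"

# The plain "utf-8" codec keeps the BOM as U+FEFF,
# so the text no longer starts with "#!".
kept = with_bom.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM if present.
stripped = with_bom.decode("utf-8-sig")

print(utf16_le.hex(" "), utf16_be.hex(" "), utf8.hex(" "))
print(kept.startswith("#!"), stripped.startswith("#!"))
```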
BYTE -> CODEPOINT ==> # 00-7F (U+000000-U+00007f)
# C2-DF 80-BF (U+000080-U+0007ff)
# E0 A0-BF 80-BF (U+000800-U+000fff)
# E1-EF 80-BF 80-BF (U+001000-U+00ffff)
# F0 90-BF 80-BF 80-BF (U+010000-U+03ffff)
# F1-F3 80-BF 80-BF 80-BF (U+040000-U+0fffff)
# F4 80-8F 80-BF 80-BF (U+100000-U+10ffff)
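#The table can be read as a lookup on the first byte. A minimal sketch,
#with two assumptions beyond the table: a first byte of F4 only allows
#80-8F next (anything higher would exceed U+10ffff), and a first byte of ED
#only allows 80-9F next (ED A0-BF would encode the surrogates
#U+d800-U+dfff). Both function names are hypothetical.

```python
def expected_length(first_byte):
    """Return the sequence length announced by a first byte, or None if invalid."""
    if first_byte <= 0x7F:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    return None  # 80-BF (continuation byte), C0-C1, F5-FF


def second_byte_range(first_byte):
    """Return the inclusive (low, high) range allowed for the second byte
    of a multi-byte sequence."""
    special = {
        0xE0: (0xA0, 0xBF),  # lower values would be overlong
        0xED: (0x80, 0x9F),  # assumption: excludes surrogates
        0xF0: (0x90, 0xBF),  # lower values would be overlong
        0xF4: (0x80, 0x8F),  # higher values would exceed U+10FFFF
    }
    return special.get(first_byte, (0x80, 0xBF))


print(expected_length(0xE2), second_byte_range(0xE2))
```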
INVALID BYTES ==> #Any other is invalid:
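# - First byte 80-BF, C0-C1 or F5-FF
# - Wrong next bytes (out of range, or missing)
# - ED A0-BF as first two bytes (U+d800-U+dfff, reserved for UTF-16 surrogates)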
#Depending on the software, invalid bytes are either:
# - rejected (decoding fails)
# - replaced by � (U+FFFD), ? (U+003f), ¿ (U+00bf) or a rectangle with Unicode codepoint number
# - displayed with another charset
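#The three outcomes can be reproduced with Python's built-in codecs.
#A minimal sketch: the sample bytes are illustrative, and Latin-1 stands in
#for "another charset".

```python
invalid = b"a\xffb"  # FF is never valid in UTF-8

# 1. Fail: the default "strict" error handler raises.
try:
    invalid.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# 2. Replace: the "replace" error handler substitutes U+FFFD.
replaced = invalid.decode("utf-8", errors="replace")

# 3. Another charset: Latin-1 maps every byte to a codepoint, so it never fails.
as_latin1 = invalid.decode("latin-1")

print(failed, ascii(replaced), ascii(as_latin1))
```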
BYTE -> CODEPOINT #Notation:
CONVERSION ==> # - ABCD[EF[GH]]: bytes, one letter per hex digit
# - abcd[ef]: Unicode codepoint, one letter per hex digit
#00-7F sequence (1 byte): as is
#C2-DF sequence (2 bytes):
# a: 0
# b: (A-12)*4 + Floor(B/4)
# c: (B%4)*4 + (C-8)
# d: D
#E0-EF sequence (3 bytes):
# a: B
# b: (C-8)*4 + Floor(D/4)
# c: (D%4)*4 + (E-8)
# d: F
#F0-F4 sequence (4 bytes):
# a: Floor(B/4)
# b: (B%4)*4 + (C-8)
# c: D
# d: (E-8)*4 + Floor(F/4)
# e: (F%4)*4 + (G-8)
# f: H
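#The hex-digit formulas can be checked against Python's own decoder.
#A minimal sketch, assuming the input is one valid UTF-8 sequence;
#decode_by_hex_digits is a hypothetical helper. For 4-byte sequences the
#first codepoint digit is taken from B (the second hex digit), since the
#first hex digit is always F.

```python
def decode_by_hex_digits(sequence):
    """Decode one valid UTF-8 sequence (bytes) to a codepoint, hex digit by hex digit."""
    digits = [int(ch, 16) for ch in sequence.hex()]  # A, B, C, D, ...
    if len(sequence) == 1:
        return sequence[0]  # 00-7F: as is
    if len(sequence) == 2:
        A, B, C, D = digits
        out = [0, (A - 12) * 4 + B // 4, (B % 4) * 4 + (C - 8), D]
    elif len(sequence) == 3:
        A, B, C, D, E, F = digits
        out = [B, (C - 8) * 4 + D // 4, (D % 4) * 4 + (E - 8), F]
    else:
        A, B, C, D, E, F, G, H = digits
        out = [
            B // 4,
            (B % 4) * 4 + (C - 8),
            D,
            (E - 8) * 4 + F // 4,
            (F % 4) * 4 + (G - 8),
            H,
        ]
    # Join the codepoint digits a, b, c, d[, e, f] back into one number.
    return int("".join(f"{digit:x}" for digit in out), 16)


for char in ("A", "\u00e9", "\u20ac", "\U0001f600"):
    codepoint = decode_by_hex_digits(char.encode("utf-8"))
    print(f"{char.encode('utf-8').hex(' ')} -> U+{codepoint:04X}")
```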
CODEPOINT -> BYTE #U+000000-U+00007f: as is
CONVERSION ==> #U+000080-U+0007ff:
# A: Floor(b/4) + 12
# B: (b%4)*4 + Floor(c/4)
# C: c%4 + 8
# D: d
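#Others: reverse the byte -> codepoint formulas:
#U+000800-U+00ffff:
# A: 14 (E)
# B: a
# C: Floor(b/4) + 8
# D: (b%4)*4 + Floor(c/4)
# E: c%4 + 8
# F: d
#U+010000-U+10ffff:
# A: 15 (F)
# B: a*4 + Floor(b/4)
# C: b%4 + 8
# D: c
# E: Floor(d/4) + 8
# F: (d%4)*4 + Floor(e/4)
# G: e%4 + 8
# H: f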
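#The reverse direction can be sketched the same way. The 2-byte branch
#follows the formulas above; the 3-byte and 4-byte branches are obtained by
#reversing the byte -> codepoint formulas. encode_by_hex_digits is a
#hypothetical helper and does not reject surrogates (U+d800-U+dfff).

```python
def encode_by_hex_digits(codepoint):
    """Encode one codepoint to UTF-8 bytes, hex digit by hex digit (sketch)."""
    if codepoint <= 0x7F:
        return bytes([codepoint])  # as is
    if codepoint <= 0x7FF:
        _, b, c, d = (int(ch, 16) for ch in f"{codepoint:04x}")
        digits = [b // 4 + 12, (b % 4) * 4 + c // 4, c % 4 + 8, d]
    elif codepoint <= 0xFFFF:
        a, b, c, d = (int(ch, 16) for ch in f"{codepoint:04x}")
        digits = [14, a, b // 4 + 8, (b % 4) * 4 + c // 4, c % 4 + 8, d]
    else:
        a, b, c, d, e, f = (int(ch, 16) for ch in f"{codepoint:06x}")
        digits = [
            15,
            a * 4 + b // 4,
            b % 4 + 8,
            c,
            d // 4 + 8,
            (d % 4) * 4 + e // 4,
            e % 4 + 8,
            f,
        ]
    # Join the byte digits A, B, C, D[, E, F[, G, H]] back into bytes.
    return bytes.fromhex("".join(f"{digit:x}" for digit in digits))


for codepoint in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{codepoint:04X} -> {encode_by_hex_digits(codepoint).hex(' ')}")
```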