-
Notifications
You must be signed in to change notification settings - Fork 16
/
float_fixed_point.theory.txt
167 lines (143 loc) · 9.91 KB
/
float_fixed_point.theory.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
FLOAT_FIXED_POINT
ALTERNATIVES ==> #Floating point numbers:
# - most common
# - implemented hardware, i.e. faster
# - low space
#Fixed point numbers:
# - lowest space
# - least variable precision
#String of digits:
# - infinite precision
# - higest space
#Rational numbers:
# - best precision for rational numbers (not irrational) with repeating decimals
#Base 2 vs base 10:
# - base 2 more efficient, but decimal rounding errors to represent base 10
/=+===============================+=\
/ : : \
)==: TYPES :==(
\ :_______________________________: /
\=+===============================+=/
REPRESENTATIONS ==> #How Q is represented:
# - scientific notation:
# - m * bⁿ
# - b∈ N and m,n∈ Z
# - logarithmic number system (LNS)
# - bⁿ
# - b∈ N and n∈ Q
# - string of digits
# - allows for arbitrary precision (see doc for integers)
# - fractions:
# - two integers: nominator|denominator
RADIX ==> #Can be:
# - base 2 (binary)
# - more commonly used
# - more compact
# - base 10 (decimal)
#Using 2 can lead to imprecise representation of base-10 decimal parts:
# - in string of digits
# - not in fractions
# - in scientific notation: not radix of m|n|b, but the value of b itself
FLOATING-POINT VS #Both represent with scientific notation m * bⁿ
FIXED-POINT ==> #Difference is whether b is:
# - floating-point: variable, i.e. stored in number
# - fixed-point: constant, i.e. stored in type ("scaling factor")
FIXED-POINT NOTATIONS #Notations:
==> # - QUINT1.UINT2
# - QUINT2
# - fxUINT1.UINT3
#Where:
# - UINT1: number of digits before radix point
# - UINT2: number of digits after radix point
# - UINT3: UINT1 + UINT2
IMPLEMENTATION ==> #Floating points are usually the only ones with hardware support
/=+===============================+=\
/ : : \
)==: IEEE 754 :==(
\ :_______________________________: /
\=+===============================+=/
IEEE 754 STANDARD ==> #Commonly used to represent floating point.
#Three fields, contiguous in memory:
# - sign bit: +|- (0|1)
# - exponent
# - signedness:
# - either 2's complement or offset binary
# - i.e. one bit is for signedness
# - maximum exponent is reserved for infinity|NaN. Real one is 1 below
# - minimum exponent is reserved for 0|denormalized number. Real one is 2 above
# - significand|mantissa:
# - only decimal digits after implicit 1.* ("hidden|implicit bit")
# - i.e. >=1 and <2
#Initially 1985, but extended in:
# - 2008: adds binary16|128 and decimal32|64|128
# - 2019
DENORMALIZED NUMBERS ==>#Minimum exponent, significand not 0
#Significand is implied to start with 0.* instead of 1.*
# - i.e. >0 and <1
# - i.e. better precision for very small 0.*
#Also called denormal|subnormal
ZERO ==> #Minimum exponent, significand 0
#Sign bit allows +0|-0
INFINITY ==> #Maximum exponent, significand 0
#Sign bit allows +∞|-∞
NAN/NOT-A-NUMBER ==> #Maximum exponent, significand not 0
#Meant to represent computation error
# - can use significand as error code
#All numbers != NaN, including itself
#sNaN|signaling NaN:
# - raise an exception
# - e.g. overflow|underflow
# - as opposed to quiet NaN
# - e.g. division by 0 or square root of -1
# - determined by first bit of significand
PRECISION ==> #Max [safe] integer:
# - above this, integers will not be precise
# - when significand full (including implicit 1) and using the right exponent to make it integer
# - 2 ** (SignificandBits + 1)
#Min [safe] integer:
# - same with signed bit
# - -MaxInteger
#Epsilon:
# - smallest difference|increment between two significands
# - i.e. max precision without exponation
# - 2 ** -SignificandBits
#Exponent bias:
# - max value of the exponent field, minus 1
# - 2 ** (ExponentBits - 1) - 1
#Max exponent:
# - how many base-10 zeros can be appended
# - base ** ExponentBias
#Min exponent:
# - how many base-10 decimal zeros can be prepended
# - base ** -(ExponentBias - 1)
#Max value:
# - above this, Infinity
# - 2 * MaxExponent
#Min value (denormalized):
# - above this, -Infinity
# - Epsilon * MinExponent
LONG DOUBLE ==> #Is not standard. Differences:
# - no implied 1
# - no denormalized numbers
# - Infinity|NaN slightly different
FORMATS ==> # STANDARD NAME RADIX SIZE SIGNIFICAND EXPONENT MIN|MAX EPSILON MAX
# (bits) (bits) INTEGER EXPONENT
# no bfloat16 2 16 7 8 bits 64 7e-3 2e38
# binary16 half 2 16 10 5 bits 2e3 9e-4 3e4
# binary32 float|single 2 32 23 8 bits 2e7 1e-7 2e38
# binary64 double 2 64 52 11 bits 9e15 2e-16 9e307
# no extended|long double 2 80 64 15 bits 2e19 5e-20 6e4931
# binary128 quad[ruple] 2 128 112 15 bits 1e34 2e-34 6e4931
# binary256 octuple 2 256 236 19 bits 2e71 9e-72 8e78912
# decimal32 10 32 7 ±95 1e7 1e-7 1e95
# decimal64 10 64 16 ±383 1e16 1e-16 1e383
# decimal128 10 128 34 ±6143 1e34 1e-34 1e6143
IMPLEMENTATION ==> #In CPUs, done by a FPU (floating-point unit, ou "math coprocesser")
#By formats:
# - bfloat16: high-end Intel CPUs
# - binary16|32|64: most CPUs
# - long double: x86 CPUs
# - binary128: some non-mainstream CPUs, some compilers with flags
# - binary256: no hardware support