SEARCH_TREE
/=+===============================+=\
/ : : \
)==: ALGORITHMS :==(
\ :_______________________________: /
\=+===============================+=/
BINARY SEARCH ==> #Search within a sorted array:
# - check whether the median element is <= or >= the searched element
# - recurse over the lower|upper half
#Also called "half-interval|logarithmic search" or "binary chop"
#Average|worst time complexity: O(log₂ n)
#Space complexity: O(1)
#Variants (micro-optimizations):
# - "uniform binary search":
# - when array size is static
# - pre-compute integer offsets for each step and store in a lookup table,
# as opposed to calculating the middle index from the upper and lower bounds
# - "fibonacci search":
# - instead of checking the median element, check the highest index that is a Fibonacci
# number
# - goal: faster on CPUs where bit shifts (<<), * or / are slow compared to +
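#Example: a minimal Python sketch of iterative binary search (illustrative, not from
#these notes); returns the index of `target` in the sorted list `arr`, or -1 if absent:
#   def binary_search(arr, target):
#       lo, hi = 0, len(arr) - 1
#       while lo <= hi:
#           mid = (lo + hi) // 2            # median index of the current half
#           if arr[mid] == target:
#               return mid
#           if arr[mid] < target:
#               lo = mid + 1                # recurse over the upper half
#           else:
#               hi = mid - 1                # recurse over the lower half
#       return -1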
DATA STRUCTURES ==> # - search tree: as a tree
# - skip list: as a list of linked lists
# - trie: when values are arrays
# - rope, conc tree: when values are elements of a single array, accessed by index
#See below for more information
EXPONENTIAL SEARCH ==> #Similar to binary search, but with an exponentially increasing increment.
#Also called "doubling|galloping|Struzik search".
#E.g. with factor 2:
# - check index 1, 2, 4, 8, etc. until >= value
# - then recurse on the sub-array
#Same time|space complexity as binary search but:
# - streaming algorithm: can be performed on arrays of unknown|infinite size
# - faster if element is at beginning, slower if at end
# - better locality of reference
#Variant:
# - perform successive searches with decreasing increments
# - e.g. index **= 3 then index **= 2 then index *= 2
# - same time|space complexity
# - faster for elements at end, but slower at beginning
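#Example: a minimal Python sketch of exponential (doubling) search, reusing the
#binary_search sketch above on the located sub-array (illustrative assumptions):
#   def exponential_search(arr, target):
#       if not arr:
#           return -1
#       bound = 1
#       while bound < len(arr) and arr[bound] < target:
#           bound *= 2                      # check index 1, 2, 4, 8, ... until >= target
#       lo, hi = bound // 2, min(bound, len(arr) - 1)
#       index = binary_search(arr[lo:hi + 1], target)
#       return lo + index if index != -1 else -1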
INTERPOLATION SEARCH ==> #Similar to the binary search but:
# - instead of picking the median, pick the most likely index, proportionally to the value
# - i.e. instead of index at 50%, use index at (value - lower) / (upper - lower)
# - "proportional" is according to the probability distribution of the values
# - i.e. the calculation above is when distribution is uniform
#Time complexity:
# - best: O(log log n) when values follow the assumed probability distribution
# - worst: O(n)
#Goal: faster when probability distribution of values is known
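#Example: a minimal Python sketch of interpolation search, assuming roughly uniformly
#distributed numeric values in the sorted list `arr` (illustrative only):
#   def interpolation_search(arr, target):
#       lo, hi = 0, len(arr) - 1
#       while lo <= hi and arr[lo] <= target <= arr[hi]:
#           if arr[hi] == arr[lo]:          # avoid division by zero on equal bounds
#               return lo if arr[lo] == target else -1
#           # pick the most likely index, proportionally to the value
#           mid = lo + int((target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]))
#           if arr[mid] == target:
#               return mid
#           if arr[mid] < target:
#               lo = mid + 1
#           else:
#               hi = mid - 1
#       return -1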
FRACTIONAL CASCADING ==> #Data structure to perform a sequence of binary searches on k different arrays.
#Each array:
# - precomputes the position each of its values would have if inserted in the next array, i.e.:
# - this can be used as fingers to speed up looking up the next array
# - each extra array search performs in O(1)
# - contains half of the values (every odd index) of the next array
# - i.e. the position of each value in its original array must also be stored
# - this is built from last to first, i.e. it cascades
# - reason: otherwise if next array has very different range of values, the finger found
# in current array might not be helpful at all
#Time complexity: O(log n + k) instead of O(k * log n/k)
#Space complexity: O(n) because each array at most doubles in size, and stores two extra indices per value
#Variants:
# - "catalog graph":
# - make each array point to several "next arrays" instead
# - i.e. arrays become nodes of a directed graph
# - goal: when order of sequence of binary searches is dynamic
# - "dynamic fractional cascading":
# - when data structure is dynamic, i.e. insert|delete are allowed
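#Example: a didactic (unoptimized) Python sketch, assuming non-empty sorted lists of
#distinct values; each query returns, for every list, the index of its first element >= x:
#   from bisect import bisect_left
#
#   def build_cascade(arrays):
#       # M[i] = arrays[i] merged with every other element of M[i+1], built last to first
#       # entry: (value, successor index in arrays[i], successor index ("bridge") in M[i+1])
#       M = [None] * len(arrays)
#       M[-1] = [(v, j, 0) for j, v in enumerate(arrays[-1])]
#       for i in range(len(arrays) - 2, -1, -1):
#           next_values = [v for (v, _, _) in M[i + 1]]
#           merged = sorted(arrays[i] + next_values[::2])
#           M[i] = [(v, bisect_left(arrays[i], v), bisect_left(next_values, v)) for v in merged]
#       return M
#
#   def cascade_search(arrays, M, x):
#       results = []
#       p = bisect_left([v for (v, _, _) in M[0]], x)   # single O(log n) binary search
#       for i in range(len(arrays)):
#           if p < len(M[i]):
#               own, bridge = M[i][p][1], M[i][p][2]
#           else:
#               own = len(arrays[i])
#               bridge = len(M[i + 1]) if i + 1 < len(arrays) else 0
#           results.append(own)
#           if i + 1 < len(arrays):
#               # the successor in M[i+1] is at `bridge` or one position before it: O(1)
#               p = bridge - 1 if bridge > 0 and M[i + 1][bridge - 1][0] >= x else bridge
#       return results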
/=+===============================+=\
/ : : \
)==: SEARCH TREE :==(
\ :_______________________________: /
\=+===============================+=/
SEARCH TREE ==> #Arborescence where nodes are ordered: left children < parent < right children.
#Also called "ordered|sorted tree"
#Optimized for:
# - search
# - serial access in sorted order
#Time complexity:
# - search: O(m/2*logₘ n) where m is outdegree
# - insert|delete:
# - like search
# - first search for it, then perform a constant-time insert|delete
# - tree must be balanced, otherwise:
# - height will average ᵐ√order, and complexity O(ᵐ√n)
# - can even be O(n) if search tree degenerates into a line graph
#Values can be stored either on each node or only on leaves
#Pros:
# - O(log n) operations (search|insert|delete)
# - automatically ordered, good for iteration
# - persistent data structure
# - recursive data structure
#Cons:
# - fairly "average" data structure: space|time complexity is neither best nor wosrt
# - less time efficient if small number of items
BINARY SEARCH TREE (BST) ==> #Search tree where outdegree = 2
#Time complexity (if balanced): O(log n)
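#Example: a minimal Python sketch of an (unbalanced) binary search tree with search
#and insert (illustrative only):
#   class Node:
#       def __init__(self, value):
#           self.value, self.left, self.right = value, None, None
#
#   def search(node, value):
#       while node is not None and node.value != value:
#           node = node.left if value < node.value else node.right
#       return node                          # None if absent
#
#   def insert(node, value):
#       if node is None:
#           return Node(value)
#       if value < node.value:
#           node.left = insert(node.left, value)
#       elif value > node.value:
#           node.right = insert(node.right, value)
#       return node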
LEFT-CHILD RIGHT-SIBLING (LCRS) #Tree obtained by applying the "Knuth transform" to any rooted directed tree.
TREE ==> #Knuth transform:
# - make each sibling point to next sibling
# - only keep edge from parent to first child
#Also called "child-sibling representation", "doubly-chained tree", "filial-heir chain"
#Consequences:
# - pro: result is a binary tree, which is more space efficient than a k-ary tree
# - i.e. when vertices' outdegrees are not known in advance, children require dynamic
# arrays, which need extra pointers and dope vectors
# - con: siblings must be accessed serially, not randomly
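#Example: a minimal Python sketch of the Knuth transform, turning a k-ary node (dict with
#a list of children) into a left-child right-sibling binary node (illustrative only):
#   def to_lcrs(node):
#       lcrs = {"value": node["value"], "child": None, "sibling": None}
#       prev = None
#       for child in node["children"]:
#           converted = to_lcrs(child)
#           if prev is None:
#               lcrs["child"] = converted    # keep only the edge to the first child
#           else:
#               prev["sibling"] = converted  # each sibling points to the next sibling
#           prev = converted
#       return lcrs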
/=+===============================+=\
/ : : \
)==: SELF-BALANCING :==(
\ :_______________________________: /
\=+===============================+=/
SELF-BALANCING TREE ==> #Tree that remains balanced after any set of operations
#Also called "height-balancing tree"
#Most often applied to search trees
B-TREE ==> #Search tree where:
# - outdegree must be between ⌈U/2⌉ and U ("order")
# - no lower limit of outdegree for root
# - node's value is an array of keys:
# - acting as separators between each two children
# - U can be picked to try to make nodes the same size as an I/O block
#Similar to binary search tree:
# - which can be considered a b-tree of order 2
#The higher the order:
# - better when nodes are read|write sequentially
# - i.e. when sibling nodes are usually read|write together
# - e.g. a file's blocks
# - instead of O(log₂ n), performs in O(m/2*logₘ n) to find node
# (which is worse), but then iterating|inserting|deleting siblings is faster
# - worse space efficiency: some node space is left empty
# - less frequent balancing needed
#If U is very high, can:
# - implement keys as a search tree themselves instead of a simple array
# - use delta encoding compression on keys (better space, worse time)
#Self-balancing:
# - if node full|empty, can:
# - transfer ("rotation") children between siblings
# - merge|split siblings
# - must change parent keys as well
# - merge|split child to|from parent
# - must be done recursively from node to root
# - can be done either:
# - between operations
# - during operation, as nodes are being visited
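#Example: a minimal Python sketch of B-tree search only (no balancing), where a node is
#{"keys": sorted list, "children": list of len(keys)+1 nodes, or [] for leaves}:
#   from bisect import bisect_left
#
#   def btree_search(node, key):
#       i = bisect_left(node["keys"], key)   # binary search among the node's separators
#       if i < len(node["keys"]) and node["keys"][i] == key:
#           return node
#       if not node["children"]:             # reached a leaf: key is absent
#           return None
#       return btree_search(node["children"][i], key)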
B+ TREE ==> #B-tree where:
# - values are stored only on leaves
# - each leaf contains a pointer to next leaf (i.e. linked list)
#I.e. like B-tree but with better sequential access across several leaves.
RANDOM SEARCH TREE ==> #When shuffling nodes before inserting many of them in a search tree.
#Reason:
# - lower depth on average
# - e.g. inserting many sorted nodes (without balancing) would otherwise
# create a line graph as a subtree, increasing depth linearly
#Variants:
# - do a random pick among all possible trees after insertion of each
# permutation of the nodes.
#For a binary search tree, on average, depth will be:
# - mean: 2*log(n)+1
# - max: 4.3*log(n)
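#Example: a minimal Python sketch reusing the BST insert above: shuffling keys before
#inserting them yields a random search tree (illustrative only):
#   import random
#
#   def build_random_bst(values):
#       values = list(values)
#       random.shuffle(values)               # randomize insertion order to lower depth
#       root = None
#       for value in values:
#           root = insert(root, value)
#       return root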
TREAP ==> #Tree of [VAL, KEY] pairs:
# - VALs behave like a search tree, usually binary
# - KEYs behave like a heap
# - self-balanced by making it a random search tree (using random KEYs)
#Goal is self-balancing a search tree
#Self-balance:
# - insertion:
# - perform binary search tree insertion using VAL
# - assign a random KEY
# - perform tree rotations on that node and its parent until KEY satisfies heap property
# - deletion:
# - perform tree rotations on that node and its right child until that node has no children
# - delete that node
# - split:
# - insert a node with VAL where split occurs and KEY being highest so that node becomes
# root
# - then remove that node, splitting tree
# - merge:
# - if max VAL of first tree is less than min VAL of second tree
# - then insert a node with a VAL in-between, and KEY being lowest so that rotations
# perform the merge
# - then remove that node
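#Example: a minimal Python sketch of treap insertion, treating lower random KEY as
#higher heap priority (illustrative only):
#   import random
#
#   class TreapNode:
#       def __init__(self, val):
#           self.val = val                   # VAL: ordered like a binary search tree
#           self.key = random.random()       # KEY: ordered like a (min-)heap
#           self.left = self.right = None
#
#   def rotate_right(node):
#       left = node.left
#       node.left, left.right = left.right, node
#       return left
#
#   def rotate_left(node):
#       right = node.right
#       node.right, right.left = right.left, node
#       return right
#
#   def treap_insert(node, val):
#       if node is None:
#           return TreapNode(val)
#       if val < node.val:
#           node.left = treap_insert(node.left, val)
#           if node.left.key < node.key:     # rotate until the heap property holds
#               node = rotate_right(node)
#       else:
#           node.right = treap_insert(node.right, val)
#           if node.right.key < node.key:
#               node = rotate_left(node)
#       return node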
/=+===============================+=\
/ : : \
)==: LIST :==(
\ :_______________________________: /
\=+===============================+=/
SKIP LIST ==> #List of linked lists which have similar properties|behavior as a search tree:
# - each linked list is similar to a search tree level:
# - the first linked list contains only first element and last element
# (which is a sentinel node)
# - the last linked list contains all elements from first to last
# - others contain more and more elements, following either a deterministic or
# a random factor p
# - nodes from different linked lists but with same element are contiguous in memory
# - i.e. pointing down to next linked list does not require a pointer
#Operations:
# - search:
# - goes through each linked list in order to find the highest smaller value
# - this requires backtracking one node, i.e. previous node should be kept in state
# - then goes down to next linked list
# - insert:
# - insert to lowest linked list, then use randomness to decide how many nodes to create
# in upper linked lists
# - i.e. each linked list has probability 1/p, provided the lower linked list has a node
# - if values somewhat follow a random distribution, can also use them as the seed
# - i.e. similar to a search tree that self-balances with randomness (like treap)
#Pros|cons with search trees:
# - similarly to B+ trees, iteration is efficient
#"Indexable skiplist":
# - storing how many elements are skipped by each linked list's next pointer
# - goal: access by index O(log n) instead of O(n)
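#Example: a minimal Python sketch of a skip list with search and insert, using a random
#factor p = 0.5 and a sentinel head node (illustrative only):
#   import random
#
#   class SkipNode:
#       def __init__(self, value, level):
#           self.value = value
#           self.next = [None] * level       # one forward pointer per linked list (level)
#
#   class SkipList:
#       def __init__(self, max_level=16, p=0.5):
#           self.max_level, self.p = max_level, p
#           self.head = SkipNode(None, max_level)
#
#       def _random_level(self):
#           level = 1
#           while level < self.max_level and random.random() < self.p:
#               level += 1
#           return level
#
#       def search(self, value):
#           node = self.head
#           for level in reversed(range(self.max_level)):
#               while node.next[level] and node.next[level].value < value:
#                   node = node.next[level]  # highest smaller value on this level
#           node = node.next[0]              # then go down to the lowest linked list
#           return node is not None and node.value == value
#
#       def insert(self, value):
#           update, node = [self.head] * self.max_level, self.head
#           for level in reversed(range(self.max_level)):
#               while node.next[level] and node.next[level].value < value:
#                   node = node.next[level]
#               update[level] = node         # last node before `value` on each level
#           new = SkipNode(value, self._random_level())
#           for level in range(len(new.next)):
#               new.next[level] = update[level].next[level]
#               update[level].next[level] = new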
/=+===============================+=\
/ : : \
)==: TRIE :==(
\ :_______________________________: /
\=+===============================+=/
TRIE ==> #Search tree where:
# - keys are arrays
# - nodes store individual keys' items
# - i.e. root-to-terminal-node path contains full key
# - root node has no key item
#Also called "prefix tree"
#Goal:
# - similar to other search trees, but optimized for operations on arrays
# - only efficient when keys share common prefixes
#Time complexity:
# - min|average|max O(m) where m is min|average|max array length
# - but each node step should be fast as it involves a single key item
# - as opposed to e.g. a binary search tree, which compares the whole array at
# each step, i.e. complexity is actually O(m*log(n)) on arrays
#Keys are arrays, including:
# - strings
# - bits ("bitwise trie")
#Terminal node:
# - every leaf is a terminal node but internal nodes can be too
# - can be represented either:
# - with a flag on the node
# - sometimes called "white" (terminal) / "black" (non-terminal)
# - with an empty terminal node child
#Edges from parent to children can be represented in several ways:
# - dynamic array ("adaptive"):
# - O(n) lookup time
# - LCRS tree:
# - O(n) lookup time but more space efficient
# - binary search tree:
# - called (although confusingly) "ternary search tree"
# - O(log n) lookup time
# - often represented as a tree with outdegree 3 where:
# - left and right children are siblings (i.e. the binary search subtree)
# - middle child is actual child (i.e. the trie supertree)
# - static array:
# - with every possible combination, filled with null pointers when no child
# - O(1) lookup time but O(m) space requirement, where m is number of combinations
# - "alphabet reduction":
# - reducing number of combinations, e.g. nibble-wise (16) instead of byte-wise (256)
# - con: increases tree height
# - "array mapped trie": when using "array bit mapping" (see its doc)
# - "concurrent trie"/"ctrie":
# - using a 32 bits map and a compare-and-swap atomic CPU instruction, so that
# trie can be thread-safe and lock-free
#Operations:
# - finding nodes by key prefix
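#Example: a minimal Python sketch of a trie as nested dicts, with a None key used as the
#terminal-node flag (illustrative only):
#   def trie_insert(root, word):
#       node = root
#       for item in word:
#           node = node.setdefault(item, {})     # one node per key item
#       node[None] = True                        # mark terminal node
#
#   def trie_search(root, word):
#       node = root
#       for item in word:
#           if item not in node:
#               return False
#           node = node[item]
#       return None in node                      # full key matched a terminal node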
RADIX TREE ==> #Trie where nodes can store several key items, as opposed to just one.
#Goal:
# - more space efficient
# - more time efficient on reads
#Cons:
# - less space efficient if high density
# - less time efficient on writes
#Operations:
# - insertion: during initial traversal, if there is a partially matching child, split it
# - deletion: after deletion, merge single-child nodes to their child
DAFSA ==> #DAFSA (see state doc) can be used to represent tries in a more space efficient way.
#As opposed to normal representation:
# - simple cycles (but not directed ones) are allowed
# - i.e. nodes can share multiple parents as long as it does not introduce directed cycles
# - i.e. space efficient when many common suffixes
BURST TRIE ==> #Trie ("access trie") whose leaves are another data structure ("burst containers").
#Goal:
# - using time|space efficiency of trie where prefixes are shared
# - while using time|space efficiency of other data structures where suffixes are unique
#Common burst containers: list, search tree.
#Different strategies for when a burst should happen:
# - max size of burst containers
# - max average number of operations on burst container
/=+===============================+=\
/ : : \
)==: ROPE :==(
\ :_______________________________: /
\=+===============================+=/
ROPE ==> #Binary tree where:
# - the whole tree represents a single array accessed by index
# - leaves contain sub-arrays as value
# - each node keeps track of the total length ("weight") of the leaves in its left subtree (a leaf's weight is its own length)
#Also called "cord"
#Goal and properties:
# - similar to search trees
# - but optimized for accessing an array by index, not searching for a value
#Operations:
# - access:
# - O(log n)
# - if index <= current node's weight, go to left child; otherwise go to right child and subtract the weight from the index
# - concat:
# - O(log n) then O(1)
# - add new root with each tree as one branch
# - might need to re-balance tree
# - split:
# - O(log n)
# - access node at that index
# - go to first ancestor which is a right child of its parent
# - separate from ancestors and their left subtrees
# - separated nodes might not be connected, in which case they must be concatenated
# - might need to re-balance trees
# - insert:
# - O(log n)
# - split at the index, concat the left part with the new element, then concat that with the right part
# - delete:
# - O(log n)
# - two splits to isolate node, delete it then concat the rest
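#Example: a minimal Python sketch of 0-based rope access by index and concat (weights
#only; splitting and re-balancing omitted; illustrative only):
#   class RopeLeaf:
#       def __init__(self, text):
#           self.text = text
#           self.weight = len(text)          # a leaf's weight is its own length
#
#   class RopeNode:
#       def __init__(self, left, right):
#           self.left, self.right = left, right
#           self.weight = rope_length(left)  # total length of the left subtree
#
#   def rope_length(node):
#       if isinstance(node, RopeLeaf):
#           return node.weight
#       return node.weight + rope_length(node.right)
#
#   def rope_index(node, i):
#       if isinstance(node, RopeLeaf):
#           return node.text[i]
#       if i < node.weight:
#           return rope_index(node.left, i)
#       return rope_index(node.right, i - node.weight)   # skip over the left subtree
#
#   def rope_concat(left, right):
#       return RopeNode(left, right)         # new root; re-balancing not shown
#
#   rope_index(rope_concat(RopeLeaf("hello "), RopeLeaf("world")), 6)   # -> "w"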
CONC TREE|LIST ==> #Self-balanced binary tree where:
# - the whole tree represents a single array accessed by index
# - leaves contain single elements as value
# - node values keep track of height and size
#Operations:
# - access by index:
# - O(log n)
# - check children's sizes recursively until reaching the leaf
# - insert|concat by index:
# - O(log n)
# - on the path from root to the two leaves around that index
# - find the first node whose height differs by 0|1, and make them siblings under a new node
# - split: insert then remove that node
#Can be mixed with rope ("conc rope")