SEARCH_TREE
/=+===============================+=\
/ : : \
)==: ALGORITHMS :==(
\ :_______________________________: /
\=+===============================+=/
BINARY SEARCH ==> #Search within a sorted array:
# - check whether the median element is <= or >= the searched element
# - recurse over the lower|upper half
#Also called "half-interval|logarithmic search" or "binary chop"
#Average|worst time complexity: O(log₂ n)
#Space complexity: O(1)
#Variants (micro-optimizations):
# - "uniform binary search":
# - when array size is static
# - pre-compute integer offsets for each step and store in a lookup table,
# as opposed to calculating the middle index from the upper and lower bounds
# - "fibonacci search":
# - instead of checking the median element, check the highest index that is a Fibonacci
# number
# - goal: faster on CPUs where bit shifts (<<), * or / are slow compared to +
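#Example: a minimal Python sketch of iterative binary search (illustrative, not from
#these notes); returns the index of `target` in the sorted list `arr`, or -1 if absent:
#   def binary_search(arr, target):
#       lo, hi = 0, len(arr) - 1
#       while lo <= hi:
#           mid = (lo + hi) // 2            # median index of the current half
#           if arr[mid] == target:
#               return mid
#           if arr[mid] < target:
#               lo = mid + 1                # recurse over the upper half
#           else:
#               hi = mid - 1                # recurse over the lower half
#       return -1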
DATA STRUCTURES ==> # - search tree: as a tree
# - skip list: as a list of linked lists
# - trie: when values are arrays
# - rope, conc tree: when values are elements of a single array, accessed by index
#See below for more information
EXPONENTIAL SEARCH ==> #Similar to binary search, but with an exponentially increasing increment.
#Also called "doubling|galloping|Struzik search".
#E.g. with factor 2:
# - check index 1, 2, 4, 8, etc. until >= value
# - then recurse on the sub-array
#Same time|space complexity as binary search but:
# - streaming algorithm: can be performed on arrays of unknown|infinite size
# - faster if element is at beginning, slower if at end
# - better locality of reference
#Variant:
# - perform successive searches with decreasing increments
# - e.g. index **= 3 then index **= 2 then index *= 2
# - same time|space complexity
# - faster for elements at end, but slower at beginning
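#Example: a minimal Python sketch of exponential (doubling) search, reusing the
#binary_search sketch above on the located sub-array (illustrative assumptions):
#   def exponential_search(arr, target):
#       if not arr:
#           return -1
#       bound = 1
#       while bound < len(arr) and arr[bound] < target:
#           bound *= 2                      # check index 1, 2, 4, 8, ... until >= target
#       lo, hi = bound // 2, min(bound, len(arr) - 1)
#       index = binary_search(arr[lo:hi + 1], target)
#       return lo + index if index != -1 else -1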
INTERPOLATION SEARCH ==> #Similar to the binary search but:
# - instead of picking the median, pick the most likely index, proportionally to the value
# - i.e. instead of index at 50%, use index at (value - lower) / (upper - lower)
# - "proportional" is according to the probability distribution of the values
# - i.e. the calculation above is when distribution is uniform
#Time complexity:
# - best: O(log log n) when values follow the assumed probability distribution
# - worst: O(n)
#Goal: faster when probability distribution of values is known
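#Example: a minimal Python sketch of interpolation search, assuming roughly uniformly
#distributed numeric values in the sorted list `arr` (illustrative only):
#   def interpolation_search(arr, target):
#       lo, hi = 0, len(arr) - 1
#       while lo <= hi and arr[lo] <= target <= arr[hi]:
#           if arr[hi] == arr[lo]:          # avoid division by zero on equal bounds
#               return lo if arr[lo] == target else -1
#           # pick the most likely index, proportionally to the value
#           mid = lo + int((target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]))
#           if arr[mid] == target:
#               return mid
#           if arr[mid] < target:
#               lo = mid + 1
#           else:
#               hi = mid - 1
#       return -1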
FRACTIONAL CASCADING ==> #Data structure to perform a sequence of binary searches on k different arrays.
#Each array:
# - precomputes the position each of its values would have if inserted in the next array, i.e.:
# - this can be used as fingers to speed up looking up the next array
# - each extra array search performs in O(1)
# - contains half of the values (every odd index) of the next array
# - i.e. the position of each value in its original array must also be stored
# - this is built from last to first, i.e. it cascades
# - reason: otherwise if next array has very different range of values, the finger found
# in current array might not be helpful at all
#Time complexity: O(log n + k) instead of O(k * log n/k)
#Space complexity: O(n) because each array at most doubles in size, and stores two extra indices per value
#Variants:
# - "catalog graph":
# - make each array point to several "next arrays" instead
# - i.e. arrays become nodes of a directed graph
# - goal: when order of sequence of binary searches is dynamic
# - "dynamic fractional cascading":
# - when data structure is dynamic, i.e. insert|delete are allowed
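#Example: a didactic (unoptimized) Python sketch, assuming non-empty sorted lists of
#distinct values; each query returns, for every list, the index of its first element >= x:
#   from bisect import bisect_left
#
#   def build_cascade(arrays):
#       # M[i] = arrays[i] merged with every other element of M[i+1], built last to first
#       # entry: (value, successor index in arrays[i], successor index ("bridge") in M[i+1])
#       M = [None] * len(arrays)
#       M[-1] = [(v, j, 0) for j, v in enumerate(arrays[-1])]
#       for i in range(len(arrays) - 2, -1, -1):
#           next_values = [v for (v, _, _) in M[i + 1]]
#           merged = sorted(arrays[i] + next_values[::2])
#           M[i] = [(v, bisect_left(arrays[i], v), bisect_left(next_values, v)) for v in merged]
#       return M
#
#   def cascade_search(arrays, M, x):
#       results = []
#       p = bisect_left([v for (v, _, _) in M[0]], x)   # single O(log n) binary search
#       for i in range(len(arrays)):
#           if p < len(M[i]):
#               own, bridge = M[i][p][1], M[i][p][2]
#           else:
#               own = len(arrays[i])
#               bridge = len(M[i + 1]) if i + 1 < len(arrays) else 0
#           results.append(own)
#           if i + 1 < len(arrays):
#               # the successor in M[i+1] is at `bridge` or one position before it: O(1)
#               p = bridge - 1 if bridge > 0 and M[i + 1][bridge - 1][0] >= x else bridge
#       return results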
/=+===============================+=\
/ : : \
)==: SEARCH TREE :==(
\ :_______________________________: /
\=+===============================+=/
SEARCH TREE ==> #Arborescence where nodes are ordered: left children < parent < right children.
#Also called "ordered|sorted tree"
#Optimized for:
# - search
# - serial access in sorted order
#Time complexity:
# - search: O(m/2*logₘ n) where m is outdegree
# - insert|delete:
# - like search
# - first search for it, then perform a constant-time insert|delete
# - tree must be balanced, otherwise:
# - height will average ᵐ√order, and complexity O(ᵐ√n)
# - can even be O(n) if search tree degenerates into a line graph
#Values can be stored either on each node or only on leaves
#Pros:
# - O(log n) operations (search|insert|delete)
# - automatically ordered, good for iteration
# - persistent data structure
# - recursive data structure
#Cons:
# - fairly "average" data structure: space|time complexity is neither best nor wosrt
# - less time efficient if small number of items
BINARY SEARCH TREE (BST) ==> #Search tree where outdegree = 2
#Time complexity (if balanced): O(log n)
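#Example: a minimal Python sketch of an (unbalanced) binary search tree with search
#and insert (illustrative only):
#   class Node:
#       def __init__(self, value):
#           self.value, self.left, self.right = value, None, None
#
#   def search(node, value):
#       while node is not None and node.value != value:
#           node = node.left if value < node.value else node.right
#       return node                          # None if absent
#
#   def insert(node, value):
#       if node is None:
#           return Node(value)
#       if value < node.value:
#           node.left = insert(node.left, value)
#       elif value > node.value:
#           node.right = insert(node.right, value)
#       return node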
LEFT-CHILD RIGHT-SIBLING (LCRS) #Tree obtained by applying the "Knuth transform" to any rooted directed tree.
TREE ==> #Knuth transform:
# - make each sibling point to next sibling
# - only keep edge from parent to first child
#Also called "child-sibling representation", "doubly-chained tree", "filial-heir chain"
#Consequences:
# - pro: result is a binary tree, which is more space efficient than a k-ary tree
# - i.e. when vertices' outdegrees are not known in advance, children require dynamic
# arrays, which need extra pointers and dope vectors
# - con: siblings must be accessed serially, not randomly
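#Example: a minimal Python sketch of the Knuth transform, turning a k-ary node (dict with
#a list of children) into a left-child right-sibling binary node (illustrative only):
#   def to_lcrs(node):
#       lcrs = {"value": node["value"], "child": None, "sibling": None}
#       prev = None
#       for child in node["children"]:
#           converted = to_lcrs(child)
#           if prev is None:
#               lcrs["child"] = converted    # keep only the edge to the first child
#           else:
#               prev["sibling"] = converted  # each sibling points to the next sibling
#           prev = converted
#       return lcrs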
/=+===============================+=\
/ : : \
)==: SELF-BALANCING :==(
\ :_______________________________: /
\=+===============================+=/
SELF-BALANCING TREE ==> #Tree that remains balanced after any set of operations
#Also called "height-balancing tree"
#Most often applied to search trees
B-TREE ==> #Search tree where:
# - outdegree must be between ⌈U/2⌉ and U ("order")
# - no lower limit of outdegree for root
# - node's value is an array of keys:
# - acting as separators between each two children
# - U can be picked to try to make nodes the same size as an I/O block
#Similar to binary search tree:
# - which can be considered a b-tree of order 2
#The higher the order:
# - better when nodes are read|write sequentially
# - i.e. when sibling nodes are usually read|write together
# - e.g. a file's blocks
# - instead of O(log₂ n), performs in O(m/2*logₘ n) to find node
# (which is worse), but then iterating|inserting|deleting siblings is faster
# - worse space efficiency: some node space is left empty
# - less frequent balancing needed
#If U is very high, can:
# - implement keys as a search tree themselves instead of a simple array
# - use delta encoding compression on keys (better space, worse time)
#Self-balancing:
# - if node full|empty, can:
# - transfer ("rotation") children between siblings
# - merge|split siblings
# - must change parent keys as well
# - merge|split child to|from parent
# - must be done recursively from node to root
# - can be done either:
# - between operations
# - during operation, as nodes are being visited
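#Example: a minimal Python sketch of B-tree search only (no balancing), where a node is
#{"keys": sorted list, "children": list of len(keys)+1 nodes, or [] for leaves}:
#   from bisect import bisect_left
#
#   def btree_search(node, key):
#       i = bisect_left(node["keys"], key)   # binary search among the node's separators
#       if i < len(node["keys"]) and node["keys"][i] == key:
#           return node
#       if not node["children"]:             # reached a leaf: key is absent
#           return None
#       return btree_search(node["children"][i], key)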
B+ TREE ==> #B-tree where:
# - values are stored only on leaves
# - each leaf contains a pointer to next leaf (i.e. linked list)
#I.e. like B-tree but with better sequential access across several leaves.
RANDOM SEARCH TREE ==> #When shuffling nodes before inserting many of them in a search tree.
#Reason:
# - lower depth on average
# - e.g. inserting many sorted nodes (without balancing) would otherwise
# create a line graph as a subtree, increasing depth linearly
#Variants:
# - do a random pick among all possible trees after insertion of each
# permutation of the nodes.
#For a binary search tree, on average, depth will be:
# - mean: 2*log(n)+1
# - max: 4.3*log(n)
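#Example: a minimal Python sketch reusing the BST insert above: shuffling keys before
#inserting them yields a random search tree (illustrative only):
#   import random
#
#   def build_random_bst(values):
#       values = list(values)
#       random.shuffle(values)               # randomize insertion order to lower depth
#       root = None
#       for value in values:
#           root = insert(root, value)
#       return root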
TREAP ==> #Tree of [VAL, KEY] pairs:
# - VALs behave like a search tree, usually binary
# - KEYs behave like a heap
# - self-balanced by making it a random search tree (using random KEYs)
#Goal is self-balancing a search tree
#Self-balance:
# - insertion:
# - perform binary search tree insertion using VAL
# - assign a random KEY
# - perform tree rotations on that node and its parent until KEY satisfies heap property
# - deletion:
# - perform tree rotations on that node and its right child until that node has no children
# - delete that node
# - split:
# - insert a node with VAL where split occurs and KEY being highest so that node becomes
# root
# - then remove that node, splitting tree
# - merge:
# - if max VAL of first tree is less than min VAL of second tree
# - then insert a node with a VAL in-between, and KEY being lowest so that rotations
# perform the merge
# - then remove that node
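#Example: a minimal Python sketch of treap insertion, treating lower random KEY as
#higher heap priority (illustrative only):
#   import random
#
#   class TreapNode:
#       def __init__(self, val):
#           self.val = val                   # VAL: ordered like a binary search tree
#           self.key = random.random()       # KEY: ordered like a (min-)heap
#           self.left = self.right = None
#
#   def rotate_right(node):
#       left = node.left
#       node.left, left.right = left.right, node
#       return left
#
#   def rotate_left(node):
#       right = node.right
#       node.right, right.left = right.left, node
#       return right
#
#   def treap_insert(node, val):
#       if node is None:
#           return TreapNode(val)
#       if val < node.val:
#           node.left = treap_insert(node.left, val)
#           if node.left.key < node.key:     # rotate until the heap property holds
#               node = rotate_right(node)
#       else:
#           node.right = treap_insert(node.right, val)
#           if node.right.key < node.key:
#               node = rotate_left(node)
#       return node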
/=+===============================+=\
/ : : \
)==: LIST :==(
\ :_______________________________: /
\=+===============================+=/
SKIP LIST ==> #List of linked lists which have similar properties|behavior as a search tree:
# - each linked list is similar to a search tree level:
# - the first linked list contains only first element and last element
# (which is a sentinel node)
# - the last linked list contains all elements from first to last
# - others contain more and more elements, following either a deterministic or
# a random factor p
# - nodes from different linked lists but with same element are contiguous in memory
# - i.e. pointing down to next linked list does not require a pointer
#Operations:
# - search:
# - goes through each linked list in order to find the highest smaller value
# - this requires backtracking one node, i.e. previous node should be kept in state
# - then goes down to next linked list
# - insert:
# - insert to lowest linked list, then use randomness to decide how many nodes to create
# in upper linked lists
# - i.e. each linked list has probability 1/p, provided the lower linked list has a node
# - if values somewhat follow a random distribution, can also use them as the seed
# - i.e. similar to a search tree that self-balances with randomness (like treap)
#Pros|cons with search trees:
# - similarly to B+ trees, iteration is efficient
#"Indexable skiplist":
# - storing how many elements are skipped by each linked list's next pointer
# - goal: access by index O(log n) instead of O(n)
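#Example: a minimal Python sketch of a skip list with search and insert, using a random
#factor p = 0.5 and a sentinel head node (illustrative only):
#   import random
#
#   class SkipNode:
#       def __init__(self, value, level):
#           self.value = value
#           self.next = [None] * level       # one forward pointer per linked list (level)
#
#   class SkipList:
#       def __init__(self, max_level=16, p=0.5):
#           self.max_level, self.p = max_level, p
#           self.head = SkipNode(None, max_level)
#
#       def _random_level(self):
#           level = 1
#           while level < self.max_level and random.random() < self.p:
#               level += 1
#           return level
#
#       def search(self, value):
#           node = self.head
#           for level in reversed(range(self.max_level)):
#               while node.next[level] and node.next[level].value < value:
#                   node = node.next[level]  # highest smaller value on this level
#           node = node.next[0]              # then go down to the lowest linked list
#           return node is not None and node.value == value
#
#       def insert(self, value):
#           update, node = [self.head] * self.max_level, self.head
#           for level in reversed(range(self.max_level)):
#               while node.next[level] and node.next[level].value < value:
#                   node = node.next[level]
#               update[level] = node         # last node before `value` on each level
#           new = SkipNode(value, self._random_level())
#           for level in range(len(new.next)):
#               new.next[level] = update[level].next[level]
#               update[level].next[level] = new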
/=+===============================+=\
/ : : \
)==: TRIE :==(
\ :_______________________________: /
\=+===============================+=/
TRIE ==> #Search tree where:
# - keys are arrays
# - nodes store individual keys' items
# - i.e. root-to-terminal-node path contains full key
# - root node has no key item
#Also called "prefix tree"
#Goal:
# - similar to other search trees, but optimized for operations on arrays
# - only efficient when keys share common prefixes
#Time complexity:
# - min|average|max O(m) where m is min|average|max array length
# - but each node step should be fast as it involves a single key item
# - as opposed to e.g. a binary search tree, which compares the whole array at
# each step, i.e. complexity is actually O(m*log(n)) on arrays
#Keys are arrays, including:
# - strings
# - bits ("bitwise trie")
#Terminal node:
# - every leaf is a terminal node but internal nodes can be too
# - can be represented either:
# - with a flag on the node
# - sometimes called "white" (terminal) / "black" (non-terminal)
# - with an empty terminal node child
#Edges from parent to children can be represented in several ways:
# - dynamic array ("adaptive"):
# - O(n) lookup time
# - LCRS tree:
# - O(n) lookup time but more space efficient
# - binary search tree:
# - called (although confusingly) "ternary search tree"
# - O(log n) lookup time
# - often represented as a tree with outdegree 3 where:
# - left and right children are siblings (i.e. the binary search subtree)
# - middle child is actual child (i.e. the trie supertree)
# - static array:
# - with every possible combination, filled with null pointers when no child
# - O(1) lookup time but O(m) space requirement, where m is number of combinations
# - "alphabet reduction":
# - reducing number of combinations, e.g. nibble-wise (16) instead of byte-wise (256)
# - con: increases tree height
# - "array mapped trie": when using "array bit mapping" (see its doc)
# - "concurrent trie"/"ctrie":
# - using a 32 bits map and a compare-and-swap atomic CPU instruction, so that
# trie can be thread-safe and lock-free
#Operations:
# - finding nodes by key prefix
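#Example: a minimal Python sketch of a trie as nested dicts, with a None key used as the
#terminal-node flag (illustrative only):
#   def trie_insert(root, word):
#       node = root
#       for item in word:
#           node = node.setdefault(item, {})     # one node per key item
#       node[None] = True                        # mark terminal node
#
#   def trie_search(root, word):
#       node = root
#       for item in word:
#           if item not in node:
#               return False
#           node = node[item]
#       return None in node                      # full key matched a terminal node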
RADIX TREE ==> #Trie where nodes can store several key items, as opposed to just one.
#Goal:
# - more space efficient
# - more time efficient on reads
#Cons:
# - less space efficient if high density
# - less time efficient on writes
#Operations:
# - insertion: during initial traversal, if there is a partially matching child, split it
# - deletion: after deletion, merge single-child nodes to their child
DAFSA ==> #DAFSA (see state doc) can be used to represent tries in a more space efficient way.
#As opposed to normal representation:
# - simple cycles (but not directed ones) are allowed
# - i.e. nodes can share multiple parents as long as it does not introduce directed cycles
# - i.e. space efficient when many common suffixes
BURST TRIE ==> #Trie ("access trie") whose leaves are another data structure ("burst containers").
#Goal:
# - using time|space efficiency of trie where prefixes are shared
# - while using time|space efficiency of other data structures where suffixes are unique
#Common burst containers: list, search tree.
#Different strategies for when a burst should happen:
# - max size of burst containers
# - max average number of operations on burst container
/=+===============================+=\
/ : : \
)==: ROPE :==(
\ :_______________________________: /
\=+===============================+=/
ROPE ==> #Binary tree where:
# - the whole tree represents a single array accessed by index
# - leaves contain sub-arrays as value
# - each node keeps track of the total length ("weight") of the leaves in its left subtree (a leaf's weight is its own length)
#Also called "cord"
#Goal and properties:
# - similar to search trees
# - but optimized for accessing an array by index, not searching for a value
#Operations:
# - access:
# - O(log n)
# - if index <= current node's weight, go to left child; otherwise go to right child and subtract the weight from the index
# - concat:
# - O(log n) then O(1)
# - add new root with each tree as one branch
# - might need to re-balance tree
# - split:
# - O(log n)
# - access node at that index
# - go to first ancestor which is a right child of its parent
# - separate from ancestors and their left subtrees
# - separated nodes might not be connected, in which case they must be concatenated
# - might need to re-balance trees
# - insert:
# - O(log n)
# - split at the index, concat the left part with the new element, then concat that with the right part
# - delete:
# - O(log n)
# - two splits to isolate node, delete it then concat the rest
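#Example: a minimal Python sketch of 0-based rope access by index and concat (weights
#only; splitting and re-balancing omitted; illustrative only):
#   class RopeLeaf:
#       def __init__(self, text):
#           self.text = text
#           self.weight = len(text)          # a leaf's weight is its own length
#
#   class RopeNode:
#       def __init__(self, left, right):
#           self.left, self.right = left, right
#           self.weight = rope_length(left)  # total length of the left subtree
#
#   def rope_length(node):
#       if isinstance(node, RopeLeaf):
#           return node.weight
#       return node.weight + rope_length(node.right)
#
#   def rope_index(node, i):
#       if isinstance(node, RopeLeaf):
#           return node.text[i]
#       if i < node.weight:
#           return rope_index(node.left, i)
#       return rope_index(node.right, i - node.weight)   # skip over the left subtree
#
#   def rope_concat(left, right):
#       return RopeNode(left, right)         # new root; re-balancing not shown
#
#   rope_index(rope_concat(RopeLeaf("hello "), RopeLeaf("world")), 6)   # -> "w"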
CONC TREE|LIST ==> #Self-balanced binary tree where:
# - the whole tree represents a single array accessed by index
# - leaves contain single elements as value
# - node values keep track of height and size
#Operations:
# - access by index:
# - O(log n)
# - check children's sizes recursively until reaching the leaf
# - insert|concat by index:
# - O(log n)
# - on the path from root to the two leaves around that index
# - find the first node whose height differs by 0|1, and make them siblings under a new node
# - split: insert then remove that node
#Can be mixed with rope ("conc rope")