Skip to content

Latest commit

 

History

History
741 lines (360 loc) · 18 KB

CommonAlgorithm.md

File metadata and controls

741 lines (360 loc) · 18 KB

Common Algorithm

Sharks and Fish Model

  • rules
    • Sharks and fish live in a 2D ocean, moving, breeding, eating and dying.
    • The ocean is square and periodic, so fish swimming out to the left reenter at the right, and so on.
    • The ocean may either be discrete, where sharks and fish are constrained to move from one grid point to a neighboring grid point, or the ocean may be continuous.
    • In all cases, the sharks and fish move according to a "force law" which may be written
   force on a shark (or fish) =   force_External 
                                     (a current, felt independently by each 
                                      shark or fish, including a random component)
                                + force_Nearest_Neighbors 
                                     (sharks are strongly attracted by nearby fish)
                                + force_"Gravity"  
                                     (sharks are attracted to fish, and 
                                      fish repelled by sharks)

These three kinds of forces are parallelized in different ways: The external force can be computed independently for each fish, and will be the easiest force to parallelize. Forces which only depend on the nearest neighbors, or very close neighbors, require relatively little cooperation between processors and are next easiest to parallelize. Forces which depend on all other fish, like gravity, require the cleverest algorithms to compute efficiently (in serial or parallel).

Screen Shot 2022-02-09 at 9.31.03 PM

Screen Shot 2022-02-09 at 10.05.47 PM

Discrete Events System (Simulation)

Ref

  1. Berkeley CS 267 Lecture 5
  2. UCSB CS 240A
  • 特点
  1. model of world is discrete: time and space are discrete
  2. finite set of variables
  3. set of all variable at time t is called state
  4. each var update by computing transition function depend on othe rvaraibles
  • 分类

Screen Shot 2022-02-09 at 9.33.18 PM

asynchronize可以再细分

Screen Shot 2022-02-09 at 10.19.16 PM

Synchronous

Game of Life / S&F 3

fish的情况取决于neighbours

Screen Shot 2022-02-09 at 10.06.39 PM

每一个位置只依赖neighbour位置叫做stencil

Screen Shot 2022-02-09 at 9.38.09 PM

对domain decomposition切分的方法是square,而不是panel。square能最小process之间的comm

这里的切分相对简单,因为ocean是regular data structure

注意locality是通过分配每一个processor一大片相邻的ocean来处理的。

Screen Shot 2022-02-09 at 9.38.53 PM

ghost data

Screen Shot 2022-02-09 at 10.15.25 PM

Circuit Simulation

本质上也是domain decomposition,然后做切分。只不过涉及到多一层的graph partition的问题

Screen Shot 2022-02-09 at 9.41.38 PM

Asynchronous Conservative Scheduling

Screen Shot 2022-02-09 at 10.18.00 PM

容易出现deadlock的问题,因为只有event才能triggerupdate。

一个方法就是发送are you stuck too的信息然后进行验证deadlock,这样是serialized bottoelneck

Screen Shot 2022-02-09 at 10.21.20 PM

Particle Systems (Simulation)

Ref

  1. Berkeley CS 267 Lecture 5
  • 特点
  1. finite number of particle 而不是time/space, time and position are contigious
  2. moving in space follow newton's law f = ma
  • 例子

Screen Shot 2022-02-09 at 10.23.27 PM

Forces

Screen Shot 2022-02-09 at 10.24.29 PM

External Force
  • 特点
  1. 最简单parallel的force
  2. 每一个particle的force是独立的,不需要考虑其余的particle
  3. 可以使用简单的distribute particle on processors就可以做到。基于particle(不是domain)做切分

Nearby Force

  • 特点
  1. force 依赖nearby neighour
  2. eg collision
  3. 通过domain decomposition进行平行。

Screen Shot 2022-02-09 at 10.37.52 PM

Screen Shot 2022-02-09 at 10.38.22 PM

Far-Field Force

  • 特点
  1. force依赖全部的particle,是all to all interaction
  2. Eg gravity, protein folding
Simpleest algorithm

O(n^2)

通过把空间切分grid,然后把p个空间进行loop。这样每一个grid pair都会遇到彼此,从而计算对应force

Screen Shot 2022-02-09 at 10.41.07 PM

Particle-Mesh methods

O(n) / O(n log n)

Screen Shot 2022-02-09 at 10.43.06 PM

Tree Decomposition

Screen Shot 2022-02-09 at 10.44.04 PM

Lumped System / ODE (Simulation)

Ref

  1. Berkeley CS 267 Lecture 5
  • 特点
  1. location and entities are discrete (lumped), time is contigious
  2. system of lumped variables, depends on contigious parameters
    1. Lumped 是因为我们只计算endpoint/lump的值,不计算lump链接的edge
  • ODE

ordinary differential equation

differentiated w.r.t time

  • 分类 Given a set of ODEs, two kinds of questions are:
    • Compute the values of the variables at some time t
      • Explicit methods
      • Implicit methods
    • Compute modes of vibration
      • Eigenvalue problems
Circuit Example

Screen Shot 2022-02-09 at 10.47.40 PM

Explicit Methods

  • Forward Euler's Method

使用当前t的slope进行估算

Screen Shot 2022-02-09 at 10.52.20 PM

Implicit Methods

  • Backward Euler's Method

使用t+1的slope进行估算

Screen Shot 2022-02-09 at 10.53.06 PM

  • 分类

Screen Shot 2022-02-09 at 10.54.53 PM

Eigensolver

Screen Shot 2022-02-09 at 10.53.53 PM

Sparse Matrix Multiplication

CSR

Screen Shot 2022-02-09 at 11.03.44 PM

Parallel Sparse Matrix - Vector

  • common case

如何切分。

多个threads都会访问紫色的x,因为indirect addressing

Screen Shot 2022-02-09 at 11.05.10 PM

  • ideal case

processor之间没有沟通。

Screen Shot 2022-02-09 at 11.04.19 PM

Reordering

通过reordering sparse matrix内部数据,从而减少processor之间的communication

Screen Shot 2022-02-09 at 11.06.28 PM

Screen Shot 2022-02-09 at 11.08.13 PM

Continous System / PDE (Simulation)

Ref

  1. Berkeley CS 267 Lecture 6
  • 特点
  1. time and space are contigious
  • 分类

three kinds of PDE, two way to parallelize it

Screen Shot 2022-02-09 at 11.21.25 PM

Screen Shot 2022-02-09 at 11.24.30 PM

Explicit Methods

  • e.g Heat

一些基础的背景知识

Screen Shot 2022-02-09 at 11.28.44 PM

把问题转化成为了stencil computation的样子

Screen Shot 2022-02-09 at 11.29.28 PM

只不过这个stencil是weighted average of neighbour points

Screen Shot 2022-02-09 at 11.30.12 PM

可以转化为 matrix vector multiply。这里的matrix是sparse matrix

Screen Shot 2022-02-09 at 11.31.49 PM

explicit approach for parallel 也有问题。当time stamp大的时候就炸了

Screen Shot 2022-02-09 at 11.33.18 PM

Implicit Methods

  • e.g. heat on 1D

使用t+1的slope来进行估计

Screen Shot 2022-02-10 at 10.23.53 AM

实际上计算的时候会使用approximate的算法

问题也编程了sparse matrix vector求解的问题

Screen Shot 2022-02-10 at 10.24.19 AM

  • e.g. 2D implicit

也就是二维的5 poiny stencil

Screen Shot 2022-02-10 at 10.25.46 AM

Irregular Mesh

现实中的mesh经常是irregular的。

Screen Shot 2022-02-10 at 10.26.55 AM

需要把对应的mesh convert成matrix。然后进行matrix reorder来加速sparese matrix vector的计算

Screen Shot 2022-02-10 at 10.27.53 AM

Screen Shot 2022-02-10 at 10.28.05 AM

Stencil

参考

  1. MIT 6.172 Lecture15

stencil computation: update each point in an array by a fixed pattern.

Screen Shot 2021-10-18 at 2.58.28 PM

Basic

double u[2][N]; // even-odd trick

static inline double kernel(double * w)
{
    return w[0] + ALPHA * (w[-1] – 2*w[0] + w[1]);
}

for (size_t t = 1; t < T-1; ++t) { // time loop
  for(size_t x = 1; x < N-1; ++x)  // space loop
        u[(t+1)%2][x] = kernel( &u[t%2][x] );

Screen Shot 2021-10-18 at 3.01.48 PM

  • cache miss analysis

$Q = \Theta(NT/B)$ if N > M

each row need N/B number of cache miss, total T iterations.

  • 为什么loop的效果还可以

prefetching对于loop code的效果比较好。

但是当multi-core的时候,prefetch的效果就不好了,因为多个core都prefetch会导致bandwidth成为瓶颈。

Cache oblivious 3-point stencil

recursively travel trapezoidal region. 每一个trapezoid region至于内部的point相关。

Screen Shot 2021-10-18 at 3.03.53 PM

Screen Shot 2021-10-18 at 3.06.46 PM

  • base case

如果height=1, 计算all space-time points in trapezoid

Screen Shot 2021-10-18 at 3.05.28 PM

  • space cut

如果width >= 2 * height

right part depend on left part. left part not depend on right part

Screen Shot 2021-10-19 at 1.21.37 PM

  • time cut

如果width < 2 * height

Top part depend on bottom part. 首先便利下面,再便利上面

Screen Shot 2021-10-19 at 1.22.22 PM

  • cache miss analysis

recursive的调用

每一个leave有$\Theta(hw)$ points

每一个lead产生$\Theta(w/B)$ misses。因为当一行可以放在cache里的时候,往上面计算整个trapezoid都在cache里

$\Theta(NT/hw)$ leaves $$ \mathrm{Q}=\Theta(\mathrm{NT} / \mathrm{hw}) \cdot \Theta(\mathrm{w} / \mathcal{B})=\Theta\left(\mathrm{NT} / \mathcal{M}^{2}\right) \cdot \Theta(\mathcal{M} / \mathcal{B})=\Theta(\mathrm{NT} / \mathcal{M} \mathcal{B}) \ Q = \Theta(NT/M^{1/d}B) ~~\text{for d dims} $$

Parallel 3-point stencil

  • parallel space cut

黑色的部分可以parallel运算,然后再运算灰色的部分

Screen Shot 2021-10-19 at 1.27.39 PM

Screen Shot 2021-10-19 at 1.27.54 PM

Sort

参考

  1. MIT 6.172 Lecture15

Merge sort

void merge(int* C, int* A, int na, int* B, int nb)
{
  while( na > 0 && nb > 0 )
    if (*A <= *B )
      *C++ = *A++; na--;
    else
      *C++ = *B++; nb--;
  while( na > 0 )
    *C++ = *A++; na--;
  while( nb > 0 )
    *C++ = *B++; nb--;
}

void merge_sort(int* B, int* A, int n)
{
  if (n == 1 )
    B[0] = A[0];
  else
  {
    int C[n];
    cilk_spawn merge_sort( C, A, n/2 );
                       merge_sort( C+n/2, A+n/2, n-n/2);
    cilk_sync;
    merge( B, C, n/2, C+n/2, n-n/2 );
  }
}
  • work analysis

$$ \begin{aligned} W(n) &=2 W(n / 2)+\Theta(n) \\ &=\Theta(n \lg n) \end{aligned} $$

Screen Shot 2021-10-19 at 1.44.25 PM

  • cache miss analysis

$$ Q(n)=\left{\begin{array}{ll} \Theta(n / \mathcal{B}) & \text { if } n \leq c \mathcal{M}, \text { constant } c \leq 1 \ 2 Q(n / 2)+\Theta(n / \mathcal{B}) & \text { otherwise. } \end{array}\right. \

=\Theta((n / \mathcal{B}) \lg (n / \mathcal{M})) $$

当leave的时候,$\Theta(n/B)$, 每一次merge会产生$\Theta(n/B)$

Screen Shot 2021-10-19 at 2.11.24 PM

Multiway merging tournament

TODO 这个部分还没有总结

  • cache miss analysis

Bit Hack

参考

  1. MIT 6.172 L3
  2. Stanford Bit Twidding Hack http://graphics.stanford.edu/~seander/bithacks.html
~x = all bits inverse of x
-x = ~x + 1
   = inverse bit以后更改最后一个0为1
  • common operation

Screen Shot 2021-10-14 at 4.14.50 PM

  • set kth bit in a word to 1
y = x | (1 << k);

Screen Shot 2021-10-14 at 4.13.17 PM

  • clear kth bit in a word
y = x & ~(1 << k);

Screen Shot 2021-10-14 at 4.13.51 PM

  • flip the kth bit
y = x ^ (1 << k )

Screen Shot 2021-10-14 at 4.14.13 PM

  • extract a bit field (一串bit)
(x & mask) >> shift;

Screen Shot 2021-10-14 at 4.15.48 PM

  • Set a bit field
x = (x & ~mask) | ( ( y << shift) & mask );

![Screen Shot 2021-10-14 at 4.16.42 PM](Note.assets/Screen Shot 2021-10-14 at 4.16.42 PM.png)

  • swap

原因:xor is its own inverse

但是实际上因为这三行有dependency的关系,速度会比使用tmp variable的swap速度要慢,因为tmp variable swap还会使用到ILP(instruction level parallisim)

x = x^y; // mask of which position x,y different
y = x^y; // flip bits of y different of x
x = x^y; // flip bits of x different of y
  • non-branching min

branching因为有prefetch比较大的overhead,所以不是好事情

在实际代码中使用non-branching min来避免while中的if判断并不一定是好事情,因为compiler能够做更好的优化。很多时候会发成branchless 效果比 branch 的效果差的事情。

r = ( x < y ) ? x : y;
r = y ^ ( (x ^ y) & -(x < y));
  • mode

compute $(x+y)~ mod ~n$,假设x,y都在(0,n)中间

r = (x+y) % n;

// 优化版
z = x + y;
r = ( z < n ) ? z : z - n;

// 优化
z = x + y;
r = x - ( n & -(z >= n) );
  • mask of least significant bit
r = x & (-x);

Screen Shot 2021-10-14 at 4.27.55 PM

  • counting number of 1 bits in a word in x

传统方法. O(number of bit) complexity

for ( r = 0; x != 0; ++r )
  x &= x - 1;

divide and conquer 方法. O(length of word) complexity

Screen Shot 2021-10-14 at 6.09.56 PM

compiler方法:

int __buildtin_popcount(unsigned int x);