- rules
- Sharks and fish live in a 2D ocean, moving, breeding, eating and dying.
- The ocean is square and periodic, so fish swimming out to the left reenter at the right, and so on.
- The ocean may either be discrete, where sharks and fish are constrained to move from one grid point to a neighboring grid point, or the ocean may be continuous.
- In all cases, the sharks and fish move according to a "force law" which may be written
force on a shark (or fish) = force_External
(a current, felt independently by each
shark or fish, including a random component)
+ force_Nearest_Neighbors
(sharks are strongly attracted by nearby fish)
+ force_"Gravity"
(sharks are attracted to fish, and
fish repelled by sharks)
These three kinds of forces are parallelized in different ways: The external force can be computed independently for each fish, and will be the easiest force to parallelize. Forces which only depend on the nearest neighbors, or very close neighbors, require relatively little cooperation between processors and are next easiest to parallelize. Forces which depend on all other fish, like gravity, require the cleverest algorithms to compute efficiently (in serial or parallel).
Ref
- Berkeley CS 267 Lecture 5
- UCSB CS 240A
- 特点
- model of world is discrete: time and space are discrete
- finite set of variables
- set of all variable at time t is called state
- each var update by computing transition function depend on othe rvaraibles
- 分类
asynchronize可以再细分
fish的情况取决于neighbours
每一个位置只依赖neighbour位置叫做stencil
对domain decomposition切分的方法是square,而不是panel。square能最小process之间的comm
这里的切分相对简单,因为ocean是regular data structure
注意locality是通过分配每一个processor一大片相邻的ocean来处理的。
ghost data
本质上也是domain decomposition,然后做切分。只不过涉及到多一层的graph partition的问题
容易出现deadlock的问题,因为只有event才能triggerupdate。
一个方法就是发送are you stuck too的信息然后进行验证deadlock,这样是serialized bottoelneck
Ref
- Berkeley CS 267 Lecture 5
- 特点
- finite number of particle 而不是time/space, time and position are contigious
- moving in space follow newton's law f = ma
- 例子
- 特点
- 最简单parallel的force
- 每一个particle的force是独立的,不需要考虑其余的particle
- 可以使用简单的distribute particle on processors就可以做到。基于particle(不是domain)做切分
- 特点
- force 依赖nearby neighour
- eg collision
- 通过domain decomposition进行平行。
- 特点
- force依赖全部的particle,是all to all interaction
- Eg gravity, protein folding
O(n^2)
通过把空间切分grid,然后把p个空间进行loop。这样每一个grid pair都会遇到彼此,从而计算对应force
O(n) / O(n log n)
Ref
- Berkeley CS 267 Lecture 5
- 特点
- location and entities are discrete (lumped), time is contigious
- system of lumped variables, depends on contigious parameters
- Lumped 是因为我们只计算endpoint/lump的值,不计算lump链接的edge
- ODE
ordinary differential equation
differentiated w.r.t time
- 分类 Given a set of ODEs, two kinds of questions are:
- Compute the values of the variables at some time t
- Explicit methods
- Implicit methods
- Compute modes of vibration
- Eigenvalue problems
- Compute the values of the variables at some time t
- Forward Euler's Method
使用当前t的slope进行估算
- Backward Euler's Method
使用t+1的slope进行估算
- 分类
- common case
如何切分。
多个threads都会访问紫色的x,因为indirect addressing
- ideal case
processor之间没有沟通。
通过reordering sparse matrix内部数据,从而减少processor之间的communication
Ref
- Berkeley CS 267 Lecture 6
- 特点
- time and space are contigious
- 分类
three kinds of PDE, two way to parallelize it
- e.g Heat
一些基础的背景知识
把问题转化成为了stencil computation的样子
只不过这个stencil是weighted average of neighbour points
可以转化为 matrix vector multiply。这里的matrix是sparse matrix
explicit approach for parallel 也有问题。当time stamp大的时候就炸了
- e.g. heat on 1D
使用t+1的slope来进行估计
实际上计算的时候会使用approximate的算法
问题也编程了sparse matrix vector求解的问题
- e.g. 2D implicit
也就是二维的5 poiny stencil
现实中的mesh经常是irregular的。
需要把对应的mesh convert成matrix。然后进行matrix reorder来加速sparese matrix vector的计算
参考
- MIT 6.172 Lecture15
stencil computation: update each point in an array by a fixed pattern.
double u[2][N]; // even-odd trick
static inline double kernel(double * w)
{
return w[0] + ALPHA * (w[-1] – 2*w[0] + w[1]);
}
for (size_t t = 1; t < T-1; ++t) { // time loop
for(size_t x = 1; x < N-1; ++x) // space loop
u[(t+1)%2][x] = kernel( &u[t%2][x] );
- cache miss analysis
each row need N/B number of cache miss, total T iterations.
- 为什么loop的效果还可以
prefetching对于loop code的效果比较好。
但是当multi-core的时候,prefetch的效果就不好了,因为多个core都prefetch会导致bandwidth成为瓶颈。
recursively travel trapezoidal region. 每一个trapezoid region至于内部的point相关。
- base case
如果height=1, 计算all space-time points in trapezoid
- space cut
如果width >= 2 * height
right part depend on left part. left part not depend on right part
- time cut
如果width < 2 * height
Top part depend on bottom part. 首先便利下面,再便利上面
- cache miss analysis
recursive的调用
每一个leave有$\Theta(hw)$ points
每一个lead产生$\Theta(w/B)$ misses。因为当一行可以放在cache里的时候,往上面计算整个trapezoid都在cache里
- parallel space cut
黑色的部分可以parallel运算,然后再运算灰色的部分
参考
- MIT 6.172 Lecture15
void merge(int* C, int* A, int na, int* B, int nb)
{
while( na > 0 && nb > 0 )
if (*A <= *B )
*C++ = *A++; na--;
else
*C++ = *B++; nb--;
while( na > 0 )
*C++ = *A++; na--;
while( nb > 0 )
*C++ = *B++; nb--;
}
void merge_sort(int* B, int* A, int n)
{
if (n == 1 )
B[0] = A[0];
else
{
int C[n];
cilk_spawn merge_sort( C, A, n/2 );
merge_sort( C+n/2, A+n/2, n-n/2);
cilk_sync;
merge( B, C, n/2, C+n/2, n-n/2 );
}
}
- work analysis
- cache miss analysis
$$ Q(n)=\left{\begin{array}{ll} \Theta(n / \mathcal{B}) & \text { if } n \leq c \mathcal{M}, \text { constant } c \leq 1 \ 2 Q(n / 2)+\Theta(n / \mathcal{B}) & \text { otherwise. } \end{array}\right. \
=\Theta((n / \mathcal{B}) \lg (n / \mathcal{M})) $$
当leave的时候,$\Theta(n/B)$, 每一次merge会产生$\Theta(n/B)$
TODO 这个部分还没有总结
- cache miss analysis
参考
- MIT 6.172 L3
- Stanford Bit Twidding Hack http://graphics.stanford.edu/~seander/bithacks.html
~x = all bits inverse of x
-x = ~x + 1
= inverse bit以后更改最后一个0为1
- common operation
- set kth bit in a word to 1
y = x | (1 << k);
- clear kth bit in a word
y = x & ~(1 << k);
- flip the kth bit
y = x ^ (1 << k )
- extract a bit field (一串bit)
(x & mask) >> shift;
- Set a bit field
x = (x & ~mask) | ( ( y << shift) & mask );
![Screen Shot 2021-10-14 at 4.16.42 PM](Note.assets/Screen Shot 2021-10-14 at 4.16.42 PM.png)
- swap
原因:xor is its own inverse
但是实际上因为这三行有dependency的关系,速度会比使用tmp variable的swap速度要慢,因为tmp variable swap还会使用到ILP(instruction level parallisim)
x = x^y; // mask of which position x,y different
y = x^y; // flip bits of y different of x
x = x^y; // flip bits of x different of y
- non-branching min
branching因为有prefetch比较大的overhead,所以不是好事情
在实际代码中使用non-branching min来避免while中的if判断并不一定是好事情,因为compiler能够做更好的优化。很多时候会发成branchless 效果比 branch 的效果差的事情。
r = ( x < y ) ? x : y;
r = y ^ ( (x ^ y) & -(x < y));
- mode
compute
r = (x+y) % n;
// 优化版
z = x + y;
r = ( z < n ) ? z : z - n;
// 优化
z = x + y;
r = x - ( n & -(z >= n) );
- mask of least significant bit
r = x & (-x);
- counting number of 1 bits in a word in x
传统方法. O(number of bit) complexity
for ( r = 0; x != 0; ++r )
x &= x - 1;
divide and conquer 方法. O(length of word) complexity
compiler方法:
int __buildtin_popcount(unsigned int x);