▪ check number of cache misses
reports the hit/miss rates of each simulated cache level (I1, D1, LL)
valgrind --tool=cachegrind ./a.out
(on Valgrind 3.21 and later, cache simulation is off by default; add --cache-sim=yes)
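Cachegrind works by simulating the cache hierarchy while the program runs. The toy sketch below shows the idea with a single direct-mapped cache. The 32 KB size, the 64-byte lines, and the helper names `simulate_direct_mapped` and `matrix_addresses` are assumptions for illustration. This is not cachegrind's actual model, which simulates set-associative I1, D1 and LL caches. The sketch replays the byte addresses touched when summing a row-major `n x n` array of doubles, once in row order and once in column order.

```python
def simulate_direct_mapped(addresses, cache_bytes=32 * 1024, line_bytes=64):
    """Toy direct-mapped cache (assumed geometry): return (hits, misses)."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # one stored tag per cache line (1 way)
    hits = misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this byte
        index = block % num_lines      # the only line this block may occupy
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # evict whatever block was there
    return hits, misses


def matrix_addresses(n, elem_bytes=8, row_major=True):
    """Byte addresses touched when summing an n x n row-major array of doubles,
    visiting it in row order (row_major=True) or column order (False)."""
    for i in range(n):
        for j in range(n):
            r, c = (i, j) if row_major else (j, i)
            yield (r * n + c) * elem_bytes


if __name__ == "__main__":
    for order in (True, False):
        h, m = simulate_direct_mapped(matrix_addresses(256, row_major=order))
        print("row order" if order else "column order", h, m, h / (h + m))
```

With `n = 256`, row-order traversal misses once per 64-byte line (8192 misses, an 87.5% hit rate). Column-order traversal in this model misses on every access, because the 2 KB row stride maps many rows onto the same cache lines.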
- assignment policy
- create many tasks (the number of tasks is not bounded by the number of threads)
- create a limited number of worker threads
- for each worker: after completing its current task, the worker thread inspects the task list and assigns itself the next uncompleted task. Instead of launching n threads at once (one per task), a limited set of threads actively pulls work.
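The assignment policy above can be sketched with a shared counter that workers use to claim the next task. This is a minimal Python illustration under several assumptions. Tasks are zero-argument callables, and `run_task_pool` and `num_workers` are hypothetical names. Because of CPython's global interpreter lock, the threads demonstrate the scheduling pattern rather than a real CPU speedup.

```python
import threading


def run_task_pool(tasks, num_workers=4):
    """Run every task with a fixed number of worker threads (hypothetical helper).

    Each worker repeatedly claims the next uncompleted task index from a
    shared counter, instead of being handed a fixed block of tasks up front.
    """
    results = [None] * len(tasks)
    next_index = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_index
        while True:
            with lock:                  # claim the next task atomically
                i = next_index
                next_index += 1
            if i >= len(tasks):         # task list exhausted: worker exits
                return
            results[i] = tasks[i]()     # run the task outside the lock

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    tasks = [lambda k=k: k * k for k in range(10)]
    print(run_task_pool(tasks, num_workers=3))
```

A worker pulls new work only when it is free. As a result, one long-running task does not hold up the others the way a fixed up-front split can.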
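- characteristics
- commonly used to implement the fork-join model for divide-and-conquer algorithms
- has a work-stealing scheduler that is provably efficient (asymptotically optimal in theory)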
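- cilk_spawn
only declares that the child may run in parallel with the parent; it does not force it. Whether it actually runs in parallel is decided by the runtime system. The child runs asynchronously with respect to the parent.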
- cilk_sync
waits until all calls spawned by the current function have returned
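The `cilk_spawn`/`cilk_sync` pattern can be approximated in Python for illustration. This is not Cilk, and the function name `parallel_sum` and the `depth` cutoff are assumptions. In the sketch, a plain thread stands in for the spawned child and `join()` stands in for `cilk_sync`. The Cilk runtime decides whether a spawned call actually runs in parallel. This version instead always starts a thread above the cutoff and simply calls the child serially below it.

```python
import threading


def parallel_sum(xs, lo=0, hi=None, depth=3):
    """Fork-join divide-and-conquer sum of xs[lo:hi] (illustrative sketch)."""
    if hi is None:
        hi = len(xs)
    if hi - lo <= 1 or depth == 0:
        return sum(xs[lo:hi])          # base case / cutoff: plain serial call
    mid = (lo + hi) // 2
    left = [None]

    def child():
        left[0] = parallel_sum(xs, lo, mid, depth - 1)

    t = threading.Thread(target=child)  # "spawn": child may run in parallel
    t.start()
    right = parallel_sum(xs, mid, hi, depth - 1)  # parent keeps working
    t.join()                            # "sync": wait for the spawned child
    return left[0] + right


if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))
```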
- includes
- Linear Algebra: vector-vector, matrix-vector, matrix-matrix (BLAS levels 1, 2, 3)
- FFT
- Random number generator
- Sparse Linear Algebra Functions.
- sparse matrix with dense matrix
- sparse matrix with dense vector
- summary statistics
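The three linear-algebra categories correspond to the BLAS levels. The naive reference versions below only show what each level computes. They use hypothetical helper names `dot`, `gemv` and `gemm` and operate on Python lists. An optimized library supplies tuned routines for the same operations, which in BLAS naming are `?dot`, `?gemv` and `?gemm`.

```python
def dot(x, y):
    """Level 1 (vector-vector): x . y, O(n) work on O(n) data."""
    return sum(a * b for a, b in zip(x, y))


def gemv(A, x):
    """Level 2 (matrix-vector): A @ x, O(n^2) work on O(n^2) data."""
    return [dot(row, x) for row in A]


def gemm(A, B):
    """Level 3 (matrix-matrix): A @ B, O(n^3) work on O(n^2) data."""
    cols = list(zip(*B))               # columns of B as tuples
    return [[dot(row, col) for col in cols] for row in A]


if __name__ == "__main__":
    print(gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Level 3 is where blocking and vectorization pay off most. For n x n inputs it does O(n^3) work on O(n^2) data, so each loaded element can be reused many times.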
- reference
| Term | Value |
|---|---|
| Processor | Intel Xeon Phi Processor 7250 (Knights Landing) |
| Instruction sets | AVX2, AVX-512 |
| AVX-512 units | two AVX-512 vector pipelines per core, computing two vectors simultaneously |
| Peak | 44.8 GFLOPS/core (double precision) |
| L1 cache | 64 KB per core (32 KB instruction + 32 KB data), 8-way set-associative |
| L2 cache | 1 MB per tile, shared by the tile's 2 cores, 16-way set-associative |
| L3 cache | none |
| Clock speed | 1.4 GHz = 1.4 × 10^9 clock cycles per second |
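The peak figure in the table can be reproduced from the other rows. This assumes double precision (8 × 64-bit lanes per 512-bit vector) and counts each fused multiply-add (FMA) as 2 floating-point operations:

$$
1.4 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \times 2\ \text{pipelines} \times 8\ \tfrac{\text{doubles}}{\text{vector}} \times 2\ \tfrac{\text{FLOP}}{\text{FMA}} = 44.8 \times 10^{9}\ \text{FLOP/s per core}
$$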
| Term | Value |
|---|---|
| SMs (streaming multiprocessors) | 80 |
| Max resident warps per SM | 64 |
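The two rows above bound how many threads can be resident on the GPU at once. This assumes the standard NVIDIA warp size of 32 threads:

$$
80\ \text{SMs} \times 64\ \tfrac{\text{warps}}{\text{SM}} \times 32\ \tfrac{\text{threads}}{\text{warp}} = 163{,}840\ \text{resident threads}
$$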