
parallel reduction #36

Closed · brentp opened this issue Aug 22, 2019 · 6 comments

brentp commented Aug 22, 2019

hi, I wanted to try out laser. I have this code working:

import laser/openmp  # assumed import for omp_get_max_threads / omp_parallel_chunks_default

proc pmin(s: var seq[float32]): float32 {.noInline.} =

  var min_by_thread = newSeq[float32](omp_get_max_threads())
  for v in min_by_thread.mitems:
    v = float32.high

  omp_parallel_chunks_default(s.len, chunk_offset, chunk_size):
   #[
    attachGC()
    min_by_thread[omp_get_thread_num()] = min(
        min_by_thread[omp_get_thread_num()],
        min(s[chunk_offset..<(chunk_offset + chunk_size)])
        )
    detachGC()
    ]#

    var thread_min = min_by_thread[omp_get_thread_num()]
    #echo chunk_offset, " ", chunk_size

    for idx in chunk_offset ..< chunk_offset + chunk_size:
      thread_min = min(s[idx], thread_min)
    min_by_thread[omp_get_thread_num()] = thread_min

  result = min(min_by_thread)

Do I need an omp_critical section for the final result, and/or are there any other problems?
And here is my calling code, adapted from your examples/:

import random, sequtils, times, strformat  # assumed imports for rand, newSeqWith, cpuTime and `&`

proc main() =
  randomize(42) # Reproducibility
  var x = newSeqWith(800_000_000, float32 rand(1.0))
  x[200_000_001] = -42.0'f32
  echo omp_get_num_threads(), " ", omp_get_max_threads()

  var t = cpuTime()
  let m = min(x)

  echo "serial  :", m, &" in {cpuTime() - t:.2f} seconds"

  for i in 0..10:
    t = cpuTime()
    let mp = x.pmin()
    doAssert abs(mp - m) < 1e-10
    echo "parallel:", mp, &" in {cpuTime() - t:.2f} seconds"

main()
mratsim (Owner) commented Aug 24, 2019

Laser is still in research mode so plenty of things are implemented but not properly exposed in a high-level API.

To do a reduction, you can follow the same pattern as the existing sum reduction:

I will create min and max tomorrow, so that they are ready to use.

Alternatively, if you use a Tensor, there are 4 ways to do a parallel reduction in this example: https://github.com/numforge/laser/blob/af191c086b4a98c49049ecf18f5519dc6856cc77/examples/ex05_tensor_parallel_reduction.nim#L9-L95

Note that the underlying forEachStaged macro doesn't require a Tensor exactly, just a type that exposes rank, size, shape, strides and unsafe_raw_data, as described here: https://github.com/numforge/laser/tree/master/laser/strided_iteration#strided-parallel-iteration-for-tensors. So it works with a seq if those are defined.
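For illustration, a minimal wrapper over a seq exposing those accessors might look roughly like this (a sketch only, not laser's actual API; forEachStaged may expect laser's own Metadata type for shape/strides rather than plain arrays):

# Thin 1-D view over a seq providing rank/size/shape/strides/unsafe_raw_data.
type SeqView[T] = object
  buf: ptr UncheckedArray[T]
  len: int

func rank[T](v: SeqView[T]): int = 1
func size[T](v: SeqView[T]): int = v.len
func shape[T](v: SeqView[T]): array[1, int] = [v.len]
func strides[T](v: SeqView[T]): array[1, int] = [1]
func unsafe_raw_data[T](v: SeqView[T]): ptr UncheckedArray[T] = v.buf

proc toView[T](s: var seq[T]): SeqView[T] =
  # The view borrows the seq's buffer; keep the seq alive while using it.
  SeqView[T](buf: cast[ptr UncheckedArray[T]](s[0].addr), len: s.len)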

mratsim (Owner) commented Aug 25, 2019

I've added reduce_min and reduce_max (and renamed sum_kernel to reduce_sum) in #39.

They only work for float32 at the moment, but if needed it's easy to extend them to other types.

brentp (Author) commented Aug 25, 2019

thanks very much for your links and the new reduce_min stuff. I can get this to work from the laser src directory, but if I move it elsewhere I get a long traceback ending with:

In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/x86intrin.h:35:0,
                 from /home/brentp/.cache/nim/pmin_r/.nimble7pkgs7Laser-0.0.17laser7primitives7simd__math7reductions__sse3.nim.c:10:
/usr/lib/gcc/x86_64-linux-gnu/5/include/pmmintrin.h:68:1: error: inlining failed in call to always_inline ‘_mm_movehdup_ps’: target specific option mismatch
 _mm_movehdup_ps (__m128 __X)
 ^
/home/brentp/.cache/nim/pmin_r/.nimble7pkgs7Laser-0.0.17laser7primitives7simd__math7reductions__sse3.nim.c:56:7: error: called from here
  shuf = _mm_movehdup_ps(vec);

I can move the same file containing:

import
  random, sequtils,
  laser/primitives/reductions

proc main() =
  let interval = -1f .. 1f
  let size = 10_000_000
  let buf = newSeqWith(size, rand(interval))
  echo reduce_min(buf[0].unsafeAddr, buf.len)

main()

in and out of ~/src/laser; it works inside that directory and fails outside of it.
I am compiling with: nim c -d:openmp -d:danger -d:fastmath -a -r pmin.nim

brentp (Author) commented Aug 25, 2019

btw, this gives a nearly 5X speedup on my laptop for my example use-case, so this will be a nice improvement!

mratsim (Owner) commented Aug 25, 2019

That's unfortunately one of Nim's limitations.

If you look into the reductions_sse3 file, it calls min_ps_sse3 (https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/simd_math/reductions_sse3.nim#L59), which uses SSE3 intrinsics from https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/private/sse3_utils.nim#L8-L18.

On x86_64 the compiler can only assume SSE2 support, and more advanced SIMD instructions require an explicit compiler flag.

As I want the library to have a fallback when SSE3 is not available, I can't just use {.passC:"-msse3".} globally (though you can).
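For a downstream project that is happy to require SSE3 everywhere, that pragma can simply go near the top of the main module:

# applies -msse3 to the whole C build, like passing --passC:"-msse3" on the command line
{.passC: "-msse3".}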

So the SSE3 flag is passed per-file (instead of globally) via an undocumented feature of nim.cfg: https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/nim.cfg#L32.

So you need to add yourfilename.always = "-msse3" to your nim.cfg if you use the primitive outside of laser (see the example after this paragraph).
Note that I don't define sse3_utils.always because min_ps_sse3 is inline and therefore not present in the generated sse3_utils C file.
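For example, assuming the compiled file is pmin.nim as in the snippet above, the project's nim.cfg would gain a line like:

# nim.cfg: per-file C compiler flag ("pmin" is the name of the module being compiled)
pmin.always = "-msse3"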

Ultimately, @Araq said that he wants to provide a way to specify per-file compilation flags inside the .nim file itself, which would be very helpful.

brentp (Author) commented Aug 26, 2019

got it. thanks for the explanation.

brentp closed this as completed Aug 26, 2019