
parallel reduction #36

Closed · brentp opened this issue Aug 22, 2019 · 6 comments

brentp commented Aug 22, 2019

hi, I wanted to try out laser. I have this code working:

import laser/openmp  # assumed import for omp_get_max_threads / omp_parallel_chunks_default

proc pmin(s: var seq[float32]): float32 {.noInline.} =

  var min_by_thread = newSeq[float32](omp_get_max_threads())
  for v in min_by_thread.mitems:
    v = float32.high

  omp_parallel_chunks_default(s.len, chunk_offset, chunk_size):
   #[
    attachGC()
    min_by_thread[omp_get_thread_num()] = min(
        min_by_thread[omp_get_thread_num()],
        min(s[chunk_offset..<(chunk_offset + chunk_size)])
        )
    detachGC()
    ]#

    var thread_min = min_by_thread[omp_get_thread_num()]
    #echo chunk_offset, " ", chunk_size

    for idx in chunk_offset ..< chunk_offset + chunk_size:
      thread_min = min(s[idx], thread_min)
    min_by_thread[omp_get_thread_num()] = thread_min

  result = min(min_by_thread)

Do I need an omp_critical section for the final result, and/or are there any other problems?
And here is my calling code, adapted from your examples/:

import random, sequtils, times, strformat  # assumed imports for rand, newSeqWith, cpuTime and `&`

proc main() =
  randomize(42) # Reproducibility
  var x = newSeqWith(800_000_000, float32 rand(1.0))
  x[200_000_001] = -42.0'f32
  echo omp_get_num_threads(), " ", omp_get_max_threads()

  var t = cpuTime()
  let m = min(x)

  echo "serial  :", m, &" in {cpuTime() - t:.2f} seconds"

  for i in 0..10:
    t = cpuTime()
    let mp = x.pmin()
    doAssert abs(mp - m) < 1e-10
    echo "parallel:", mp, &" in {cpuTime() - t:.2f} seconds"

main()
mratsim (Owner) commented Aug 24, 2019

Laser is still in research mode so plenty of things are implemented but not properly exposed in a high-level API.

To do a reduction, you can follow the same pattern as the existing sum reduction:

I will create min and max tomorrow, so that they are ready to use.

Alternatively, if you use a Tensor, there are 4 ways to do a parallel reduction in this example: https://github.com/numforge/laser/blob/af191c086b4a98c49049ecf18f5519dc6856cc77/examples/ex05_tensor_parallel_reduction.nim#L9-L95

Note that the underlying forEachStaged macro doesn't require a Tensor exactly, just a type that exposes rank, size, shape, strides and unsafe_raw_data, as described here: https://github.com/numforge/laser/tree/master/laser/strided_iteration#strided-parallel-iteration-for-tensors. So it works with a seq if those are defined.
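For illustration, a minimal wrapper over a seq exposing those accessors might look roughly like this (a sketch only, not laser's actual API; forEachStaged may expect laser's own Metadata type for shape/strides rather than plain arrays):

# Thin 1-D view over a seq providing rank/size/shape/strides/unsafe_raw_data.
type SeqView[T] = object
  buf: ptr UncheckedArray[T]
  len: int

func rank[T](v: SeqView[T]): int = 1
func size[T](v: SeqView[T]): int = v.len
func shape[T](v: SeqView[T]): array[1, int] = [v.len]
func strides[T](v: SeqView[T]): array[1, int] = [1]
func unsafe_raw_data[T](v: SeqView[T]): ptr UncheckedArray[T] = v.buf

proc toView[T](s: var seq[T]): SeqView[T] =
  # The view borrows the seq's buffer; keep the seq alive while using it.
  SeqView[T](buf: cast[ptr UncheckedArray[T]](s[0].addr), len: s.len)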

mratsim (Owner) commented Aug 25, 2019

I've added reduce_min and reduce_max (and renamed sum_kernel to reduce_sum) in #39.

They only work for float32 at the moment, but if needed it's easy to extend them to other types.

brentp (Author) commented Aug 25, 2019

thanks very much for your links and the new reduce_min stuff. I can get this to work from the laser src directory, but if I move it elsewhere I get a long traceback ending with:

In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/x86intrin.h:35:0,
                 from /home/brentp/.cache/nim/pmin_r/.nimble7pkgs7Laser-0.0.17laser7primitives7simd__math7reductions__sse3.nim.c:10:
/usr/lib/gcc/x86_64-linux-gnu/5/include/pmmintrin.h:68:1: error: inlining failed in call to always_inline ‘_mm_movehdup_ps’: target specific option mismatch
 _mm_movehdup_ps (__m128 __X)
 ^
/home/brentp/.cache/nim/pmin_r/.nimble7pkgs7Laser-0.0.17laser7primitives7simd__math7reductions__sse3.nim.c:56:7: error: called from here
  shuf = _mm_movehdup_ps(vec);

I can move the same file containing:

import
  random, sequtils,
  laser/primitives/reductions

proc main() =
  let interval = -1f .. 1f
  let size = 10_000_000
  let buf = newSeqWith(size, rand(interval))
  echo reduce_min(buf[0].unsafeAddr, buf.len)

main()

in and out of ~/src/laser; it works inside that directory and fails outside of it.
I am compiling with: nim c -d:openmp -d:danger -d:fastmath -a -r pmin.nim

brentp (Author) commented Aug 25, 2019

btw, this gives a nearly 5X speedup on my laptop for my example use-case, so this will be a nice improvement!

mratsim (Owner) commented Aug 25, 2019

That's unfortunately one of Nim's limitations.

If you look into the reductions_sse3 file, it calls min_ps_sse3 (https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/simd_math/reductions_sse3.nim#L59), which uses SSE3 intrinsics from https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/private/sse3_utils.nim#L8-L18.

On x86_64 the compiler can only assume SSE2 support, and more advanced SIMD instructions require an explicit compiler flag.

As I want the library to have a fallback when SSE3 is not available, I can't just use {.passC:"-msse3".} globally (though you can).
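For a downstream project that is happy to require SSE3 everywhere, that pragma can simply go near the top of the main module:

# applies -msse3 to the whole C build, like passing --passC:"-msse3" on the command line
{.passC: "-msse3".}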

So the SSE3 flag is passed per-file (instead of globally) via an undocumented feature of nim.cfg: https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/nim.cfg#L32.

So you need to add yourfilename.always = "-msse3" to your nim.cfg if you use the primitive outside of laser (see the example after this paragraph).
Note that I don't define sse3_utils.always because min_ps_sse3 is inline and therefore not present in the generated sse3_utils C file.
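For example, assuming the compiled file is pmin.nim as in the snippet above, the project's nim.cfg would gain a line like:

# nim.cfg: per-file C compiler flag ("pmin" is the name of the module being compiled)
pmin.always = "-msse3"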

Ultimately, @Araq said that he wants to provide a way to specify per-file compilation flags inside the .nim file itself, which would be very helpful.

brentp (Author) commented Aug 26, 2019

got it. thanks for the explanation.

brentp closed this as completed Aug 26, 2019