[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668

qindazhu · 2021-02-19T03:35:27Z

Just to show the idea of implementing such kernels (e.g. in Append/Stack) with one kernel, there are still a lot of TODOs, just make a PR so that we can fix them later as this task is not a P1 task currently.

Here are the benchmarks:

The header is IndexAxis0_(New)_NumAxes_Dim0_NumElements_..._AverageTime

IndexAxis0_2_46_500,int32,kCuda,50,20,65.080002
IndexAxis0_2_91_1001,int32,kCuda,100,20,64.686401
IndexAxis0_2_225_2018,int32,kCuda,200,20,65.086395
IndexAxis0_2_583_5002,int32,kCuda,500,20,65.670403
IndexAxis0_2_1112_10026,int32,kCuda,1000,20,65.625603
IndexAxis0_2_11029_100001,int32,kCuda,10000,20,67.537598
IndexAxis0_2_55306_500023,int32,kCuda,50000,20,93.817604
IndexAxis0_2_110946_1000025,int32,kCuda,100000,20,65.908806
IndexAxis0_2_222602_2000008,int32,kCuda,200000,20,172.318390

IndexAxis0New_2_46_500,int32,kCuda,50,20,63.302395
IndexAxis0New_2_91_1001,int32,kCuda,100,20,61.766396
IndexAxis0New_2_225_2018,int32,kCuda,200,20,61.952000
IndexAxis0New_2_583_5002,int32,kCuda,500,20,62.108803
IndexAxis0New_2_1112_10026,int32,kCuda,1000,20,62.512001
IndexAxis0New_2_11029_100001,int32,kCuda,10000,20,62.790401
IndexAxis0New_2_55306_500023,int32,kCuda,50000,20,62.396801
IndexAxis0New_2_110946_1000025,int32,kCuda,100000,20,62.252804
IndexAxis0New_2_222602_2000008,int32,kCuda,200000,20,78.288002


IndexAxis0_3_11_500,int32,kCuda,50,20,88.022392
IndexAxis0_3_14_1031,int32,kCuda,100,20,85.728004
IndexAxis0_3_23_2029,int32,kCuda,200,20,88.552002
IndexAxis0_3_65_5002,int32,kCuda,500,20,88.560005
IndexAxis0_3_119_10007,int32,kCuda,1000,20,88.529594
IndexAxis0_3_1261_100015,int32,kCuda,10000,20,88.505608
IndexAxis0_3_6192_500003,int32,kCuda,50000,20,88.671997
IndexAxis0_3_12424_1000003,int32,kCuda,100000,20,88.560005
IndexAxis0_3_24963_2000004,int32,kCuda,200000,20,97.596802

IndexAxis0New_3_11_500,int32,kCuda,50,20,65.598396
IndexAxis0New_3_14_1031,int32,kCuda,100,20,49.012802
IndexAxis0New_3_23_2029,int32,kCuda,200,20,68.468796
IndexAxis0New_3_65_5002,int32,kCuda,500,20,68.617599
IndexAxis0New_3_119_10007,int32,kCuda,1000,20,67.755196
IndexAxis0New_3_1261_100015,int32,kCuda,10000,20,90.870399
IndexAxis0New_3_6192_500003,int32,kCuda,50000,20,211.142395
IndexAxis0New_3_12424_1000003,int32,kCuda,100000,20,122.919998
IndexAxis0New_3_24963_2000004,int32,kCuda,200000,20,290.727997


IndexAxis0_4_6_500,int32,kCuda,50,20,146.524796
IndexAxis0_4_4_1047,int32,kCuda,100,20,7.521600
IndexAxis0_4_10_2003,int32,kCuda,200,20,147.745590
IndexAxis0_4_10_5003,int32,kCuda,500,20,146.284790
IndexAxis0_4_15_10001,int32,kCuda,1000,20,147.030396
IndexAxis0_4_141_100004,int32,kCuda,10000,20,147.188797
IndexAxis0_4_703_500025,int32,kCuda,50000,20,147.051208
IndexAxis0_4_1328_1000030,int32,kCuda,100000,20,147.732788
IndexAxis0_4_2722_2000014,int32,kCuda,200000,20,146.982407

IndexAxis0New_4_6_500,int32,kCuda,50,20,76.761597
IndexAxis0New_4_4_1047,int32,kCuda,100,20,7.526400
IndexAxis0New_4_10_2003,int32,kCuda,200,20,78.582405
IndexAxis0New_4_10_5003,int32,kCuda,500,20,76.758400
IndexAxis0New_4_15_10001,int32,kCuda,1000,20,76.278397
IndexAxis0New_4_141_100004,int32,kCuda,10000,20,147.857605
IndexAxis0New_4_703_500025,int32,kCuda,50000,20,81.131195
IndexAxis0New_4_1328_1000030,int32,kCuda,100000,20,697.676819
IndexAxis0New_4_2722_2000014,int32,kCuda,200000,20,557.192017

We can see that for large size the new approach will be much slower for the old one, one main reason is that we did not process multiple elements in one thread as the old approach did (so that we can save the time to index new_offsets and. old_offsets). ModernGpu did not support this, we need to write kernel by ourselves. Another reason is that there are a few if in the kernel.

danpovey · 2021-02-19T03:51:17Z

k2/csrc/ragged_ops.cu

+      return ans;
+    }
+    mgpu::context_t *mgpu_context = GetModernGpuAllocator(c);
+    auto lambda_set_ans = [=] __device__(int32_t index, int32_t seg,


Would be nice to have some documentation of the args of this lambda.
The documntation of transform_lbs
https://moderngpu.github.io/doc/api.html
unfortunately doesn't seem to explain this.

.. I know this is not for merge, but some explanation would be nice.

Sure, mainly

index is the index of element (i.e. idx01)
seg is the row id (i.e. idx0),
rank is the index in current seg/row (i.e. idx1)

also added the documentation in the code, thanks.

danpovey · 2021-02-19T06:42:05Z

OK, interesting.
If we were to want to merge this, I think there would need to be more comments, as that function is quite confusing due to its (pre-existing) complex design. But if not, I guess there's no need.

qindazhu added 4 commits February 5, 2021 10:16

using pattern trasform_scan to merge two kenerls into one

6082f93

delete benchmark code as they should not be merged

4147985

delete useless header

c8d516e

try to implement the loop on axis with one kernel for IndexAxis0

bbe1d9e

danpovey reviewed Feb 19, 2021

View reviewed changes

add comments for transform_lbs

14383d7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668

[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668

qindazhu commented Feb 19, 2021

danpovey Feb 19, 2021

danpovey Feb 19, 2021

qindazhu Feb 19, 2021

danpovey commented Feb 19, 2021

[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668

Are you sure you want to change the base?

[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668

Conversation

qindazhu commented Feb 19, 2021

Here are the benchmarks:

danpovey Feb 19, 2021

Choose a reason for hiding this comment

danpovey Feb 19, 2021

Choose a reason for hiding this comment

qindazhu Feb 19, 2021

Choose a reason for hiding this comment

danpovey commented Feb 19, 2021