Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

qindazhu
Copy link
Collaborator

Just to show the idea of implementing such kernels (e.g. in Append/Stack) with one kernel, there are still a lot of TODOs, just make a PR so that we can fix them later as this task is not a P1 task currently.

Here are the benchmarks:

The header is IndexAxis0_(New)_NumAxes_Dim0_NumElements_..._AverageTime

IndexAxis0_2_46_500,int32,kCuda,50,20,65.080002
IndexAxis0_2_91_1001,int32,kCuda,100,20,64.686401
IndexAxis0_2_225_2018,int32,kCuda,200,20,65.086395
IndexAxis0_2_583_5002,int32,kCuda,500,20,65.670403
IndexAxis0_2_1112_10026,int32,kCuda,1000,20,65.625603
IndexAxis0_2_11029_100001,int32,kCuda,10000,20,67.537598
IndexAxis0_2_55306_500023,int32,kCuda,50000,20,93.817604
IndexAxis0_2_110946_1000025,int32,kCuda,100000,20,65.908806
IndexAxis0_2_222602_2000008,int32,kCuda,200000,20,172.318390

IndexAxis0New_2_46_500,int32,kCuda,50,20,63.302395
IndexAxis0New_2_91_1001,int32,kCuda,100,20,61.766396
IndexAxis0New_2_225_2018,int32,kCuda,200,20,61.952000
IndexAxis0New_2_583_5002,int32,kCuda,500,20,62.108803
IndexAxis0New_2_1112_10026,int32,kCuda,1000,20,62.512001
IndexAxis0New_2_11029_100001,int32,kCuda,10000,20,62.790401
IndexAxis0New_2_55306_500023,int32,kCuda,50000,20,62.396801
IndexAxis0New_2_110946_1000025,int32,kCuda,100000,20,62.252804
IndexAxis0New_2_222602_2000008,int32,kCuda,200000,20,78.288002


IndexAxis0_3_11_500,int32,kCuda,50,20,88.022392
IndexAxis0_3_14_1031,int32,kCuda,100,20,85.728004
IndexAxis0_3_23_2029,int32,kCuda,200,20,88.552002
IndexAxis0_3_65_5002,int32,kCuda,500,20,88.560005
IndexAxis0_3_119_10007,int32,kCuda,1000,20,88.529594
IndexAxis0_3_1261_100015,int32,kCuda,10000,20,88.505608
IndexAxis0_3_6192_500003,int32,kCuda,50000,20,88.671997
IndexAxis0_3_12424_1000003,int32,kCuda,100000,20,88.560005
IndexAxis0_3_24963_2000004,int32,kCuda,200000,20,97.596802

IndexAxis0New_3_11_500,int32,kCuda,50,20,65.598396
IndexAxis0New_3_14_1031,int32,kCuda,100,20,49.012802
IndexAxis0New_3_23_2029,int32,kCuda,200,20,68.468796
IndexAxis0New_3_65_5002,int32,kCuda,500,20,68.617599
IndexAxis0New_3_119_10007,int32,kCuda,1000,20,67.755196
IndexAxis0New_3_1261_100015,int32,kCuda,10000,20,90.870399
IndexAxis0New_3_6192_500003,int32,kCuda,50000,20,211.142395
IndexAxis0New_3_12424_1000003,int32,kCuda,100000,20,122.919998
IndexAxis0New_3_24963_2000004,int32,kCuda,200000,20,290.727997


IndexAxis0_4_6_500,int32,kCuda,50,20,146.524796
IndexAxis0_4_4_1047,int32,kCuda,100,20,7.521600
IndexAxis0_4_10_2003,int32,kCuda,200,20,147.745590
IndexAxis0_4_10_5003,int32,kCuda,500,20,146.284790
IndexAxis0_4_15_10001,int32,kCuda,1000,20,147.030396
IndexAxis0_4_141_100004,int32,kCuda,10000,20,147.188797
IndexAxis0_4_703_500025,int32,kCuda,50000,20,147.051208
IndexAxis0_4_1328_1000030,int32,kCuda,100000,20,147.732788
IndexAxis0_4_2722_2000014,int32,kCuda,200000,20,146.982407

IndexAxis0New_4_6_500,int32,kCuda,50,20,76.761597
IndexAxis0New_4_4_1047,int32,kCuda,100,20,7.526400
IndexAxis0New_4_10_2003,int32,kCuda,200,20,78.582405
IndexAxis0New_4_10_5003,int32,kCuda,500,20,76.758400
IndexAxis0New_4_15_10001,int32,kCuda,1000,20,76.278397
IndexAxis0New_4_141_100004,int32,kCuda,10000,20,147.857605
IndexAxis0New_4_703_500025,int32,kCuda,50000,20,81.131195
IndexAxis0New_4_1328_1000030,int32,kCuda,100000,20,697.676819
IndexAxis0New_4_2722_2000014,int32,kCuda,200000,20,557.192017

We can see that for large size the new approach will be much slower for the old one, one main reason is that we did not process multiple elements in one thread as the old approach did (so that we can save the time to index new_offsets and. old_offsets). ModernGpu did not support this, we need to write kernel by ourselves. Another reason is that there are a few if in the kernel.

return ans;
}
mgpu::context_t *mgpu_context = GetModernGpuAllocator(c);
auto lambda_set_ans = [=] __device__(int32_t index, int32_t seg,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have some documentation of the args of this lambda.
The documntation of transform_lbs
https://moderngpu.github.io/doc/api.html
unfortunately doesn't seem to explain this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. I know this is not for merge, but some explanation would be nice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, mainly

index is the index of element (i.e. idx01)
seg is the row id (i.e. idx0),
rank is the index in current seg/row (i.e. idx1)

also added the documentation in the code, thanks.

@danpovey
Copy link
Collaborator

OK, interesting.
If we were to want to merge this, I think there would need to be more comments, as that function is quite confusing due to its (pre-existing) complex design. But if not, I guess there's no need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants