
[QUESTION] Why Warp's tile-based matmul is much slower than torch's one? #461

Open
chaoming0625 opened this issue Jan 27, 2025 · 1 comment
Labels
question The issue author requires information

Comments

@chaoming0625
I have tried Warp 1.6.0 and benchmarked its tile-based matmul against torch's matmul. Warp appears to be much slower. I am wondering why, and whether this performance gap can be closed in a future release?

| TILE_M | TILE_N | TILE_K | BLOCK | Warp Time | Torch Time | Relative |
|--------|--------|--------|-------|-----------|------------|----------|
| 64     | 64     | 64     | 256   | 981.684936  | 363.559419 | 2.700 |
| 64     | 64     | 64     | 512   | 1121.447108 | 363.559419 | 3.085 |
| 64     | 64     | 64     | 1024  | 1146.522702 | 363.559419 | 3.154 |
| 64     | 64     | 128    | 256   | 1436.224992 | 363.559419 | 3.950 |
| 64     | 64     | 128    | 512   | 1050.912843 | 363.559419 | 2.891 |
| 64     | 64     | 128    | 1024  | 1039.730605 | 363.559419 | 2.860 |
| 64     | 128    | 64     | 256   | 1321.610127 | 363.559419 | 3.635 |
| 64     | 128    | 64     | 512   | 1231.751565 | 363.559419 | 3.388 |
| 64     | 128    | 64     | 1024  | 1123.240676 | 363.559419 | 3.090 |

Thanks.

@chaoming0625 chaoming0625 added the question The issue author requires information label Jan 27, 2025
@shi-eric
Contributor

Hi @chaoming0625, there are various improvements on the way to close the performance gap between cuBLAS and cuBLASDx. Could you please share complete details about your benchmark so that we can understand this comparison better?

  • GPU
  • Memory clock, SM clock
  • Data type
  • Matrix size
  • Benchmark script
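
A complete benchmark script would also pin down the timing methodology itself, which can skew comparisons like the table above. Below is a minimal CPU-only sketch of a sound harness (warm-up runs, then the median of repeated timed runs), with NumPy matmuls standing in for the Warp kernel and torch baseline; the `bench` helper and all names are hypothetical, and a real GPU benchmark would additionally need device synchronization (e.g. `wp.synchronize()` / `torch.cuda.synchronize()`) before stopping each timer:

```python
# Hypothetical sketch of a matmul benchmark harness: warm up first, then
# report the median wall-clock time over several iterations. NumPy stands
# in for the Warp tile kernel and the torch baseline.
import time
import numpy as np

def bench(fn, warmup=3, iters=10):
    """Return the median wall-clock time of fn() in microseconds."""
    for _ in range(warmup):       # discard cold-start runs (JIT, caches)
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()                      # on a GPU, synchronize before reading the clock
        times.append((time.perf_counter() - t0) * 1e6)
    return float(np.median(times))

M = N = K = 256
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

warp_us = bench(lambda: a @ b)    # stand-in for the Warp tile kernel
torch_us = bench(lambda: a @ b)   # stand-in for the torch baseline
print(f"{'Warp Time':<14}{'Torch Time':<14}{'Relative':<10}")
print(f"{warp_us:<14.3f}{torch_us:<14.3f}{warp_us / torch_us:<10.4f}")
```

Reporting the median (rather than a single run) reduces noise from clock jitter and background load, and separating warm-up from measurement avoids charging one-time JIT compilation to the kernel.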
