
[DEPRECATION] Discussion on Fused attention and QiGEN #655

Open
Qubitium opened this issue Apr 27, 2024 · 5 comments

Comments

@Qubitium
Contributor

Qubitium commented Apr 27, 2024

@PanQiWei @LaaZa @fxmarty @qwopqwop200

I want to start a discussion on a major refactor, or more precisely on hacking off unsupported or flat-out broken features in the current tree.

  • fused attention
  • qigen
  • triton v1

Questions:

  • Is anyone still using these?
  • Fused attention is broken as of the latest transformers release. Confirmed by @LaaZa and @fxmarty.
  • Is fused attention actually faster than Marlin even when working properly? (A rough benchmark sketch follows at the end of this comment.)
  • Is SYCL ([SYCL] Intel SYCL runtime support for AutoGPTQ #638) a candidate to replace qigen? Intel staff are willing to actively support this in AutoGPTQ. @abhilash1910
  • Do we really need to support two Triton kernels? @qwopqwop200 feels there is no need for v1, as v2 has been battle-tested and covers everything v1 does.

With vLLM's new Marlin kernel, which will support almost all group sizes and act-order, do we even need fused attention? #653

EDIT: added triton v1/v2 to the discussion
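
To make the speed question concrete, here is a minimal benchmark sketch, not a definitive comparison: it assumes a recent AutoGPTQ where `from_quantized` accepts `inject_fused_attention` and `use_marlin`, and the checkpoint name is just a placeholder.

```python
import time

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/Llama-2-7B-GPTQ"  # placeholder GPTQ checkpoint


def bench(**load_kwargs) -> float:
    """Load the quantized model with the given backend flags and time a fixed generation."""
    model = AutoGPTQForCausalLM.from_quantized(MODEL, device="cuda:0", **load_kwargs)
    tok = AutoTokenizer.from_pretrained(MODEL)
    ids = tok("Benchmark prompt:", return_tensors="pt").input_ids.to("cuda:0")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(ids, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    del model  # free VRAM before loading the next backend
    torch.cuda.empty_cache()
    return elapsed


print(f"fused attention: {bench(inject_fused_attention=True):.2f}s")
print(f"marlin:          {bench(use_marlin=True, inject_fused_attention=False):.2f}s")
```

Whether fused attention wins this kind of comparison on any recent GPU is exactly the open question above.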

@qwopqwop200
Collaborator

1. Anyone still using these?

Maybe, but probably not.

2. Is fused attention actually faster than Marlin even when working properly?

No. I think fused attention is more of a legacy feature, and it would be good to get rid of it.

3. Is SYCL ([SYCL] Intel SYCL runtime support for AutoGPTQ #638) a candidate to replace qigen? Intel staff are willing to actively support this in AutoGPTQ.

qigen is a kernel that makes inference possible on the CPU. If this SYCL backend can also run inference on the CPU, it seems like a good idea to remove qigen.

4. With vLLM's new Marlin kernel that will support almost all group sizes and act-order, do we even need fused attention?

Maybe, but probably not.

Additionally, it seems like a good idea to remove triton v1 and replace it with triton v2, since v2 supports all of v1's features and is faster.
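
For reference, a minimal sketch of what dropping triton v1 would mean for callers, assuming the existing `use_triton` flag on `from_quantized` is kept and simply resolves to the v2 QuantLinear; the checkpoint name is a placeholder.

```python
from auto_gptq import AutoGPTQForCausalLM

# Sketch only: use_triton=True currently selects the Triton kernel path;
# under this proposal it would always mean the triton v2 kernel, since
# v1 would be removed entirely.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ checkpoint
    device="cuda:0",
    use_triton=True,
)
```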

@Qubitium
Contributor Author

Qubitium commented Apr 29, 2024

QBits (Intel, PenghuiCheng) #660 is another qigen alternative and is actively supported by Intel.

@zhewang1-intc

Hi @Qubitium ,

We greatly appreciate your interest in QBits. For a comprehensive introduction to QBits, please refer to the RFC. It's worth noting that QBits is still under active development, and we're committed to continuous improvement in both performance and features.

Performance enhancements:

  1. Hybrid architecture CPU optimization: We're working on in-depth performance optimization for P/E core scheduling on hybrid architecture CPUs (12th Gen Core processor and beyond).

  2. GEMV op optimization: We're also optimizing performance for GEMV-like operations.

  3. AVX2 instruction optimization: For client CPUs based on AVX2 instructions, we're continuously optimizing performance.

Feature enhancements:

  1. Support for more bit weights: We plan to support more bit weights in the future, such as 2/3 bits, and even 5/6/7 bits.

Regarding PR #660 replacing qigen:

We wonder whether PR #660 can totally replace qigen. If it can't, what other efforts should we take?

@qwopqwop200
Collaborator

I think the current QBits can replace all parts of Qigen except the 2- and 3-bit kernels.
Qigen code: https://github.com/IST-DASLab/QIGen/tree/master

@zhewang1-intc

I think the current QBits can replace all parts of Qigen except the 2- and 3-bit kernels. Qigen code: https://github.com/IST-DASLab/QIGen/tree/master

Hi, ITREX will release its next version in late May, which will support 2/3-bit linear layers.
