
[DEPRECATION] Discussion on Fused attention and QiGEN #655

Open
Qubitium opened this issue Apr 27, 2024 · 5 comments

Comments

@Qubitium
Contributor

Qubitium commented Apr 27, 2024

@PanQiWei @LaaZa @fxmarty @qwopqwop200

I want to start a discussion on a major refactor, or more precisely on hacking off unsupported or flat-out broken features in the current tree.

  • fused attention
  • qigen
  • triton v1

Questions:

  • Is anyone still using these?
  • Fused attention is broken as of the latest transformers release. Confirmed by @LaaZa and @fxmarty.
  • Is fused attention actually faster than Marlin even when working properly? (A rough benchmark sketch follows at the end of this comment.)
  • Is SYCL ([SYCL] Intel SYCL runtime support for AutoGPTQ #638) a candidate to replace qigen? Intel staff are willing to actively support this in AutoGPTQ. @abhilash1910
  • Do we really need to support two Triton kernels? @qwopqwop200 feels there is no need for v1, as v2 has been battle-tested and covers everything v1 does.

With vLLM's new Marlin kernel, which will support almost all group sizes and act-order, do we even need fused attention? #653

EDIT: added triton v1/v2 to the discussion
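
To make the speed question concrete, here is a minimal benchmark sketch, not a definitive comparison: it assumes a recent AutoGPTQ where `from_quantized` accepts `inject_fused_attention` and `use_marlin`, and the checkpoint name is just a placeholder.

```python
import time

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/Llama-2-7B-GPTQ"  # placeholder GPTQ checkpoint


def bench(**load_kwargs) -> float:
    """Load the quantized model with the given backend flags and time a fixed generation."""
    model = AutoGPTQForCausalLM.from_quantized(MODEL, device="cuda:0", **load_kwargs)
    tok = AutoTokenizer.from_pretrained(MODEL)
    ids = tok("Benchmark prompt:", return_tensors="pt").input_ids.to("cuda:0")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(ids, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    del model  # free VRAM before loading the next backend
    torch.cuda.empty_cache()
    return elapsed


print(f"fused attention: {bench(inject_fused_attention=True):.2f}s")
print(f"marlin:          {bench(use_marlin=True, inject_fused_attention=False):.2f}s")
```

Whether fused attention wins this kind of comparison on any recent GPU is exactly the open question above.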

@qwopqwop200
Collaborator

1. Anyone still using these?

Maybe, but probably not.

2. Is fused attention actually faster than Marlin even when working properly?

No. I think fused attention is more of a legacy feature, and it would be good to get rid of it.

3. Is SYCL ([SYCL] Intel SYCL runtime support for AutoGPTQ #638) a candidate to replace qigen? Intel staff are willing to actively support this in AutoGPTQ.

qigen is a kernel that makes inference possible on the CPU. If this SYCL backend can also run inference on the CPU, it seems like a good idea to remove qigen.

4. With vLLM's new Marlin kernel that will support almost all group sizes and act-order, do we even need fused attention?

Maybe, but probably not.

Additionally, it seems like a good idea to remove triton v1 and replace it with triton v2, since v2 supports all of v1's features and is faster.
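
For reference, a minimal sketch of what dropping triton v1 would mean for callers, assuming the existing `use_triton` flag on `from_quantized` is kept and simply resolves to the v2 QuantLinear; the checkpoint name is a placeholder.

```python
from auto_gptq import AutoGPTQForCausalLM

# Sketch only: use_triton=True currently selects the Triton kernel path;
# under this proposal it would always mean the triton v2 kernel, since
# v1 would be removed entirely.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ checkpoint
    device="cuda:0",
    use_triton=True,
)
```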

@Qubitium
Contributor Author

Qubitium commented Apr 29, 2024

QBits (Intel, PenghuiCheng) #660 is another qigen alternative and is actively supported by Intel.

@zhewang1-intc

Hi @Qubitium ,

We greatly appreciate your interest in QBits. For a comprehensive introduction to QBits, please refer to the RFC. It's worth noting that QBits is still under active development, and we're committed to continuous improvement in both performance and features.

Performance enhancements:

  1. Hybrid architecture CPU optimization: We're working on in-depth performance optimization for P/E core scheduling on hybrid architecture CPUs (12th Gen Core processor and beyond).

  2. GEMV op optimization: We're also optimizing performance for GEMV-like operations.

  3. AVX2 instruction optimization: For client CPUs based on AVX2 instructions, we're continuously optimizing performance.

Feature enhancements:

  1. Support for more bit weights: We plan to support more bit weights in the future, such as 2/3 bits, and even 5/6/7 bits.

Regarding PR #660 replacing qigen:

We wonder whether PR #660 can totally replace qigen. If it can't, what other efforts should we take?

@qwopqwop200
Collaborator

I think the current QBits can replace all parts of Qigen except the 2- and 3-bit kernels.
Qigen code: https://github.com/IST-DASLab/QIGen/tree/master

@zhewang1-intc

I think the current QBits can replace all parts of Qigen except the 2- and 3-bit kernels. Qigen code: https://github.com/IST-DASLab/QIGen/tree/master

Hi, ITREX will release its next version in late May, which will support 2/3-bit linear layers.
