The XL Model and the latest DeepSpeed #111

Open
mgrankin opened this issue May 28, 2023 · 0 comments
Time flies swiftly in the world of ML. Sparse models have lost their popularity, and the code for them is no longer maintained. The older version of Triton isn't compatible with modern hardware, and DeepSpeed's sparse attention functionality doesn't work with the newer Triton versions.

However, there's good news: a workaround exists. Sparse attention isn't strictly necessary for the XL model to run; it can be converted into a dense model. Simply remove the sparse_attention block from the deepspeed_config file, and voilà: every sparse layer becomes a dense layer. This setup works with the existing weights as-is, so no retraining is needed. That makes sense, because sparse attention is essentially a mask that zeroes out most attention weights; thanks to the softmax, most attention weights are near zero most of the time anyway, with only the most important tokens carrying significant weight.
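The edit above amounts to deleting one top-level key from the DeepSpeed config JSON. A minimal sketch (the surrounding keys and the sparse_attention contents here are placeholders, not the repo's actual config):

```python
import json

# Placeholder config -- only the "sparse_attention" key matters for this workaround.
config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},
    "sparse_attention": {   # remove this whole block to get dense layers
        "mode": "fixed",
        "block": 16,
    },
}

# Without a "sparse_attention" section, DeepSpeed builds ordinary dense
# attention layers, which accept the same weights.
config.pop("sparse_attention", None)

print(json.dumps(config, indent=2))
```

In practice you would load your deepspeed_config file, drop the key, and save it back; everything else in the config stays untouched.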

It's worthwhile to test the perplexity of the resulting dense model; its performance may even improve. The underlying principle is similar to dropout: when dropout is turned off at inference time, quality is often better than during training.
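For the perplexity comparison, the usual recipe is to score the same held-out text with both variants and exponentiate the mean per-token negative log-likelihood. A generic sketch (the NLL values below are made up for illustration, not measurements from this model):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs for the same held-out text under each variant:
sparse_nlls = [3.1, 2.8, 3.4, 2.9]
dense_nlls = [3.0, 2.7, 3.3, 2.9]

print("sparse ppl:", perplexity(sparse_nlls))
print("dense  ppl:", perplexity(dense_nlls))
```

If the dense model's perplexity on the same data is equal or lower, the conversion cost nothing.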

Most importantly, you don't need a custom-built DeepSpeed for inferring the resulting model. Enjoy!
