The XL Model and the latest DeepSpeed #111

Open
mgrankin opened this issue May 28, 2023 · 0 comments
Time flies swiftly in the world of ML. Sparse models have lost their popularity, and the code for them is no longer maintained. The older version of Triton isn't compatible with modern hardware, and DeepSpeed's sparse attention functionality doesn't work with the newer Triton versions.

However, there's good news: a workaround exists. Sparse attention isn't strictly necessary for the XL model to run; it can be converted into a dense model. Simply remove the sparse_attention block from the deepspeed_config file, and voilà: every sparse layer becomes a dense layer. This setup works with the existing weights as-is, so no retraining is needed. That makes sense, because sparse attention is essentially a mask that zeroes out most attention weights; thanks to the softmax, most attention weights are near zero most of the time anyway, with only the most important tokens carrying significant weight.
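The edit above amounts to deleting one top-level key from the DeepSpeed config JSON. A minimal sketch (the surrounding keys and the sparse_attention contents here are placeholders, not the repo's actual config):

```python
import json

# Placeholder config -- only the "sparse_attention" key matters for this workaround.
config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},
    "sparse_attention": {   # remove this whole block to get dense layers
        "mode": "fixed",
        "block": 16,
    },
}

# Without a "sparse_attention" section, DeepSpeed builds ordinary dense
# attention layers, which accept the same weights.
config.pop("sparse_attention", None)

print(json.dumps(config, indent=2))
```

In practice you would load your deepspeed_config file, drop the key, and save it back; everything else in the config stays untouched.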

It's worthwhile to test the perplexity of the resulting dense model; its performance may even improve. The underlying principle is similar to dropout: when dropout is turned off at inference time, quality is often better than during training.
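For the perplexity comparison, the usual recipe is to score the same held-out text with both variants and exponentiate the mean per-token negative log-likelihood. A generic sketch (the NLL values below are made up for illustration, not measurements from this model):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs for the same held-out text under each variant:
sparse_nlls = [3.1, 2.8, 3.4, 2.9]
dense_nlls = [3.0, 2.7, 3.3, 2.9]

print("sparse ppl:", perplexity(sparse_nlls))
print("dense  ppl:", perplexity(dense_nlls))
```

If the dense model's perplexity on the same data is equal or lower, the conversion cost nothing.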

Most importantly, you don't need a custom-built DeepSpeed for inferring the resulting model. Enjoy!
