Issues: microsoft/DeepSpeed
[REQUEST] Deepspeed Inference Supports VL (vision language) model
enhancement
New feature or request
#6917
opened Dec 26, 2024 by
ethen8181
[BUG] Cannot access local variable 'locations' where it is not associated with a value
bug
Something isn't working
compression
#6913
opened Dec 25, 2024 by
Guodanding
[BUG] FAILED: multi_tensor_adam.cuda.o with
bug
Something isn't working
training
#6912
opened Dec 24, 2024 by
XueruiSu
[BUG] Convergence Issue: Training BERT for Embedding with Zero2 and 3 as compared to Torchrun
bug
Something isn't working
training
#6911
opened Dec 24, 2024 by
dawnik17
[BUG] RuntimeError: The size of tensor a (2048) must match the size of tensor b (1024) at non-singleton dimension 2
bug
Something isn't working
deepspeed-chat
Related to DeepSpeed-Chat
#6910
opened Dec 24, 2024 by
Lowlowlowlowlowlow
[REQUEST] is fp8 training supported?
enhancement
New feature or request
#6908
opened Dec 24, 2024 by
janelu9
[BUG] RuntimeError: Unable to JIT load the fp_quantizer op due to it not being compatible due to hardware/software issue. FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
bug
Something isn't working
compression
#6906
opened Dec 23, 2024 by
GHBigD
[BUG] triton kernel, loss 0, grad-norm nan
bug
Something isn't working
training
#6902
opened Dec 22, 2024 by
mdy666
[REQUEST] Support for XLA/TPU
enhancement
New feature or request
#6901
opened Dec 21, 2024 by
radna0
prterun noticed that process rank 7 with PID 0 on node gpu0304 exited on signal 6 (Aborted).
#6896
opened Dec 19, 2024 by
fabiogeraci
DeepSpeed with ZeRO3 strategy cannot build 'fused_adam'
bug
Something isn't working
training
#6892
opened Dec 18, 2024 by
LeonardoZini
How can DeepSpeed be configured to prevent the merging of parameter groups
#6878
opened Dec 16, 2024 by
CLL112
How do I know if stage-3 is a success by using deepspeed?
training
#6877
opened Dec 16, 2024 by
hwhyyds
[BUG] Cannot use --hostfile to start multi-node training in Docker.
bug
Something isn't working
training
#6875
opened Dec 16, 2024 by
Ind1x1
Windows wheel build error - Tried everything with all requirements you have
build
Improvements to the build and testing systems.
windows
#6871
opened Dec 14, 2024 by
FurkanGozukara
[BUG] Invalidate trace cache @ step 10: expected module 11, but got module 19
bug
Something isn't working
training
#6870
opened Dec 14, 2024 by
yafuly
[BUG] Mismatch of model parameters when using Sequence Parallel
bug
Something isn't working
training
#6868
opened Dec 13, 2024 by
chetwin-character
[BUG] When fine-tuning an LLM, the following error occurs after training for some time: self.optimizer.param_groups[param_group_id]['params'] = [] IndexError: list index out of range
bug
Something isn't working
training
#6857
opened Dec 12, 2024 by
tdtgi