Constructing a KNN index at the end of each epoch #16915
-
Hi, Aidar.
-
Hi Samil, thank you for the suggestion. Sorry for the late response. It took some time for me to adapt this code snippet to my use case. The error I'm getting: I checked the documentation. It seems like the accelerator object does not have
-
Can you try passing pl_module to the concat_all_gather function and calling all_gather on pl_module?
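A minimal sketch of what that could look like; the concat_all_gather helper is an assumption based on the snippet discussed above, while LightningModule.all_gather is the real Lightning API (it stacks a tensor across DDP processes into shape (world_size, *tensor.shape)):

import torch

def concat_all_gather(tensor: torch.Tensor, pl_module) -> torch.Tensor:
    # pl_module.all_gather stacks the tensor from every DDP process,
    # giving shape (world_size, *tensor.shape); flatten the first two
    # dims so downstream code sees one combined batch.
    gathered = pl_module.all_gather(tensor)
    return gathered.reshape(-1, *tensor.shape[1:])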
-
I'm here again :) I modified the code for my task, but the structure is more or less the same. My model gets stuck after the callback execution. Any ideas why that could happen? I am using a Kaggle kernel with 2 GPUs. These prints are executed only once:
Additionally, I can't see the validation metrics I calculated in the logger, and the GPUs remain busy, so I guess it is a deadlock, but I don't really understand why it happens. The notebook is available at this link: https://www.kaggle.com/code/aidarkhuzin1/codebertdsl
-
After a deeper look, I realized that only one process reaches the code after the validation loader loop. Take a look at this code snippet:
-
I actually use an alternative approach which doesn't involve any external callbacks like this KNNOnlineEvaluator. In the LightningDataModule's val_dataloader method, return both train_datasets and val_datasets as val_dataloaders.
Since validation dataloaders are processed sequentially, you can access both the train and val data in validation_step.
Note that you should add "self.trainer.strategy.barrier()" after all_gather calls. If one of the 4 processes enters the barrier while another is still calling all_gather, it will wait forever.
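For illustration, a minimal sketch of that pattern. The class names, the faiss-style self.index, and the evaluate_against_index helper are assumptions; returning a list from val_dataloader and the dataloader_idx argument are standard Lightning behavior:

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class KNNDataModule(pl.LightningDataModule):
    def val_dataloader(self):
        # Train data first: the index is populated before the real
        # validation batches are scored against it.
        return [
            DataLoader(self.train_dataset, batch_size=64),
            DataLoader(self.val_dataset, batch_size=64),
        ]

class KNNModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        emb = self(batch)  # assumes the batch is the model input
        # Gather this batch's embeddings from every DDP process ...
        emb = self.all_gather(emb).reshape(-1, emb.shape[-1])
        # ... then synchronize so no rank runs ahead into another
        # collective call while a peer is still inside all_gather.
        self.trainer.strategy.barrier()
        if dataloader_idx == 0:
            self.index.add(emb.cpu().numpy())  # populate the KNN index
        else:
            self.evaluate_against_index(emb)   # compute KNN metrics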
-
Can you please help me with
I explicitly check that
What are the possible reasons for such behavior?
-
My model needs to build an index for inference and metrics calculation. I understand that I can populate the index with training outputs. However, the embedding the model produces for a given batch (the first batch, say) at the beginning of an epoch will differ from the one it produces at the end. So I want to iterate over the training dataset and populate the index at the end of each epoch.
One solution that comes to mind is to do it manually with a for loop and a suitable callback, as sketched below, but I would like to use the Loop API if possible. Additionally, I'm training the model with the DDP strategy, so I would need to manage the distribution myself (which I would like to avoid).
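For concreteness, here is a rough sketch of that manual-callback approach. Everything in it is an assumption: the build_index hook, the tensor-only batches, and that trainer.datamodule.train_dataloader() yields equally sized per-rank shards (all_gather requires matching shapes across ranks, which a DistributedSampler ensures by padding):

import torch
import pytorch_lightning as pl

class RebuildIndexCallback(pl.Callback):
    @torch.no_grad()
    def on_train_epoch_end(self, trainer, pl_module):
        pl_module.eval()
        shard = []
        # Each DDP rank embeds its portion of the training data ...
        for batch in trainer.datamodule.train_dataloader():
            shard.append(pl_module(batch.to(pl_module.device)))
        emb = torch.cat(shard)
        # ... then the shards are gathered so every rank can build
        # the full index locally.
        emb = pl_module.all_gather(emb).reshape(-1, emb.shape[-1])
        trainer.strategy.barrier()  # keep ranks in lockstep
        pl_module.build_index(emb.cpu())  # hypothetical hook on the model
        pl_module.train()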
Maybe I'm missing some functionality. Thanks for your attention.
Best regards, Aidar Khuzin