Constructing a KNN index at the end of each epoch #16915
-
Hi, Aidar.
-
Hi Samil, thank you for the suggestion. Sorry for the late response. It took some time for me to adapt this code snippet to my use case. The error I'm getting: I checked the documentation. It seems like the accelerator object does not have
-
Can you try passing pl_module to the concat_all_gather function and calling all_gather on pl_module?
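A minimal sketch of what that could look like; the concat_all_gather helper is an assumption based on the snippet discussed above, while LightningModule.all_gather is the real Lightning API (it stacks a tensor across DDP processes into shape (world_size, *tensor.shape)):

import torch

def concat_all_gather(tensor: torch.Tensor, pl_module) -> torch.Tensor:
    # pl_module.all_gather stacks the tensor from every DDP process,
    # giving shape (world_size, *tensor.shape); flatten the first two
    # dims so downstream code sees one combined batch.
    gathered = pl_module.all_gather(tensor)
    return gathered.reshape(-1, *tensor.shape[1:])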
-
I'm here again :) I modified the code for my task, but the structure is more or less the same. My model gets stuck after the callback execution. Any ideas why that could happen? I am using a Kaggle kernel with 2 GPUs. These prints are executed only once:
Additionally, I can't see the validation metrics I calculated in the logger, and the GPUs remain busy, so I guess it is a deadlock, but I don't really understand why it happens. The notebook is available at this link: https://www.kaggle.com/code/aidarkhuzin1/codebertdsl
-
After a deeper look, I realized that only one process reaches the code after the validation loader loop. Take a look at this code snippet:
-
I actually use an alternative approach which doesn't involve any external callbacks like this KNNOnlineEvaluator. In the LightningDataModule's val_dataloader method, return both train_datasets and val_datasets as val_dataloaders.
Since validation dataloaders are processed sequentially, you can access both the train and val data in validation_step.
Note that you should add "self.trainer.strategy.barrier()" after all_gather calls. If one of the 4 processes enters the barrier while another is still calling all_gather, it will wait forever.
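For illustration, a minimal sketch of that pattern. The class names, the faiss-style self.index, and the evaluate_against_index helper are assumptions; returning a list from val_dataloader and the dataloader_idx argument are standard Lightning behavior:

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class KNNDataModule(pl.LightningDataModule):
    def val_dataloader(self):
        # Train data first: the index is populated before the real
        # validation batches are scored against it.
        return [
            DataLoader(self.train_dataset, batch_size=64),
            DataLoader(self.val_dataset, batch_size=64),
        ]

class KNNModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        emb = self(batch)  # assumes the batch is the model input
        # Gather this batch's embeddings from every DDP process ...
        emb = self.all_gather(emb).reshape(-1, emb.shape[-1])
        # ... then synchronize so no rank runs ahead into another
        # collective call while a peer is still inside all_gather.
        self.trainer.strategy.barrier()
        if dataloader_idx == 0:
            self.index.add(emb.cpu().numpy())  # populate the KNN index
        else:
            self.evaluate_against_index(emb)   # compute KNN metrics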
-
Can you please help me with
I explicitly check that
What are the possible reasons for such behavior?
-
My model needs to build an index for inference and metrics calculation. I understand that I can populate the index with training outputs. However, the embedding the model produces for a given batch (the first batch, say) at the beginning of an epoch will differ from the one it produces at the end. So I want to iterate over the training dataset and populate the index at the end of each epoch.
One solution that comes to mind is to do it manually with a for loop and a suitable callback, as sketched below, but I would like to use the Loop API if possible. Additionally, I'm training the model with the DDP strategy, so I would need to manage the distribution myself (which I would like to avoid).
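For concreteness, here is a rough sketch of that manual-callback approach. Everything in it is an assumption: the build_index hook, the tensor-only batches, and that trainer.datamodule.train_dataloader() yields equally sized per-rank shards (all_gather requires matching shapes across ranks, which a DistributedSampler ensures by padding):

import torch
import pytorch_lightning as pl

class RebuildIndexCallback(pl.Callback):
    @torch.no_grad()
    def on_train_epoch_end(self, trainer, pl_module):
        pl_module.eval()
        shard = []
        # Each DDP rank embeds its portion of the training data ...
        for batch in trainer.datamodule.train_dataloader():
            shard.append(pl_module(batch.to(pl_module.device)))
        emb = torch.cat(shard)
        # ... then the shards are gathered so every rank can build
        # the full index locally.
        emb = pl_module.all_gather(emb).reshape(-1, emb.shape[-1])
        trainer.strategy.barrier()  # keep ranks in lockstep
        pl_module.build_index(emb.cpu())  # hypothetical hook on the model
        pl_module.train()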
Maybe I'm missing some functionality. Thanks for your attention.
Best regards, Aidar Khuzin