The gradient does not seem to be updated during BERT training. #13741
just fyi!
Just in case you are using a pre-trained model, are you sure it's loaded correctly? If it still doesn't work, can you share both pieces of code so we can check further?
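For example, one quick way to verify (a minimal sketch; the checkpoint file name is hypothetical and `model` is assumed to be your instantiated module) is to load the state_dict with strict=False and inspect what didn't match:

```python
import torch

# Hypothetical checkpoint path, for illustration only.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
result = model.load_state_dict(state_dict, strict=False)

# Both lists should be empty (or nearly so) if the checkpoint really matches the model.
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```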
Thanks for your answer. As you pointed out, I'm using pre-trained BERT weights, loaded by the code below. It worked fine before I switched to pytorch-lightning. I checked the BERT weight path in the namespace, and the pre-trained state_dict seems to be loaded well. (Note: the pre-trained model is not a pytorch-lightning checkpoint; it's a plain PyTorch state_dict.)

Pre-trained weight loader

Here's my pre-trained model loading code:

class PreTrainedBertModel(pl.LightningModule):
""" An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
def __init__(self, config, *inputs, **kwargs):
super(PreTrainedBertModel, self).__init__()
if not isinstance(config, BertConfig):
raise ValueError(
"Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
"To create a model from a Google pretrained model use "
"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
self.__class__.__name__, self.__class__.__name__
))
self.config = config
def init_bert_weights(self, module):
""" Initialize the weights.
"""
if isinstance(module, (nn.Linear, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, BertLayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and module.bias is not None:
module.bias.data.zero_()
@classmethod
def from_pretrained(cls, pretrained_model_name, state_dict=None, cache_dir=None, dictconfig: dict = None, *inputs, **kwargs):
"""
Instantiate a PreTrainedBertModel from a pre-trained model file or a pytorch state dict.
Download and cache the pre-trained model file if needed.
Params:
pretrained_model_name: either:
- a str with the name of a pre-trained model to load, selected from the list of:
. `bert-base-uncased`
. `bert-large-uncased`
. `bert-base-cased`
. `bert-large-cased`
. `bert-base-multilingual-uncased`
. `bert-base-multilingual-cased`
. `bert-base-chinese`
- a path or url to a pretrained model archive containing:
. `bert_config.json` a configuration file for the model
. `pytorch_model.vocabs` a PyTorch dump of a BertForPreTraining instance
cache_dir: an optional path to a folder in which the pre-trained model weights will be cached.
state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of the Google pre-trained weights
*inputs, **kwargs: additional input for the specific Bert class
(ex: num_labels for BertForSequenceClassification)
"""
model_config_map = {
'PreTrainedBertModel': BertConfig,
'BertModel': BertConfig,
'BertForPreTraining': BertConfig,
'BertForMaskedLM': BertConfig,
'BertForSequenceClassification': BertForSequenceClassificationConfig,
'BertForSentimentAnalysis': BertForSentimentAnalysisConfig,
'BertForNextSentencePrediction': BertConfig,
'BertForTokenClassification': BertConfig,
'BertForQuestionAnswering': BertConfig,
'BertForMultipleChoice': BertConfig,
}
bert_configuration = model_config_map.get(cls.__name__)
if pretrained_model_name in PRETRAINED_MODEL_ARCHIVE_MAP:
archive_file = PRETRAINED_MODEL_ARCHIVE_MAP[pretrained_model_name]
else:
archive_file = pretrained_model_name
# redirect to the cache, if necessary
try:
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
except FileNotFoundError:
logger.error(
"Model trainers '{}' was not found in model trainers list ({}). "
"We assumed '{}' was a path or url but couldn't find any file "
"associated to this path or url.".format(
pretrained_model_name,
', '.join(PRETRAINED_MODEL_ARCHIVE_MAP.keys()),
archive_file))
return None
if resolved_archive_file == archive_file:
logger.info("loading archive file {}".format(archive_file))
else:
logger.info("loading archive file {} from cache at {}".format(
archive_file, resolved_archive_file))
tempdir = None
if os.path.isdir(resolved_archive_file):
serialization_dir = resolved_archive_file
else:
# Extract archive to temp dir
tempdir = tempfile.mkdtemp()
logger.info("extracting archive file {} to temp dir {}".format(
resolved_archive_file, tempdir))
with tarfile.open(resolved_archive_file, 'r:gz') as archive:
archive.extractall(tempdir)
serialization_dir = tempdir
# Load config
if isinstance(dictconfig, dict):
config = bert_configuration.from_dict(dictconfig)
else:
config_file = os.path.join(serialization_dir, CONFIG_NAME)
config = bert_configuration.from_json_file(config_file)
logger.info("Model config {}".format(config))
# Instantiate model.
model = cls(config, *inputs, **kwargs)
if state_dict is None:
weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
state_dict = torch.load(weights_path)
old_keys = []
new_keys = []
for key in state_dict.keys():
new_key = None
if 'gamma' in key:
new_key = key.replace('gamma', 'weight')
if 'beta' in key:
new_key = key.replace('beta', 'bias')
if new_key:
old_keys.append(key)
new_keys.append(new_key)
for old_key, new_key in zip(old_keys, new_keys):
state_dict[new_key] = state_dict.pop(old_key)
missing_keys = []
unexpected_keys = []
error_msgs = []
# copy state_dict so _load_from_state_dict can modify it
metadata = getattr(state_dict, '_metadata', None)
state_dict = state_dict.copy()
if metadata is not None:
state_dict._metadata = metadata
def load(module, prefix=''):
local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})
module._load_from_state_dict(
state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
for name, child in module._modules.items():
if child is not None:
load(child, prefix + name + '.')
load(model, prefix='' if hasattr(model, 'bert') else 'bert.')
if len(missing_keys) > 0:
logger.info("Weights of {} not initialized from pretrained model: {}".format(
model.__class__.__name__, missing_keys))
if len(unexpected_keys) > 0:
logger.info("Weights from pretrained model not used in {}: {}".format(
model.__class__.__name__, unexpected_keys))
if tempdir:
# Clean up temp dir
shutil.rmtree(tempdir)
return model

This class also contains the following methods.

def info(self, dictionary: dict) -> None:
for key, value in dictionary.items():
self.log(key, value, prog_bar=True)
def configure_optimizers(self) -> Dict[str, Union[BertAdam, WarmupReduceLROnPlateauScheduler]]:
"""
Configure the optimizer and scheduler.
Returns:
A dictionary containing optimizer and scheduler.
"""
self.num_train_optimization_steps: int = int(
self.trainer.estimated_stepping_batches / self.config.gradient_accumulation_steps
)
print("num_training_steps: ", self.trainer.estimated_stepping_batches)
param_optimizer = list(self.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
'weight_decay': 0.0}
]
self.optimizer = BertAdam(
optimizer_grouped_parameters,
lr=self.config.lr,
warmup=self.config.warmup_proportion,
t_total=self.num_train_optimization_steps
)
return {'optimizer': self.optimizer}
def training_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
"""
Training step. Called for each train batch. May be used to update the model or
do any other training specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing logits.
"""
raise NotImplementedError
def validation_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
"""
Validation step. Called for each validation batch. May be used to evaluate the model or
do any other validation specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing logits.
"""
raise NotImplementedError
def test_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
"""
Test step. Called for each test batch. May be used to evaluate the model or
do any other test specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing logits.
"""
raise NotImplementedError
def predict_step(self, batch: Tuple, batch_idx: int) -> Dict:
"""
predict step. Called for each predict batch. May be used to predict output of the input batch or
do any other predict specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing results.
"""
raise NotImplementedError
def training_epoch_end(
self,
outputs: Optional[List[Dict[str, torch.Tensor]]] = None,
epoch: Optional[int] = None,
eval_metrics: Optional[float] = None
) -> None:
self.scheduler_step(epoch + 1, epoch, eval_metrics)
@staticmethod
def get_logits(logits: np.array) -> Tuple[list, list]:
"""
get logits according to the hyperparameter from __call__ method.
Args:
logits: logits from classification. type: torch.tensor
Returns:
logits: logits from classification. type: torch.Tensor
conf: confidence from classification. type: torch.Tensor
"""
logit = int(np.argmax(logits))
conf_init = int(np.max(logits))
logits = [logit]
conf = [conf_init]
return logits, conf
@staticmethod
def get_topk_logits(logits, topk: int = 2, min_match_rate: float = .3) -> Tuple[list, list]:
"""
get logits according to the hyperparameter from __call__ method.
Args:
logits: logits from classification. type: torch.tensor
topk: topk from classification. type: int
min_match_rate: min_match_rate from classification. type: float
Returns:
logits: logits from classification. type: torch.Tensor
conf: confidence from classification. type: torch.Tensor
"""
match_min = list()
for logit in logits:
if logit > min_match_rate:
match_min.append(logit)
if len(match_min) == 0:
    # nothing is above the threshold, so fall back to the single best class
    return [int(np.argmax(logits))], [float(np.max(logits))]
else:
    topk = min(topk, len(match_min))
    # map each logit value back to its class index
    topk_dict = {lgt: idx for idx, lgt in enumerate(logits)}
    match_min = sorted(match_min, reverse=True)[:topk]
    logits = [topk_dict[logit] for logit in match_min]
    return logits, match_min
def get_params(self, get_name: bool = False) -> None:
"""
get parameters from model.
Args:
get_name: get name from model. type: bool
Returns:
None
"""
if get_name is True:
condition = self.named_parameters()
else:
condition = self.parameters()
for param in condition:
print(param)
def freeze_params(self):
"""
freeze model encoder according to the hyperparameter from __init__ method.
"""
freeze_params = self.config.freeze_param_num
model_params = 0
for child in self.children():
for _ in child.parameters():
model_params += 1
assert freeze_params < model_params
freeze_num = 0
for child in self.children():
for param in child.parameters():
if freeze_num == freeze_params:
break
param.requires_grad = False
freeze_num += 1
def __initialize_training_step__(self):
"""
initialize all global variables can be different in each models.
for example, label_map, tokenizer, etc.
"""
if self.config.freeze_param_num >= 1:
self.freeze_params()
def forward(self, *args):
"""
Forward propagation method.
"""
raise NotImplementedError

Here's my BertModel and BertForSequenceClassification code.

BertModel

BertModel does not contain training_step, validation_step, test_step, or predict_step. Could this be the reason for the problem I'm currently facing?

class BertModel(PreTrainedBertModel):
"""BERT model ("Bidirectional Embedding Representations from a Transformer").
Params:
config: a BertConfig class instance with the configuration to build a new model
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`extract_features.py`, `run_classifier.py` and `run_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
Outputs: Tuple of (encoded_layers, pooled_output)
`encoded_layers`: controlled by the `output_all_encoded_layers` argument:
- `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
- `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
to the last attention block of shape [batch_size, sequence_length, hidden_size],
`pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
classifier pretrained on top of the hidden state associated to the first character of the
input (`CLF`) to train on the Next-Sentence task (see BERT's paper).
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = modeling.BertModel(config=config)
all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
```
"""
def __init__(self, config):
super(BertModel, self).__init__(config)
self.embeddings = BertEmbeddings(config)
self.encoder = BertEncoder(config)
self.pooler = BertPooler(config)
self.apply(self.init_bert_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=True):
if attention_mask is None:
attention_mask = torch.ones_like(input_ids)
if token_type_ids is None:
token_type_ids = torch.zeros_like(input_ids)
# We create a 3D attention mask from a 2D tensor mask.
# Sizes are [batch_size, 1, 1, to_seq_length]
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
# this attention mask is more simple than the triangular masking of causal attention
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
embedding_output = self.embeddings(input_ids, token_type_ids)
encoded_layers = self.encoder(embedding_output, extended_attention_mask,
output_all_encoded_layers=output_all_encoded_layers)
sequence_output = encoded_layers[-1]
pooled_output = self.pooler(sequence_output)
if not output_all_encoded_layers:
encoded_layers = encoded_layers[-1]
return encoded_layers, pooled_output

BertForSequenceClassification

Unlike BertModel, BertForSequenceClassification contains training_step, validation_step, test_step, and predict_step.

class BertForSequenceClassification(PreTrainedBertModel):
"""BERT model for classification.
This module is composed of the BERT model with a linear layer on top of
the pooled output.
Params:
`config`: a BertConfig class instance with the configuration to build a new model.
`label_dict_or_path`: Dictionary or path to a file containing a label dictionary.
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`extract_features.py`, `run_classifier.py` and `run_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
with indices selected in [0, ..., num_labels].
Outputs:
if `labels` is not `None`:
Outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`:
Outputs the classification logits of shape [batch_size, num_labels].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
label_dict_or_path = {'negative': 0, 'positive': 1}
model = BertForSequenceClassification(config, label_dict_or_path)
logits = model(input_ids, token_type_ids, input_mask)
```
"""
def __init__(self, config, label_dict_or_path: Union[str, dict]):
super(BertForSequenceClassification, self).__init__(config)
if isinstance(label_dict_or_path, dict):
self.labels = label_dict_or_path
elif isinstance(label_dict_or_path, str):
self.labels = load_label_list(label_dict_or_path)
else:
raise AttributeError('label_dict_or_path must be a path string or a dictionary containing a label map')
self.num_labels = len(self.labels)
self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
self.config = config
self.min_match_rate = config.min_match_rate
self.topk = config.topk
self.pred_step_labels = {value: key for key, value in self.labels.items()}
self.apply(self.init_bert_weights)
self.accuracy_score = M.Accuracy(num_classes=self.num_labels, top_k=config.topk)
self.precision_score = M.Precision(num_classes=self.num_labels, top_k=config.topk)
self.recall_score = M.Recall(num_classes=self.num_labels, top_k=config.topk)
self.f1_score = M.F1Score(num_classes=self.num_labels, top_k=config.topk)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return logits
def collect_outputs(
self,
stage: str,
logits: torch.Tensor,
labels: torch.Tensor,
):
accuracy = self.accuracy_score(logits, labels)
f1 = self.f1_score(logits, labels)
precision = self.precision_score(logits, labels)
recall = self.recall_score(logits, labels)
criterion = CrossEntropyLoss()
loss = criterion(logits.view(-1, self.num_labels), labels.view(-1))
self.info(
{
f'{stage}_loss': loss,
f'{stage}_accuracy': accuracy,
f'{stage}_precision': precision,
f'{stage}_recall': recall,
f'{stage}_f1': f1,
f'{stage}_lr': self.optimizer.get_lr()[0]
}
)
return {'loss': loss, 'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}
def training_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
input_ids, input_mask, segment_ids, label_ids = batch
_, pooled_output = self.bert(input_ids, input_mask, segment_ids, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return self.collect_outputs(
'train',
logits=logits,
labels=label_ids
)
def training_epoch_end(self, outputs: Optional[List[Dict[str, torch.Tensor]]] = None, epoch: Optional[int] = None,
eval_metrics: Optional[float] = None) \
-> None:
return None
def validation_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
input_ids, input_mask, segment_ids, label_ids = batch
_, pooled_output = self.bert(input_ids, input_mask, segment_ids, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return self.collect_outputs(
'valid',
logits=logits,
labels=label_ids
)
def test_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
input_ids, input_mask, segment_ids, label_ids = batch
_, pooled_output = self.bert(input_ids, input_mask, segment_ids, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return self.collect_outputs(
'test',
logits=logits,
labels=label_ids
)
def predict_step(self, batch: Tuple, batch_idx: int) -> Dict:
input_ids, input_mask, segment_ids, _ = batch
logits = self.forward(input_ids, segment_ids, input_mask)
if self.topk == 1:
    logit_idxs, log_probs = self.get_logits(logits.cpu().detach().numpy())
else:
    logit_idxs, log_probs = self.get_topk_logits(logits.cpu().detach().numpy(), topk=self.topk,
                                                 min_match_rate=self.min_match_rate)
logits = [self.pred_step_labels[logit_idx] for logit_idx in logit_idxs]
return {'text': input_ids, 'score': log_probs, 'value': logits}
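Since the thread title is about gradients not being updated, one thing I could add to this module to check whether gradients actually reach the parameters is something like the following (a rough sketch using the standard LightningModule on_after_backward hook; the logging interval and metric name are arbitrary):

```python
def on_after_backward(self) -> None:
    # Log the global gradient norm occasionally; if it stays at exactly 0.0,
    # gradients are not flowing (e.g. frozen parameters or a detached graph).
    if self.global_step % 100 == 0:
        squared_sum = 0.0
        for _, p in self.named_parameters():
            if p.requires_grad and p.grad is not None:
                squared_sum += p.grad.detach().norm(2).item() ** 2
        self.log("grad_norm", squared_sum ** 0.5, prog_bar=True)
```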
This is the pretrained-model loading boilerplate function.

def load_model(args: argparse.Namespace, tokenizer_path: str, label_dict_or_path: t.Optional[str] = None) -> \
Union[torch.nn.Module, pl.LightningModule]:
model_config = R.MODEL_CONFIGS.get(args.model_task, None)
model = R.MODEL.get(args.model_task, None)
tokenizer = BertTokenizer.from_pretrained(tokenizer_path)
if args.model_path_or_instance:
from DeepReview.module import PYTORCH_PRETRAINED_BERT_CACHE
cache_dir = os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.accelerator))
config = model_config.from_json_file(args.bert_config_path)
config: dict = config.to_dict()
config.update({
'warmup_steps': args.warmup_steps,
'warmup_proportion': args.warmup_proportion
})
config.update({
'lr': args.lr
})
if model.__class__.__name__ == 'BertForSentimentAnalysis':
config.update({
'embed_dim': 1024,
'fc_hidden_dim': 768,
})
config = model_config.from_dict(config)
bert_config_dir = '/'.join(args.bert_config_path.split('/')[:-1])
if args.model_path_or_instance.endswith('.ckpt'):
model = model.load_from_checkpoint(
args.model_path_or_instance, config=config, label_dict_or_path=label_dict_or_path
)
else:
state_dict = torch.load(args.model_path_or_instance, map_location='cpu')
model = model.from_pretrained(
bert_config_dir, state_dict=state_dict, cache_dir=cache_dir, label_dict_or_path=label_dict_or_path
)
json_string = model.config.to_json_string()
config_save_path = bert_config_dir + '/' + model.__class__.__name__ + '_' + 'config.json'
with open(config_save_path, 'w') as f:
json.dump(json_string, f)
else:
config: Dict[Union[str, Any], Union[Union[int, str, float], Any]] = {
'vocab_size_or_config_json_file': len(tokenizer.vocab),
'hidden_size': 1024,
'num_hidden_layers': 24,
'num_attention_heads': 16,
'intermediate_size': 4096,
'hidden_act': 'gelu',
'hidden_dropout_prob': 0.1,
'attention_probs_dropout_prob': 0.1,
'max_position_embeddings': 512,
'type_vocab_size': 2,
'initializer_range': 0.02
}
if config.__class__.__name__ in R.PRETRAINING:
config.update({
'min_lr': args.min_lr,
'peak_lr': args.lr
})
else:
config.update({
'lr': args.lr
})
config.update({
'warmup_steps': args.warmup_steps,
'warmup_proportion': args.warmup_proportion,
'gradient_accumulation_steps': args.gradient_accumulation_steps,
})
if model.__class__.__name__ == 'BertForSentimentAnalysis':
config.update({
'embed_dim': 1024,
'fc_hidden_dim': 768,
})
config = model_config.from_dict(config)
if model.__class__.__name__ in ['BertForMaskedLM', 'BertForPreTraining']:
model = model(config)
else:
model = model(config, label_dict_or_path=label_dict_or_path)
return model
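A quick sanity check I can run right after load_model() returns (a hypothetical snippet for the plain state_dict branch, not the .ckpt one; checkpoint keys may carry a different prefix such as 'bert.' or still use gamma/beta names, so the shared-key count is informative too):

```python
import torch

# Compare tensors that exist under the same key in both the checkpoint and the model.
state = torch.load(args.model_path_or_instance, map_location='cpu')
model_state = model.state_dict()
shared = [k for k in state if k in model_state]
mismatched = [k for k in shared if not torch.equal(state[k], model_state[k].cpu())]
print(f'{len(shared)} shared tensors, {len(mismatched)} mismatched')  # expect 0 mismatched
```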
My ENV
Hello, my dear pytorch-lightning community members!
I'm using pytorch-lightning to train BERT, based on the huggingface transformers BERT. It trained well when I was using my custom training boilerplate code. After I replaced my boilerplate code with pytorch-lightning, however, the model is not converging (in other words, the loss is not improving).
I suspect the reason for this is the optimizer code (BertAdam). BertAdam internally contains a scheduler, so I checked configure_optimizers. I found that num_train_optimization_steps was wrong and the learning rate was set too low, so I set the learning rate higher (2e-5) and changed num_train_optimization_steps from
into
self.trainer.estimated_stepping_batches / self.config.gradient_accumulation_steps
However, the loss is still not converging.
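To rule out the warmup schedule, I could also log the learning rate that BertAdam actually applies at each step (a sketch; it assumes BertAdam's get_lr() from pytorch_pretrained_bert, and the Lightning hook signature may differ slightly between versions):

```python
def on_train_batch_end(self, outputs, batch, batch_idx) -> None:
    # BertAdam rescales the LR internally according to warmup/t_total, so the value
    # passed at construction is not necessarily what is applied at a given step.
    if batch_idx % 100 == 0:
        self.log('effective_lr', self.optimizer.get_lr()[0], prog_bar=True)
```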
The question is,