The gradient does not seem to be updated during BERT training. #13741
just fyi!
Just in case you are using a pre-trained model, are you sure it's loaded correctly? If it still doesn't work, can you share both pieces of code so we can check further?
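For example, one quick way to verify (a minimal sketch; the checkpoint file name is hypothetical and `model` is assumed to be your instantiated module) is to load the state_dict with strict=False and inspect what didn't match:

```python
import torch

# Hypothetical checkpoint path, for illustration only.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
result = model.load_state_dict(state_dict, strict=False)

# Both lists should be empty (or nearly so) if the checkpoint really matches the model.
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```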
Thanks for your answer. As you pointed out, I'm using pre-trained BERT weights, loaded by the code below. It worked fine before I switched to pytorch-lightning. I checked the BERT weight path in the namespace, and the pre-trained state_dict seems to be loaded well. (Note: the pre-trained model is not a pytorch-lightning checkpoint; it's a plain PyTorch state_dict.)

Pre-trained weight loader

Here's my pre-trained model loading code:

class PreTrainedBertModel(pl.LightningModule):
""" An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
def __init__(self, config, *inputs, **kwargs):
super(PreTrainedBertModel, self).__init__()
if not isinstance(config, BertConfig):
raise ValueError(
"Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
"To create a model from a Google pretrained model use "
"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
self.__class__.__name__, self.__class__.__name__
))
self.config = config
def init_bert_weights(self, module):
""" Initialize the weights.
"""
if isinstance(module, (nn.Linear, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, BertLayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and module.bias is not None:
module.bias.data.zero_()
@classmethod
def from_pretrained(cls, pretrained_model_name, state_dict=None, cache_dir=None, dictconfig: dict = None, *inputs, **kwargs):
"""
Instantiate a PreTrainedBertModel from a pre-trained model file or a pytorch state dict.
Download and cache the pre-trained model file if needed.
Params:
pretrained_model_name: either:
- a str with the name of a pre-trained model to load, selected from the list of:
. `bert-base-uncased`
. `bert-large-uncased`
. `bert-base-cased`
. `bert-large-cased`
. `bert-base-multilingual-uncased`
. `bert-base-multilingual-cased`
. `bert-base-chinese`
- a path or url to a pretrained model archive containing:
. `bert_config.json` a configuration file for the model
. `pytorch_model.vocabs` a PyTorch dump of a BertForPreTraining instance
cache_dir: an optional path to a folder in which the pre-trained model weights will be cached.
state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of the Google pre-trained weights
*inputs, **kwargs: additional input for the specific Bert class
(ex: num_labels for BertForSequenceClassification)
"""
model_config_map = {
'PreTrainedBertModel': BertConfig,
'BertModel': BertConfig,
'BertForPreTraining': BertConfig,
'BertForMaskedLM': BertConfig,
'BertForSequenceClassification': BertForSequenceClassificationConfig,
'BertForSentimentAnalysis': BertForSentimentAnalysisConfig,
'BertForNextSentencePrediction': BertConfig,
'BertForTokenClassification': BertConfig,
'BertForQuestionAnswering': BertConfig,
'BertForMultipleChoice': BertConfig,
}
bert_configuration = model_config_map.get(cls.__name__)
if pretrained_model_name in PRETRAINED_MODEL_ARCHIVE_MAP:
archive_file = PRETRAINED_MODEL_ARCHIVE_MAP[pretrained_model_name]
else:
archive_file = pretrained_model_name
# redirect to the cache, if necessary
try:
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
except FileNotFoundError:
logger.error(
"Model trainers '{}' was not found in model trainers list ({}). "
"We assumed '{}' was a path or url but couldn't find any file "
"associated to this path or url.".format(
pretrained_model_name,
', '.join(PRETRAINED_MODEL_ARCHIVE_MAP.keys()),
archive_file))
return None
if resolved_archive_file == archive_file:
logger.info("loading archive file {}".format(archive_file))
else:
logger.info("loading archive file {} from cache at {}".format(
archive_file, resolved_archive_file))
tempdir = None
if os.path.isdir(resolved_archive_file):
serialization_dir = resolved_archive_file
else:
# Extract archive to temp dir
tempdir = tempfile.mkdtemp()
logger.info("extracting archive file {} to temp dir {}".format(
resolved_archive_file, tempdir))
with tarfile.open(resolved_archive_file, 'r:gz') as archive:
archive.extractall(tempdir)
serialization_dir = tempdir
# Load config
if isinstance(dictconfig, dict):
config = bert_configuration.from_dict(dictconfig)
else:
config_file = os.path.join(serialization_dir, CONFIG_NAME)
config = bert_configuration.from_json_file(config_file)
logger.info("Model config {}".format(config))
# Instantiate model.
model = cls(config, *inputs, **kwargs)
if state_dict is None:
weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
state_dict = torch.load(weights_path)
old_keys = []
new_keys = []
for key in state_dict.keys():
new_key = None
if 'gamma' in key:
new_key = key.replace('gamma', 'weight')
if 'beta' in key:
new_key = key.replace('beta', 'bias')
if new_key:
old_keys.append(key)
new_keys.append(new_key)
for old_key, new_key in zip(old_keys, new_keys):
state_dict[new_key] = state_dict.pop(old_key)
missing_keys = []
unexpected_keys = []
error_msgs = []
# copy state_dict so _load_from_state_dict can modify it
metadata = getattr(state_dict, '_metadata', None)
state_dict = state_dict.copy()
if metadata is not None:
state_dict._metadata = metadata
def load(module, prefix=''):
local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})
module._load_from_state_dict(
state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
for name, child in module._modules.items():
if child is not None:
load(child, prefix + name + '.')
load(model, prefix='' if hasattr(model, 'bert') else 'bert.')
if len(missing_keys) > 0:
logger.info("Weights of {} not initialized from pretrained model: {}".format(
model.__class__.__name__, missing_keys))
if len(unexpected_keys) > 0:
logger.info("Weights from pretrained model not used in {}: {}".format(
model.__class__.__name__, unexpected_keys))
if tempdir:
# Clean up temp dir
shutil.rmtree(tempdir)
return model

This class also contains the following methods.

def info(self, dictionary: dict) -> None:
for key, value in dictionary.items():
self.log(key, value, prog_bar=True)
def configure_optimizers(self) -> Dict[str, Union[BertAdam, WarmupReduceLROnPlateauScheduler]]:
"""
Configure the optimizer and scheduler.
Returns:
A dictionary containing optimizer and scheduler.
"""
self.num_train_optimization_steps: int = int(
self.trainer.estimated_stepping_batches / self.config.gradient_accumulation_steps
)
print("num_training_steps: ", self.trainer.estimated_stepping_batches)
param_optimizer = list(self.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
'weight_decay': 0.0}
]
self.optimizer = BertAdam(
optimizer_grouped_parameters,
lr=self.config.lr,
warmup=self.config.warmup_proportion,
t_total=self.num_train_optimization_steps
)
return {'optimizer': self.optimizer}
def training_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
"""
Training step. Called for each train batch. May be used to update the model or
do any other training specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing logits.
"""
raise NotImplementedError
def validation_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
"""
Validation step. Called for each validation batch. May be used to evaluate the model or
do any other validation specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing logits.
"""
raise NotImplementedError
def test_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
"""
Test step. Called for each test batch. May be used to evaluate the model or
do any other test specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing logits.
"""
raise NotImplementedError
def predict_step(self, batch: Tuple, batch_idx: int) -> Dict:
"""
predict step. Called for each predict batch. May be used to predict output of the input batch or
do any other predict specific things.
Args:
batch: Tuple of input and target data.
batch_idx: Index of the batch.
output:
A dictionary containing results.
"""
raise NotImplementedError
def training_epoch_end(
self,
outputs: Optional[List[Dict[str, torch.Tensor]]] = None,
epoch: Optional[int] = None,
eval_metrics: Optional[float] = None
) -> None:
self.scheduler_step(epoch + 1, epoch, eval_metrics)
@staticmethod
def get_logits(logits: np.array) -> Tuple[list, list]:
"""
get logits according to the hyperparameter from __call__ method.
Args:
logits: logits from classification. type: torch.tensor
Returns:
logits: logits from classification. type: torch.Tensor
conf: confidence from classification. type: torch.Tensor
"""
logit = int(np.argmax(logits))
conf_init = int(np.max(logits))
logits = [logit]
conf = [conf_init]
return logits, conf
@staticmethod
def get_topk_logits(logits, topk: int = 2, min_match_rate: float = .3) -> Tuple[list, list]:
"""
get logits according to the hyperparameter from __call__ method.
Args:
logits: logits from classification. type: torch.tensor
topk: topk from classification. type: int
min_match_rate: min_match_rate from classification. type: float
Returns:
logits: logits from classification. type: torch.Tensor
conf: confidence from classification. type: torch.Tensor
"""
match_min = list()
for logit in logits:
if logit > min_match_rate:
match_min.append(logit)
if len(match_min) == 0:
    # nothing is above the threshold, so fall back to the single best class
    return [int(np.argmax(logits))], [float(np.max(logits))]
else:
    topk = min(topk, len(match_min))
    # map each logit value back to its class index
    topk_dict = {lgt: idx for idx, lgt in enumerate(logits)}
    match_min = sorted(match_min, reverse=True)[:topk]
    logits = [topk_dict[logit] for logit in match_min]
    return logits, match_min
def get_params(self, get_name: bool = False) -> None:
"""
get parameters from model.
Args:
get_name: get name from model. type: bool
Returns:
None
"""
if get_name is True:
condition = self.named_parameters()
else:
condition = self.parameters()
for param in condition:
print(param)
def freeze_params(self):
"""
freeze model encoder according to the hyperparameter from __init__ method.
"""
freeze_params = self.config.freeze_param_num
model_params = 0
for child in self.children():
for _ in child.parameters():
model_params += 1
assert freeze_params < model_params
freeze_num = 0
for child in self.children():
for param in child.parameters():
if freeze_num == freeze_params:
break
param.requires_grad = False
freeze_num += 1
def __initialize_training_step__(self):
"""
initialize all global variables can be different in each models.
for example, label_map, tokenizer, etc.
"""
if self.config.freeze_param_num >= 1:
self.freeze_params()
def forward(self, *args):
"""
Forward propagation method.
"""
raise NotImplementedError

Here's my BertModel and BertForSequenceClassification code.

BertModel

BertModel does not contain training_step, validation_step, test_step, or predict_step. Could this be the reason for the problem I'm currently facing?

class BertModel(PreTrainedBertModel):
"""BERT model ("Bidirectional Embedding Representations from a Transformer").
Params:
config: a BertConfig class instance with the configuration to build a new model
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`extract_features.py`, `run_classifier.py` and `run_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
Outputs: Tuple of (encoded_layers, pooled_output)
`encoded_layers`: controlled by the `output_all_encoded_layers` argument:
- `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
- `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
to the last attention block of shape [batch_size, sequence_length, hidden_size],
`pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
classifier pretrained on top of the hidden state associated to the first character of the
input (`CLF`) to train on the Next-Sentence task (see BERT's paper).
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = modeling.BertModel(config=config)
all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
```
"""
def __init__(self, config):
super(BertModel, self).__init__(config)
self.embeddings = BertEmbeddings(config)
self.encoder = BertEncoder(config)
self.pooler = BertPooler(config)
self.apply(self.init_bert_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=True):
if attention_mask is None:
attention_mask = torch.ones_like(input_ids)
if token_type_ids is None:
token_type_ids = torch.zeros_like(input_ids)
# We create a 3D attention mask from a 2D tensor mask.
# Sizes are [batch_size, 1, 1, to_seq_length]
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
# this attention mask is more simple than the triangular masking of causal attention
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
embedding_output = self.embeddings(input_ids, token_type_ids)
encoded_layers = self.encoder(embedding_output, extended_attention_mask,
output_all_encoded_layers=output_all_encoded_layers)
sequence_output = encoded_layers[-1]
pooled_output = self.pooler(sequence_output)
if not output_all_encoded_layers:
encoded_layers = encoded_layers[-1]
return encoded_layers, pooled_output

BertForSequenceClassification

Unlike BertModel, BertForSequenceClassification contains training_step, validation_step, test_step, and predict_step.

class BertForSequenceClassification(PreTrainedBertModel):
"""BERT model for classification.
This module is composed of the BERT model with a linear layer on top of
the pooled output.
Params:
`config`: a BertConfig class instance with the configuration to build a new model.
`label_dict_or_path`: Dictionary or path to a file containing a label dictionary.
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`extract_features.py`, `run_classifier.py` and `run_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
with indices selected in [0, ..., num_labels].
Outputs:
if `labels` is not `None`:
Outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`:
Outputs the classification logits of shape [batch_size, num_labels].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
label_dict_or_path = {'negative': 0, 'positive': 1}
model = BertForSequenceClassification(config, label_dict_or_path)
logits = model(input_ids, token_type_ids, input_mask)
```
"""
def __init__(self, config, label_dict_or_path: Union[str, dict]):
super(BertForSequenceClassification, self).__init__(config)
if isinstance(label_dict_or_path, dict):
self.labels = label_dict_or_path
elif isinstance(label_dict_or_path, str):
self.labels = load_label_list(label_dict_or_path)
else:
raise AttributeError('label_dict_or_path must be a path string or a dictionary containing a label map')
self.num_labels = len(self.labels)
self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
self.config = config
self.min_match_rate = config.min_match_rate
self.topk = config.topk
self.pred_step_labels = {value: key for key, value in self.labels.items()}
self.apply(self.init_bert_weights)
self.accuracy_score = M.Accuracy(num_classes=self.num_labels, top_k=config.topk)
self.precision_score = M.Precision(num_classes=self.num_labels, top_k=config.topk)
self.recall_score = M.Recall(num_classes=self.num_labels, top_k=config.topk)
self.f1_score = M.F1Score(num_classes=self.num_labels, top_k=config.topk)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return logits
def collect_outputs(
self,
stage: str,
logits: torch.Tensor,
labels: torch.Tensor,
):
accuracy = self.accuracy_score(logits, labels)
f1 = self.f1_score(logits, labels)
precision = self.precision_score(logits, labels)
recall = self.recall_score(logits, labels)
criterion = CrossEntropyLoss()
loss = criterion(logits.view(-1, self.num_labels), labels.view(-1))
self.info(
{
f'{stage}_loss': loss,
f'{stage}_accuracy': accuracy,
f'{stage}_precision': precision,
f'{stage}_recall': recall,
f'{stage}_f1': f1,
f'{stage}_lr': self.optimizer.get_lr()[0]
}
)
return {'loss': loss, 'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}
def training_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
input_ids, input_mask, segment_ids, label_ids = batch
_, pooled_output = self.bert(input_ids, input_mask, segment_ids, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return self.collect_outputs(
'train',
logits=logits,
labels=label_ids
)
def training_epoch_end(self, outputs: Optional[List[Dict[str, torch.Tensor]]] = None, epoch: Optional[int] = None,
eval_metrics: Optional[float] = None) \
-> None:
return None
def validation_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
input_ids, input_mask, segment_ids, label_ids = batch
_, pooled_output = self.bert(input_ids, input_mask, segment_ids, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return self.collect_outputs(
'valid',
logits=logits,
labels=label_ids
)
def test_step(self, batch: Tuple, batch_idx: int) -> Dict[str, torch.Tensor]:
input_ids, input_mask, segment_ids, label_ids = batch
_, pooled_output = self.bert(input_ids, input_mask, segment_ids, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return self.collect_outputs(
'test',
logits=logits,
labels=label_ids
)
def predict_step(self, batch: Tuple, batch_idx: int) -> Dict:
input_ids, input_mask, segment_ids, _ = batch
logits = self.forward(input_ids, segment_ids, input_mask)
if self.topk == 1:
    logit_idxs, log_probs = self.get_logits(logits.cpu().detach().numpy())
else:
    logit_idxs, log_probs = self.get_topk_logits(logits.cpu().detach().numpy(), topk=self.topk,
                                                 min_match_rate=self.min_match_rate)
logits = [self.pred_step_labels[logit_idx] for logit_idx in logit_idxs]
return {'text': input_ids, 'score': log_probs, 'value': logits}
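Since the thread title is about gradients not being updated, one thing I could add to this module to check whether gradients actually reach the parameters is something like the following (a rough sketch using the standard LightningModule on_after_backward hook; the logging interval and metric name are arbitrary):

```python
def on_after_backward(self) -> None:
    # Log the global gradient norm occasionally; if it stays at exactly 0.0,
    # gradients are not flowing (e.g. frozen parameters or a detached graph).
    if self.global_step % 100 == 0:
        squared_sum = 0.0
        for _, p in self.named_parameters():
            if p.requires_grad and p.grad is not None:
                squared_sum += p.grad.detach().norm(2).item() ** 2
        self.log("grad_norm", squared_sum ** 0.5, prog_bar=True)
```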
This is the pretrained-model loading boilerplate function.

def load_model(args: argparse.Namespace, tokenizer_path: str, label_dict_or_path: t.Optional[str] = None) -> \
Union[torch.nn.Module, pl.LightningModule]:
model_config = R.MODEL_CONFIGS.get(args.model_task, None)
model = R.MODEL.get(args.model_task, None)
tokenizer = BertTokenizer.from_pretrained(tokenizer_path)
if args.model_path_or_instance:
from DeepReview.module import PYTORCH_PRETRAINED_BERT_CACHE
cache_dir = os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.accelerator))
config = model_config.from_json_file(args.bert_config_path)
config: dict = config.to_dict()
config.update({
'warmup_steps': args.warmup_steps,
'warmup_proportion': args.warmup_proportion
})
config.update({
'lr': args.lr
})
if model.__class__.__name__ == 'BertForSentimentAnalysis':
config.update({
'embed_dim': 1024,
'fc_hidden_dim': 768,
})
config = model_config.from_dict(config)
bert_config_dir = '/'.join(args.bert_config_path.split('/')[:-1])
if args.model_path_or_instance.endswith('.ckpt'):
model = model.load_from_checkpoint(
args.model_path_or_instance, config=config, label_dict_or_path=label_dict_or_path
)
else:
state_dict = torch.load(args.model_path_or_instance, map_location='cpu')
model = model.from_pretrained(
bert_config_dir, state_dict=state_dict, cache_dir=cache_dir, label_dict_or_path=label_dict_or_path
)
json_string = model.config.to_json_string()
config_save_path = bert_config_dir + '/' + model.__class__.__name__ + '_' + 'config.json'
with open(config_save_path, 'w') as f:
json.dump(json_string, f)
else:
config: Dict[Union[str, Any], Union[Union[int, str, float], Any]] = {
'vocab_size_or_config_json_file': len(tokenizer.vocab),
'hidden_size': 1024,
'num_hidden_layers': 24,
'num_attention_heads': 16,
'intermediate_size': 4096,
'hidden_act': 'gelu',
'hidden_dropout_prob': 0.1,
'attention_probs_dropout_prob': 0.1,
'max_position_embeddings': 512,
'type_vocab_size': 2,
'initializer_range': 0.02
}
if config.__class__.__name__ in R.PRETRAINING:
config.update({
'min_lr': args.min_lr,
'peak_lr': args.lr
})
else:
config.update({
'lr': args.lr
})
config.update({
'warmup_steps': args.warmup_steps,
'warmup_proportion': args.warmup_proportion,
'gradient_accumulation_steps': args.gradient_accumulation_steps,
})
if model.__class__.__name__ == 'BertForSentimentAnalysis':
config.update({
'embed_dim': 1024,
'fc_hidden_dim': 768,
})
config = model_config.from_dict(config)
if model.__class__.__name__ in ['BertForMaskedLM', 'BertForPreTraining']:
model = model(config)
else:
model = model(config, label_dict_or_path=label_dict_or_path)
return model
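A quick sanity check I can run right after load_model() returns (a hypothetical snippet for the plain state_dict branch, not the .ckpt one; checkpoint keys may carry a different prefix such as 'bert.' or still use gamma/beta names, so the shared-key count is informative too):

```python
import torch

# Compare tensors that exist under the same key in both the checkpoint and the model.
state = torch.load(args.model_path_or_instance, map_location='cpu')
model_state = model.state_dict()
shared = [k for k in state if k in model_state]
mismatched = [k for k in shared if not torch.equal(state[k], model_state[k].cpu())]
print(f'{len(shared)} shared tensors, {len(mismatched)} mismatched')  # expect 0 mismatched
```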
My ENV
Hello, my dear pytorch-lightning community members!
I'm using pytorch-lightning to train BERT, based on the huggingface transformers BERT. It trained well when I was using my custom training boilerplate code. After I replaced my boilerplate code with pytorch-lightning, however, the model is not converging (in other words, the loss is not improving).
I suspect the reason for this is the optimizer code (BertAdam). BertAdam internally contains a scheduler, so I checked configure_optimizers. I found that num_train_optimization_steps was wrong and the learning rate was set too low, so I set the learning rate higher (2e-5) and changed num_train_optimization_steps from
into
self.trainer.estimated_stepping_batches / self.config.gradient_accumulation_steps
However, the loss is still not converging.
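To rule out the warmup schedule, I could also log the learning rate that BertAdam actually applies at each step (a sketch; it assumes BertAdam's get_lr() from pytorch_pretrained_bert, and the Lightning hook signature may differ slightly between versions):

```python
def on_train_batch_end(self, outputs, batch, batch_idx) -> None:
    # BertAdam rescales the LR internally according to warmup/t_total, so the value
    # passed at construction is not necessarily what is applied at a given step.
    if batch_idx % 100 == 0:
        self.log('effective_lr', self.optimizer.get_lr()[0], prog_bar=True)
```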
The question is,