Reproducing results from paper #40
Comments
Yep, this looks right to me. I think we trained the model for more steps after submission, which is why the scores went up a little bit. To get the higher score, you have to set the sequence beam width to 8 and the number of steps to 50.
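For reference, a minimal sketch of those two settings, assuming a `trainer` loaded as in the snippets later in this thread (`val_dataset` stands in for whichever validation split you evaluate on):

```python
# Rough sketch: attribute names follow the snippets later in this thread.
trainer.sequence_beam_width = 8       # sequence-level beam width used during correction
trainer.num_gen_recursive_steps = 50  # number of recursive correction steps
trainer.evaluate(eval_dataset=val_dataset)  # val_dataset: the validation split you loaded
```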
Awesome, thanks for the quick response!
One follow-up question -- how are the train / dev splits for the NQ experiments constructed (are they split randomly at the article level or at the truncated-passage level)? The dev dataset looks like randomly sampled passages from different articles (i.e. the second row is not the continuation of the first row). A bit more background: I am trying to test the model on longer sequences (e.g. 2x-length Wikipedia passages), so I was thinking of simply concatenating the passages in the dev set (which I think only makes sense if they are consecutive), e.g. something like the sketch below. It also seems like some experiments in the paper (Table 2) look at decoding from lengths longer than the training sequences. I'd appreciate it if you could provide pointers on how to reproduce some of the results there too! Thanks a lot!
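(Concretely, the naive concatenation I have in mind is something like this sketch. It assumes the dev split is a HuggingFace `datasets.Dataset` with a `text` column -- the actual column name may differ -- and that consecutive rows really are consecutive passages, which is exactly what I'm asking about:)

```python
from datasets import Dataset

def concat_adjacent_passages(dev: Dataset, text_column: str = "text") -> Dataset:
    # Join rows (0, 1), (2, 3), ... into single roughly 2x-length passages.
    # Only meaningful if row i+1 is actually the continuation of row i.
    texts = dev[text_column]
    joined = [texts[i] + " " + texts[i + 1] for i in range(0, len(texts) - 1, 2)]
    return Dataset.from_dict({text_column: joined})
```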
Hi! I took the train and validation sets from DPR (https://arxiv.org/abs/2004.04906 / https://github.com/facebookresearch/DPR). I'll send you a message offline to discuss further.
Oh, but I don't think Table 2 decodes from any lengths longer than the training sequences. I train on sequences of up to 128 tokens and use those for testing too. I never test on embedded sequences of more than 128 tokens, but that sounds really interesting!
Yes, the MSMARCO longer-sequence-length dataset included sequences from 1 to 128 tokens.
Hi there! I am trying to reproduce results for the OpenAI model trained on MSMARCO (up to 128 tokens, last section in Table 1). Is the below the correct command/model to run?
I am currently running into some errors (a hard-coded path not found, etc.), but wanted to make sure this is the right model / set-up to look at. Thanks!
Hi @carriex -- this looks right! I'm pretty sure that's the right model. Can you share the error with me? Or maybe we can work out of a Colab to get this figured out. Sorry for the hardcoded path; I'm not sure where it is, but I will remove it for you!
Sorry for the late reply! Here is a Colab notebook showing the error.
Ok, there was something weird with the pre-trained model from HuggingFace which I will look into. For now, I developed a workaround; here's some code that properly loads the hypothesizer model from its pre-trained checkpoint:

```python
import torch

from vec2text.analyze_utils import args_from_config
from vec2text.models.config import InversionConfig
from vec2text.run_args import DataArguments, ModelArguments, TrainingArguments
from vec2text import experiments


def load_experiment_and_trainer_from_pretrained(name: str, use_less_data: int = 1000):
    config = InversionConfig.from_pretrained(name)
    model_args = args_from_config(ModelArguments, config)
    data_args = args_from_config(DataArguments, config)
    training_args = args_from_config(TrainingArguments, config)

    data_args.use_less_data = use_less_data

    #######################################################################
    from accelerate.state import PartialState

    training_args._n_gpu = 1 if torch.cuda.is_available() else 0  # Don't load in DDP
    training_args.bf16 = 0  # no bf16 in case no support from GPU
    training_args.local_rank = -1  # Don't load in DDP
    training_args.distributed_state = PartialState()
    training_args.deepspeed_plugin = None  # For backwards compatibility
    # training_args.dataloader_num_workers = 0  # no multiprocessing :)
    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"
    training_args.use_wandb = False
    training_args.report_to = []
    training_args.mock_embedder = False
    training_args.output_dir = "saves/" + name.replace("/", "__")
    ########################################################################

    experiment = experiments.experiment_from_args(model_args, data_args, training_args)
    trainer = experiment.load_trainer()
    trainer.model = trainer.model.__class__.from_pretrained(name)
    trainer.model.to(training_args.device)
    return experiment, trainer


experiment, trainer = load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=1000,
)

print(" >>>> test ")

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer,
)

print(" >>>> loaded datasets ")

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 50
trainer.evaluate(eval_dataset=train_datasets["validation"])
```
(The only line I changed was adding this one: `training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"`.)
Hi, I want to know if this is the right command to reproduce the results:

```python
from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct"
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer,
)
val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer,
)

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 4
trainer.num_gen_recursive_steps = 50
trainer.evaluate(eval_dataset=train_datasets["validation"])
```
Hmm, the command looks right and the numbers are close, but a little low. Oddly, the dataset looks different -- I've never seen that example before. Also, how many samples are you using from the validation set?
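(To compare apples to apples, one way to pin the number of evaluation samples -- assuming the split is a HuggingFace `datasets.Dataset` -- is a slice along these lines:)

```python
# Sketch: evaluate on a fixed-size slice of the validation split so both
# setups score the same number of examples (500 is an arbitrary choice).
val_subset = train_datasets["validation"].select(range(500))
trainer.evaluate(eval_dataset=val_subset)
```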
Yep, it should be the last number in the figure, the one you highlighted. And you're right -- it should be the NQ validation set (not MSMARCO, my mistake). Something else must have changed between your setup and mine, because the numbers in red are correct. I will put some thought into what it may be.
Hi @jxmorris12, do you think this might be relevant?
Hi Jack,
Thanks for the great work and for sharing the code! I am trying to reproduce results from the paper and want to confirm that I am doing it correctly.
Specifically, I ran the below code
and got the following results:

```python
{'eval_loss': 0.6015774011611938, 'eval_pred_num_tokens': 31.0, 'eval_true_num_tokens': 32.0, 'eval_token_set_precision': 0.9518449167645596, 'eval_token_set_recall': 0.9564611513833035, 'eval_token_set_f1': 0.9538292487776809, 'eval_token_set_f1_sem': 0.004178347129611342, 'eval_n_ngrams_match_1': 23.128, 'eval_n_ngrams_match_2': 20.244, 'eval_n_ngrams_match_3': 18.212, 'eval_num_true_words': 24.308, 'eval_num_pred_words': 24.286, 'eval_bleu_score': 83.32868888524891, 'eval_bleu_score_sem': 1.1145241315071208, 'eval_rouge_score': 0.9550079258714326, 'eval_exact_match': 0.578, 'eval_exact_match_sem': 0.022109039310618563, 'eval_emb_cos_sim': 0.9910151958465576, 'eval_emb_cos_sim_sem': 0.0038230661302804947, 'eval_emb_top1_equal': 0.75, 'eval_emb_top1_equal_sem': 0.11180339753627777, 'eval_runtime': 253.6454, 'eval_samples_per_second': 1.971, 'eval_steps_per_second': 0.126}
```
Are the numbers here supposed to correspond to "GTR - NQ - Vec2Text [20 steps]" in Table 1 (row 7)? I think most of the numbers are close, except for exact match, for which I got a higher number (57.8 vs. 40.2 in the paper).
Thanks again!