
Reproducing results from paper #40

Open
carriex opened this issue Mar 19, 2024 · 17 comments
Labels
question Further information is requested

Comments

@carriex

carriex commented Mar 19, 2024

Hi Jack,

Thanks for the great work and for sharing the code! I am trying to reproduce the results from the paper and want to confirm that I am doing it correctly.

Specifically, I ran the code below:

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

and got the results below:
{'eval_loss': 0.6015774011611938, 'eval_pred_num_tokens': 31.0, 'eval_true_num_tokens': 32.0, 'eval_token_set_precision': 0.9518449167645596, 'eval_token_set_recall': 0.9564611513833035, 'eval_token_set_f1': 0.9538292487776809, 'eval_token_set_f1_sem': 0.004178347129611342, 'eval_n_ngrams_match_1': 23.128, 'eval_n_ngrams_match_2': 20.244, 'eval_n_ngrams_match_3': 18.212, 'eval_num_true_words': 24.308, 'eval_num_pred_words': 24.286, 'eval_bleu_score': 83.32868888524891, 'eval_bleu_score_sem': 1.1145241315071208, 'eval_rouge_score': 0.9550079258714326, 'eval_exact_match': 0.578, 'eval_exact_match_sem': 0.022109039310618563, 'eval_emb_cos_sim': 0.9910151958465576, 'eval_emb_cos_sim_sem': 0.0038230661302804947, 'eval_emb_top1_equal': 0.75, 'eval_emb_top1_equal_sem': 0.11180339753627777, 'eval_runtime': 253.6454, 'eval_samples_per_second': 1.971, 'eval_steps_per_second': 0.126}

Are the numbers here supposed to correspond to "GTR - NQ - Vec2Text [20 steps]" in Table 1 (row 7)? Most of the numbers are close, except for exact match, where I got a higher number (57.8 vs. 40.2 in the paper).

Thanks again!

@jxmorris12
Owner

Yep, this looks right to me. I think we trained the model for more steps after submission, which is why the scores went up a little bit. To get the higher score, you have to set the sequence beam width to 8 and the number of steps to 50.
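
Concretely, on top of the snippet above, that would look something like this (just a sketch; the eval batch size is carried over from your run and may need lowering to fit the wider beam in memory):

trainer.args.per_device_eval_batch_size = 16  # may need to be smaller with the wider beam
trainer.sequence_beam_width = 8       # sequence-level beam search ("sbeam")
trainer.num_gen_recursive_steps = 50  # number of correction steps
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)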

@jxmorris12 jxmorris12 added the question Further information is requested label Mar 19, 2024
@carriex
Author

carriex commented Mar 19, 2024

awesome, thanks for the quick response!

@carriex carriex closed this as completed Mar 19, 2024
@carriex
Author

carriex commented Mar 19, 2024

One follow-up question -- how are the train/dev splits for the NQ experiments constructed (are they split randomly at the article level or at the truncated-passage level)?

The dev dataset looks like randomly sampled passages from different articles (i.e. the second row is not the continuation of the first row).

[screenshot: example rows from the dev dataset]

A bit more background: I am trying to test the model on longer sequences (e.g. 2x the length for Wikipedia passages), so I was thinking of simply concatenating the passages in the dev set (which I think only makes sense if they are consecutive); see the sketch below. It seems like some experiments in the paper (Table 2) look at decoding from lengths longer than the training sequences. I'd appreciate pointers on how to reproduce some of those results too!
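
To illustrate what I mean by concatenating (purely a sketch -- it assumes consecutive rows really are consecutive chunks of the same article, which is exactly what I'm asking about, and the "text" column name is just a placeholder):

from datasets import Dataset

# `passages` stands in for the dev-set passages, in their original order.
passages = ["passage 0 ...", "passage 1 ...", "passage 2 ...", "passage 3 ..."]

# Join passage 2i with passage 2i+1 to get roughly double-length sequences.
doubled = [" ".join(passages[i:i + 2]) for i in range(0, len(passages) - 1, 2)]

# Wrap the results into a dataset with a placeholder "text" column to embed and invert.
long_dev = Dataset.from_dict({"text": doubled})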

Thanks a lot!

@jxmorris12
Owner

Hi! I took the train and validation sets from DPR (https://arxiv.org/abs/2004.04906 / https://github.com/facebookresearch/DPR). I'll send you a message offline to discuss further.

@jxmorris12
Owner

Oh, but I don't think Table 2 is decoding from any length longer than the training sequences. I train on sequences up to 128 tokens and use those for testing too. I never test on embedded sequences of more than 128 tokens, but that sounds really interesting!

@carriex
Author

carriex commented Mar 20, 2024

Oh, I see! Are the results in Table 2 reported for the model trained on OpenAI embeddings of the MSMARCO dataset with a mix of different sequence lengths (looking at the section below)?

[screenshot: dataset-construction section from the paper]

thanks again!

@jxmorris12
Owner

Yes, the MSMARCO longer-sequence-length dataset included sequences from 1 to 128 tokens.

@carriex
Author

carriex commented Apr 16, 2024

Hi there!

I am trying to reproduce the results for the OpenAI model trained on MSMARCO (up to 128 tokens, last section in Table 1). Is the code below the correct command/model to run?

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I am currently running into some errors (a hard-coded path not found, etc.), but I wanted to make sure this is the right model/setup to look at. Thanks!

[screenshot: error traceback showing the hard-coded path not found]

@carriex carriex reopened this Apr 16, 2024
@jxmorris12
Owner

Hi @carriex -- this looks right! I'm pretty sure that's the right model. Can you share the error with me? Or maybe we can work out of a Colab to get this figured out. Sorry for the hardcoded path; I'm not sure where it is but I will remove it for you!

@carriex
Author

carriex commented Apr 29, 2024

Sorry for the late reply! Here is a colab notebook showing the error.

@jxmorris12
Owner

Ok, there was something weird with the pre-trained model on Hugging Face, which I will look into. For now, I developed a workaround; here's some code that properly loads the hypothesizer model from its pre-trained checkpoint:

import torch

from vec2text.analyze_utils import args_from_config
from vec2text.models.config import InversionConfig
from vec2text.run_args import DataArguments, ModelArguments, TrainingArguments

from vec2text import experiments

def load_experiment_and_trainer_from_pretrained(name: str, use_less_data: int = 1000):
    config = InversionConfig.from_pretrained(name)
    model_args = args_from_config(ModelArguments, config)
    data_args = args_from_config(DataArguments, config)
    training_args = args_from_config(TrainingArguments, config)

    data_args.use_less_data = use_less_data
    #######################################################################
    from accelerate.state import PartialState

    training_args._n_gpu = 1 if torch.cuda.is_available() else 0  # Don't load in DDP
    training_args.bf16 = 0  # no bf16 in case no support from GPU
    training_args.local_rank = -1  # Don't load in DDP
    training_args.distributed_state = PartialState()
    training_args.deepspeed_plugin = None  # For backwards compatibility
    # training_args.dataloader_num_workers = 0  # no multiprocessing :)
    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"
    training_args.use_wandb = False
    training_args.report_to = []
    training_args.mock_embedder = False
    training_args.output_dir = "saves/" + name.replace("/", "__")
    ########################################################################

    experiment = experiments.experiment_from_args(
        model_args,
        data_args,
        training_args,
    )
    trainer = experiment.load_trainer()
    trainer.model = trainer.model.__class__.from_pretrained(name)
    trainer.model.to(training_args.device)
    return experiment, trainer
  
experiment, trainer = load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=1000,
)

print(" >>>> test ")
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


print(" >>>> loaded datasets ")

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

@jxmorris12
Owner

(The only line I changed was adding this:)

    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"

@Hannibal046

Hi, I want to know if this is the right command to reproduce gtr-nq-32-50iter-sbeam:

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct"
)
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)

val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)
trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 4
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I got this:
[screenshot: evaluation results]

@jxmorris12
Owner

Hmm, the command looks right and the numbers are close but a little low. Oddly the dataset looks different -- I've never seen that example ("Toonimo Toonimo is a...") before. Are you using the proper MSMarco split? Maybe a newer dataset version was uploaded or something else changed that's dropping the score a bit.

Also how many samples are you using from the validation set?

@Hannibal046

Hi, thanks for the response! If I understand correctly, "jxm/gtr__nq__32__correct" should use an NQ split for testing, not MSMARCO? I didn't change the number of test samples, and the trainer state is loaded directly from "jxm/gtr__nq__32__correct".

To clarify, what is the expected number for this model? Is it the last row in the figure below? Thanks in advance.
[screenshot: results table from the paper with the relevant row highlighted]

@jxmorris12
Owner

jxmorris12 commented Jun 24, 2024

Yep, it should be the last number in the figure, the one you highlighted. And you're right -- it should be the NQ validation set (not MSMARCO, my mistake). Something else must have changed between your setup and mine, because the numbers in red are correct. I will put some thought into what it may be.

@Hannibal046

Hi @jxmorris12, do you think this might be relevant?
ielab/vec2text-dense_retriever-threat#1
The default value of return_best_hypothesis is set to False in the snippet above. After manually setting it to True, this is what I got:
[screenshot: evaluation results with return_best_hypothesis=True]
It looks much better now, but is still a little lower. I want to know if this is the exact data split reported in the paper: the first 500 samples from the dev split of jxm/nq_corpus_dpr.
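
For reference, this is roughly the change on top of my snippet above (a sketch -- it assumes return_best_hypothesis can be set directly on the trainer, as in the linked issue, and the .select(range(500)) call is just the 500-sample split I'm asking about):

trainer.return_best_hypothesis = True   # keep the best-scoring hypothesis instead of the last one
trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 4
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"].select(range(500))  # first 500 dev samples?
)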
