
Reproducing results from paper #40

Open
carriex opened this issue Mar 19, 2024 · 17 comments
Labels
question Further information is requested

Comments

@carriex

carriex commented Mar 19, 2024

Hi Jack,

Thanks for the great work and for sharing the code! I am trying to reproduce the results from the paper and want to confirm that I am doing it correctly.

Specifically, I ran the code below:

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

and got the results below:
{'eval_loss': 0.6015774011611938, 'eval_pred_num_tokens': 31.0, 'eval_true_num_tokens': 32.0, 'eval_token_set_precision': 0.9518449167645596, 'eval_token_set_recall': 0.9564611513833035, 'eval_token_set_f1': 0.9538292487776809, 'eval_token_set_f1_sem': 0.004178347129611342, 'eval_n_ngrams_match_1': 23.128, 'eval_n_ngrams_match_2': 20.244, 'eval_n_ngrams_match_3': 18.212, 'eval_num_true_words': 24.308, 'eval_num_pred_words': 24.286, 'eval_bleu_score': 83.32868888524891, 'eval_bleu_score_sem': 1.1145241315071208, 'eval_rouge_score': 0.9550079258714326, 'eval_exact_match': 0.578, 'eval_exact_match_sem': 0.022109039310618563, 'eval_emb_cos_sim': 0.9910151958465576, 'eval_emb_cos_sim_sem': 0.0038230661302804947, 'eval_emb_top1_equal': 0.75, 'eval_emb_top1_equal_sem': 0.11180339753627777, 'eval_runtime': 253.6454, 'eval_samples_per_second': 1.971, 'eval_steps_per_second': 0.126}

Are the numbers here supposed to correspond to "GTR - NQ - Vec2Text [20 steps]" in Table 1 (row 7)? Most of the numbers are close, except for exact match, where I got a higher number (57.8 vs. 40.2 in the paper).

Thanks again!

@jxmorris12
Owner

Yep, this looks right to me. I think we trained the model for more steps after submission, which is why the scores went up a little bit. To get the higher score, you have to set the sequence beam width to 8 and the number of steps to 50.
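
Concretely, on top of the snippet above, that would look something like this (just a sketch; the eval batch size is carried over from your run and may need lowering to fit the wider beam in memory):

trainer.args.per_device_eval_batch_size = 16  # may need to be smaller with the wider beam
trainer.sequence_beam_width = 8       # sequence-level beam search ("sbeam")
trainer.num_gen_recursive_steps = 50  # number of correction steps
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)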

@jxmorris12 jxmorris12 added the question Further information is requested label Mar 19, 2024
@carriex
Author

carriex commented Mar 19, 2024

awesome, thanks for the quick response!

@carriex carriex closed this as completed Mar 19, 2024
@carriex
Author

carriex commented Mar 19, 2024

One follow-up question -- how are the train/dev splits for the NQ experiments constructed (are they split randomly at the article level or at the truncated-passage level)?

The dev dataset looks like randomly sampled passages from different articles (i.e. the second row is not the continuation of the first row).

[screenshot: example rows from the dev dataset]

A bit more background: I am trying to test the model on longer sequences (e.g. 2x the length for Wikipedia passages), so I was thinking of simply concatenating the passages in the dev set (which I think only makes sense if they are consecutive); see the sketch below. It seems like some experiments in the paper (Table 2) look at decoding from lengths longer than the training sequences. I'd appreciate pointers on how to reproduce some of those results too!
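
To illustrate what I mean by concatenating (purely a sketch -- it assumes consecutive rows really are consecutive chunks of the same article, which is exactly what I'm asking about, and the "text" column name is just a placeholder):

from datasets import Dataset

# `passages` stands in for the dev-set passages, in their original order.
passages = ["passage 0 ...", "passage 1 ...", "passage 2 ...", "passage 3 ..."]

# Join passage 2i with passage 2i+1 to get roughly double-length sequences.
doubled = [" ".join(passages[i:i + 2]) for i in range(0, len(passages) - 1, 2)]

# Wrap the results into a dataset with a placeholder "text" column to embed and invert.
long_dev = Dataset.from_dict({"text": doubled})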

Thanks a lot!

@jxmorris12
Owner

Hi! I took the train and validation sets from DPR (https://arxiv.org/abs/2004.04906 / https://github.com/facebookresearch/DPR). I'll send you a message offline to discuss further.

@jxmorris12
Owner

Oh, but I don't think Table 2 is decoding from any length longer than the training sequences. I train on sequences up to 128 tokens and use those for testing too. I never test on embedded sequences of more than 128 tokens, but that sounds really interesting!

@carriex
Author

carriex commented Mar 20, 2024

Oh, I see! Are the results in Table 2 reported for the model trained on OpenAI embeddings of the MSMARCO dataset with a mix of different sequence lengths (looking at the section below)?

[screenshot: dataset-construction section from the paper]

thanks again!

@jxmorris12
Owner

Yes, the MSMARCO longer-sequence-length dataset included sequences from 1 to 128 tokens.

@carriex
Author

carriex commented Apr 16, 2024

Hi there!

I am trying to reproduce the results for the OpenAI model trained on MSMARCO (up to 128 tokens, last section in Table 1). Is the code below the correct command/model to run?

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=-1,  # use all data
)

train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I am currently running into some errors (a hard-coded path not found, etc.), but I wanted to make sure this is the right model/setup to look at. Thanks!

[screenshot: error traceback showing the hard-coded path not found]

@carriex carriex reopened this Apr 16, 2024
@jxmorris12
Owner

Hi @carriex -- this looks right! I'm pretty sure that's the right model. Can you share the error with me? Or maybe we can work out of a Colab to get this figured out. Sorry for the hardcoded path; I'm not sure where it is but I will remove it for you!

@carriex
Author

carriex commented Apr 29, 2024

Sorry for the late reply! Here is a colab notebook showing the error.

@jxmorris12
Owner

Ok, there was something weird with the pre-trained model on Hugging Face, which I will look into. For now, I developed a workaround; here's some code that properly loads the hypothesizer model from its pre-trained checkpoint:

import torch

from vec2text.analyze_utils import args_from_config
from vec2text.models.config import InversionConfig
from vec2text.run_args import DataArguments, ModelArguments, TrainingArguments

from vec2text import experiments

def load_experiment_and_trainer_from_pretrained(name: str, use_less_data: int = 1000):
    config = InversionConfig.from_pretrained(name)
    model_args = args_from_config(ModelArguments, config)
    data_args = args_from_config(DataArguments, config)
    training_args = args_from_config(TrainingArguments, config)

    data_args.use_less_data = use_less_data
    #######################################################################
    from accelerate.state import PartialState

    training_args._n_gpu = 1 if torch.cuda.is_available() else 0  # Don't load in DDP
    training_args.bf16 = 0  # no bf16 in case no support from GPU
    training_args.local_rank = -1  # Don't load in DDP
    training_args.distributed_state = PartialState()
    training_args.deepspeed_plugin = None  # For backwards compatibility
    # training_args.dataloader_num_workers = 0  # no multiprocessing :)
    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"
    training_args.use_wandb = False
    training_args.report_to = []
    training_args.mock_embedder = False
    training_args.output_dir = "saves/" + name.replace("/", "__")
    ########################################################################

    experiment = experiments.experiment_from_args(
        model_args,
        data_args,
        training_args,
    )
    trainer = experiment.load_trainer()
    trainer.model = trainer.model.__class__.from_pretrained(name)
    trainer.model.to(training_args.device)
    return experiment, trainer
  
experiment, trainer = load_experiment_and_trainer_from_pretrained(
    "jxm/vec2text__openai_ada002__msmarco__msl128__corrector",
    use_less_data=1000,
)

print(" >>>> test ")
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)


print(" >>>> loaded datasets ")

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

@jxmorris12
Owner

(The only line I changed was adding this:)

    training_args.corrector_model_from_pretrained = "jxm/vec2text__openai_ada002__msmarco__msl128__hypothesizer"

@Hannibal046

Hi, I want to know if this is the right command to reproduce gtr-nq-32-50iter-sbeam:

from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct"
)
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)

val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)
trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 4
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)

I got this:
[screenshot: evaluation results]

@jxmorris12
Owner

Hmm, the command looks right and the numbers are close but a little low. Oddly the dataset looks different -- I've never seen that example ("Toonimo Toonimo is a...") before. Are you using the proper MSMarco split? Maybe a newer dataset version was uploaded or something else changed that's dropping the score a bit.

Also how many samples are you using from the validation set?

@Hannibal046

Hi, thanks for the response! If I understand correctly, "jxm/gtr__nq__32__correct" should use an NQ split for testing, not MSMARCO? I didn't change the number of test samples, and the trainer state is loaded directly from "jxm/gtr__nq__32__correct".

To clarify, what is the expected number for this model? Is it the last row in the figure below? Thanks in advance.
[screenshot: results table from the paper with the relevant row highlighted]

@jxmorris12
Owner

jxmorris12 commented Jun 24, 2024

Yep, it should be the last number in the figure, the one you highlighted. And you're right -- it should be the NQ validation set (not MSMARCO, my mistake). Something else must have changed between your setup and mine, because the numbers in red are correct. I will put some thought into what it may be.

@Hannibal046

Hi @jxmorris12, do you think this might be relevant?
ielab/vec2text-dense_retriever-threat#1
The default value of return_best_hypothesis is set to False in the snippet above. After manually setting it to True, this is what I got:
[screenshot: evaluation results with return_best_hypothesis=True]
It looks much better now, but is still a little lower. I want to know if this is the exact data split reported in the paper: the first 500 samples from the dev split of jxm/nq_corpus_dpr.
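
For reference, this is roughly the change on top of my snippet above (a sketch -- it assumes return_best_hypothesis can be set directly on the trainer, as in the linked issue, and the .select(range(500)) call is just the 500-sample split I'm asking about):

trainer.return_best_hypothesis = True   # keep the best-scoring hypothesis instead of the last one
trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 4
trainer.num_gen_recursive_steps = 50
trainer.evaluate(
    eval_dataset=train_datasets["validation"].select(range(500))  # first 500 dev samples?
)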
