
Unable to Reproduce Results for LongBench #27

Closed
ilil96 opened this issue Aug 26, 2024 · 2 comments

ilil96 commented Aug 26, 2024

Hello,

I ran the code provided for LongBench using the Llama-3-8B-Instruct model but couldn't reproduce the results reported in Table 8 of your paper. Specifically, the full-precision baseline's Qasper score in my run is 32.11, whereas the paper reports 44.24.

I used the following command to run the model:
python pred_long_bench.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --k_bits 16 --v_bits 16

Is there anything I might be missing?

jy-yuan (Owner) commented Aug 27, 2024

Hi,

For Llama-3 Instruct models, please add the prompt template as shown here. We've updated the code in pred_long_bench.py accordingly. Please give it a try, and feel free to ask if you have any questions!
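
Roughly, the change amounts to something like the following minimal sketch, assuming the Hugging Face tokenizer's built-in Llama-3 chat template (the updated pred_long_bench.py is the authoritative version):

# Sketch only: wrap each LongBench prompt in the Llama-3 Instruct chat
# template via the HF tokenizer, rather than feeding the raw prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_prompt(raw_prompt: str) -> str:
    # The LongBench prompt goes in as a single user turn;
    # add_generation_prompt appends the assistant header so the
    # model starts answering directly instead of continuing the prompt.
    messages = [{"role": "user", "content": raw_prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )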

Thanks!

henryzhongsc commented Sep 14, 2024

The meta-llama/Meta-Llama-3-8B-Instruct results in KIVI's Table 8 are actually borrowed from our recent KV cache compression benchmark paper (https://arxiv.org/abs/2407.01527), and we have just open-sourced its codebase at https://github.com/henryzhongsc/longctx_bench. The KIVI paper was completed prior to the release of Llama 3, so we may not have included all the necessary support for these newer models in KIVI's public codebase (the template issue above being one example).

In any case, I can confirm that our Table 8 results in KIVI are reproducible via the two scripts (Llama 3 baseline, Llama 3 with KIVI-2). For your convenience, here are the task summaries:

Llama-3-8B-Instruct Baseline

{
    "individual_dataset_result": {
        "narrativeqa": 21.71,
        "qasper": 44.24,
        "multifieldqa_en": 44.54,
        "hotpotqa": 46.82,
        "2wikimqa": 36.42,
        "musique": 21.49,
        "gov_report": 30.04,
        "qmsum": 22.57,
        "multi_news": 27.86,
        "trec": 74.5,
        "triviaqa": 90.23,
        "samsum": 42.63,
        "passage_retrieval_en": 67.0,
        "lcc": 57.04,
        "repobench-p": 51.12,
        "passage_count": 7.0
    },
    "task_average_result": {
        "single_doc_qa": 36.83,
        "multi_doc_qa": 34.91,
        "summarization": 26.82,
        "few_shots": 69.12,
        "synthetic": 67.0,
        "code": 54.08
    },
    "LB_average_result": 45.21
}

Llama-3-8B-Instruct with KIVI-2bit

{
    "individual_dataset_result": {
        "narrativeqa": 21.35,
        "qasper": 43.15,
        "multifieldqa_en": 44.23,
        "hotpotqa": 46.79,
        "2wikimqa": 37.05,
        "musique": 20.56,
        "gov_report": 29.77,
        "qmsum": 22.1,
        "multi_news": 27.48,
        "trec": 74.5,
        "triviaqa": 90.54,
        "samsum": 42.48,
        "passage_retrieval_en": 67.5,
        "lcc": 50.84,
        "repobench-p": 46.67,
        "passage_count": 7.0
    },
    "task_average_result": {
        "single_doc_qa": 36.24,
        "multi_doc_qa": 34.8,
        "summarization": 26.45,
        "few_shots": 69.17,
        "synthetic": 67.5,
        "code": 48.76
    },
    "LB_average_result": 44.33
}

(Note: passage_count is excluded from the average results, as it is very much an outlier.)
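
For completeness, here is a short sketch of how the baseline averages above follow from the per-dataset scores, assuming the standard LongBench task groupings and the passage_count exclusion described in the note:

# Recompute the baseline task averages and LongBench average from the
# per-dataset scores above; passage_count (7.0) is deliberately left out.
scores = {
    "narrativeqa": 21.71, "qasper": 44.24, "multifieldqa_en": 44.54,
    "hotpotqa": 46.82, "2wikimqa": 36.42, "musique": 21.49,
    "gov_report": 30.04, "qmsum": 22.57, "multi_news": 27.86,
    "trec": 74.5, "triviaqa": 90.23, "samsum": 42.63,
    "passage_retrieval_en": 67.0, "lcc": 57.04, "repobench-p": 51.12,
}

# Standard LongBench category groupings (an assumption matching the
# task_average_result keys above).
tasks = {
    "single_doc_qa": ["narrativeqa", "qasper", "multifieldqa_en"],
    "multi_doc_qa": ["hotpotqa", "2wikimqa", "musique"],
    "summarization": ["gov_report", "qmsum", "multi_news"],
    "few_shots": ["trec", "triviaqa", "samsum"],
    "synthetic": ["passage_retrieval_en"],
    "code": ["lcc", "repobench-p"],
}

task_avg = {t: round(sum(scores[d] for d in ds) / len(ds), 2)
            for t, ds in tasks.items()}
lb_avg = round(sum(scores.values()) / len(scores), 2)  # -> 45.21
print(task_avg)
print(lb_avg)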
