# LLM Evaluation

## Using lm-evaluation-harness

You can evaluate LitGPT models using EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (lm-eval) framework, which supports a large number of different evaluation tasks.

You need to install the lm-eval framework first:

```bash
pip install lm_eval
```
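To verify the installation before kicking off a long evaluation run, you can optionally query pip for the package metadata:

```bash
# Optional sanity check: prints the installed lm_eval version and location
pip show lm_eval
```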

 

### Evaluating LitGPT base models

Suppose we downloaded a base model that we want to evaluate. Here, we use the `microsoft/phi-2` model:

```bash
litgpt download microsoft/phi-2
```

The download command above will save the model to the `checkpoints/microsoft/phi-2` directory, which we can then pass to the following evaluation command:

```bash
litgpt evaluate checkpoints/microsoft/phi-2/ \
  --batch_size 4 \
  --tasks "hellaswag,truthfulqa_mc2,mmlu" \
  --out_dir evaluate_model/
```

The resulting output is as follows:

```
...
|---------------------------------------|-------|------|-----:|--------|-----:|---|-----:|
...
|truthfulqa_mc2                         |      2|none  |     0|acc     |0.4656|±  |0.0164|
|hellaswag                              |      1|none  |     0|acc     |0.2569|±  |0.0044|
|                                       |       |none  |     0|acc_norm|0.2632|±  |0.0044|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2434|±  |0.0036|
| - humanities     |N/A    |none  |     0|acc   |0.2578|±  |0.0064|
| - other          |N/A    |none  |     0|acc   |0.2401|±  |0.0077|
| - social_sciences|N/A    |none  |     0|acc   |0.2301|±  |0.0076|
| - stem           |N/A    |none  |     0|acc   |0.2382|±  |0.0076|
```
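Evaluation artifacts should also end up under the directory passed via `--out_dir`. As a minimal sketch, assuming the run saved its metrics to a `results.json` file in that directory (an assumed file name; it can differ between versions, so check the directory listing first), you could inspect them from the shell:

```bash
# See which artifacts the evaluation run produced
ls evaluate_model/

# Pretty-print the saved metrics (results.json is an assumed file name;
# adjust it to whatever the listing above shows)
python -m json.tool evaluate_model/results.json
```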

Please note that the `litgpt evaluate` command runs an internal model conversion. This is only necessary the first time you evaluate a model; the conversion step is skipped if you run `litgpt evaluate` on the same checkpoint directory again.

In some cases, for example, if you modified the model in the `checkpoint_dir` since the first `litgpt evaluate` call, you need to use the `--force_conversion` flag to update the files used by `litgpt evaluate` accordingly:

```bash
litgpt evaluate checkpoints/microsoft/phi-2/ \
  --batch_size 4 \
  --out_dir evaluate_model/ \
  --tasks "hellaswag,truthfulqa_mc2,mmlu" \
  --force_conversion true
```

> [!TIP]
> Run `litgpt evaluate ...` without specifying `--tasks` to print a list of the supported tasks.

> [!TIP]
> The evaluation may take a long time; for testing purposes, you may want to reduce the number of tasks or set a limit for the number of examples per task, for example, `--limit 10`, as shown in the example below.
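For instance, a quick smoke test that runs only ten examples of a single task could look like this (same flags as in the commands above):

```bash
litgpt evaluate checkpoints/microsoft/phi-2/ \
  --batch_size 4 \
  --tasks "hellaswag" \
  --limit 10 \
  --out_dir evaluate_model/
```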

 

### Evaluating LoRA-finetuned LLMs

No further conversion is necessary when evaluating LoRA-finetuned models, as the `litgpt finetune_lora` command already prepares the necessary merged model files:

```bash
litgpt finetune_lora checkpoints/microsoft/phi-2 \
  --out_dir lora_model
```
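Before running the evaluation, you can optionally confirm that the merged model files were written to the `final` subdirectory referenced in the next command (a quick sanity check based on the `--out_dir` used above):

```bash
# Optional: confirm the merged checkpoint exists under lora_model/final
ls lora_model/final/
```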

 

```bash
litgpt evaluate lora_model/final \
  --batch_size 4 \
  --tasks "hellaswag,truthfulqa_mc2,mmlu" \
  --out_dir evaluate_model/
```