
ConsisEval: A Hard-to-Easy Consistency Evaluation Benchmark for Large Language Models

Overview

ConsisEval is developed to systematically evaluate the hard-to-easy consistency of LLMs. Here, hard-to-easy inconsistency refers to the counter-intuitive phenomenon where LLMs, while capable of solving hard problems, paradoxically fail at easier ones.

ConsisEval includes 732 pairs of questions from the code (164), mathematics (298), and instruction-following (270) domains. Note that ConsisEval contains only pairwise data: each datum comprises two questions (an easy question and a harder one), and there is a strict order of difficulty between the two. A hypothetical example of one such pair is sketched below.
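
For illustration only, a single paired datum could be represented as follows. The field names here are hypothetical and do not describe the released data schema; the sketch only mirrors the pairing described above.

# Hypothetical representation of one ConsisEval datum (not the released schema).
datum = {
    "domain": "mathematics",   # one of: code, mathematics, instruction-following
    "easy_question": "An easy question drawn from an existing benchmark.",
    "hard_question": "A strictly harder question derived from the easy one.",
    "easy_answer": "...",
    "hard_answer": "...",
}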

Data Collection

  • Easy data is collected from GSM8K, IFEval, and HumanEval.
  • Hard data is derived from the easy data through automatic generation and human annotation.

Evaluation Metric

  • Consistency Score (CS): the conditional probability that a model correctly answers an easy question given that it has correctly answered the corresponding harder one, i.e. CS = P(easy correct | hard correct). See the sketch below this list.
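
The snippet below is a minimal sketch of this metric under the simplifying assumption that correctness on each question is recorded as a single boolean per pair; the official evaluation may estimate these probabilities differently (e.g., from multiple sampled responses).

# Minimal sketch of the Consistency Score: CS = P(easy correct | hard correct),
# assuming one (easy_correct, hard_correct) boolean pair per datum.
def consistency_score(results: list[tuple[bool, bool]]) -> float:
    """results: (easy_correct, hard_correct) for each question pair."""
    easy_given_hard = [easy for easy, hard in results if hard]
    if not easy_given_hard:
        return float("nan")  # undefined if no hard question was solved
    return sum(easy_given_hard) / len(easy_given_hard)

# Example: the model solves the hard question in 3 pairs and the easy one in 2 of those.
print(consistency_score([(True, True), (False, True), (True, True), (True, False)]))
# -> 0.666...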

Code and Data

  • Our code and data will be released soon!

Citation

@misc{yang2024large,
      title={Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?}, 
      author={Zhe Yang and Yichang Zhang and Tianyu Liu and Jian Yang and Junyang Lin and Chang Zhou and Zhifang Sui},
      year={2024},
      eprint={2406.12809},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
