- This repo is for the paper *Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?*
- Our code and data will be released soon!
ConsisEval is developed to systematically evaluate the hard-to-easy consistency of LLMs. Hard-to-easy inconsistency refers to the counter-intuitive phenomenon where LLMs, despite being capable of solving hard problems, paradoxically fail at easier ones.
ConsisEval includes 732 pairs of questions from the code (164), mathematics (298), and instruction-following (270) domains. Notably, ConsisEval contains only pairwise data: each datum comprises two questions (an easy question and a harder one), with a strict order of difficulty between them (a hypothetical example of one datum is sketched after the list below).
- Easy data is collected from GSM8K, IFEval, and HumanEval.
- Hard data is derived from the easy data by automatic generation and human annotation.
- Consistency Score (CS): the conditional probability of a model correctly answering an easy question given that it has correctly answered the corresponding harder one, i.e., CS = P(easy correct | hard correct); see the sketch below.
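
For illustration, a single paired datum might look like the following; the field names here are hypothetical, since the released data format may differ.

```python
# A hypothetical paired datum in ConsisEval: one easy question and its
# strictly harder counterpart. Field names are illustrative only, not
# the actual released schema.
example_datum = {
    "domain": "mathematics",  # one of: code, mathematics, instruction-following
    "easy": {"question": "...", "answer": "..."},  # collected from GSM8K / IFEval / HumanEval
    "hard": {"question": "...", "answer": "..."},  # derived from the easy question
}
```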
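
A minimal sketch of how CS = P(easy correct | hard correct) could be estimated from per-pair evaluation results follows; the result format (a list of dicts with `easy_correct`/`hard_correct` flags) is an assumption for illustration, not the repo's actual interface.

```python
def consistency_score(results: list[dict]) -> float:
    """Estimate CS = P(easy correct | hard correct) over paired results.

    Each item is assumed to be {"easy_correct": bool, "hard_correct": bool};
    this schema is hypothetical, not the repo's actual output format.
    """
    hard_solved = [r for r in results if r["hard_correct"]]
    if not hard_solved:
        return 0.0  # no hard question solved: CS is undefined; report 0 here
    return sum(r["easy_correct"] for r in hard_solved) / len(hard_solved)

# Toy example: of the 3 pairs where the hard question is solved, the
# easy one is missed once, so CS = 2/3.
results = [
    {"easy_correct": True,  "hard_correct": True},
    {"easy_correct": False, "hard_correct": True},   # hard-to-easy inconsistency
    {"easy_correct": True,  "hard_correct": True},
    {"easy_correct": True,  "hard_correct": False},  # ignored: hard not solved
]
print(consistency_score(results))  # 0.666...
```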
```bibtex
@misc{yang2024large,
      title={Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?},
      author={Zhe Yang and Yichang Zhang and Tianyu Liu and Jian Yang and Junyang Lin and Chang Zhou and Zhifang Sui},
      year={2024},
      eprint={2406.12809},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```