LongWanjuan

This repository contains the code and data for the paper LongWanjuan: Towards Systematic Measurement for Long Text Quality.

Measurement

To systematically assess the quality of long texts, we adhere to linguistic fundamentals and evaluate them through three dimensions: coherence, cohesion, and complexity.

Coherence refers to the consistency and clarity of the text as a whole. A coherent text maintains thematic unity throughout its parts, with logical connections between the different sections.
Cohesion measures the degree of tight connection between two sentences or sections of the text, reflected in the use of connectives, pronouns, synonyms, and hypernyms/hyponyms.
Complexity assesses the level of linguistic sophistication in the use of language in the text. This can be gauged through the richness and diversity of vocabulary, as well as the complexity of sentence structures.

To better elucidate these dimensions, we provide examples in the following table.

Metric

Inspired by the three dimensions above, we propose quantitative metrics to assess the quality of long texts. They include both statistical and model-based metrics, where higher values correlate with more pronounced characteristics of the corresponding dimension. The code for calculating these metrics is in the coherence, cohesion, and complexity directories.
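As a rough illustration of what a statistical metric can look like, the sketch below computes a type-token ratio (vocabulary diversity, related to complexity) and a connective ratio (related to cohesion). This is not the implementation in this repository: the function names, the regex tokenizer, and the small `CONNECTIVES` list are assumptions made for the example, and it only handles English text.

```python
import re

# Hypothetical, deliberately tiny connective list (an assumption for this sketch).
CONNECTIVES = {"however", "therefore", "because", "moreover", "thus", "although"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Share of distinct tokens among all tokens; higher means more varied vocabulary."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def connective_ratio(text):
    """Share of tokens that are connectives; higher means more explicit linking."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in CONNECTIVES for t in tokens) / len(tokens)


if __name__ == "__main__":
    sample = "It rained; therefore we stayed."
    print(type_token_ratio(sample), connective_ratio(sample))
```

Both functions follow the convention stated above: a higher value indicates a more pronounced characteristic of the corresponding dimension.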

LongWanjuan

Based on the analysis and metrics discussed previously, we introduce LongWanjuan, a bilingual long-text dataset with over 160B tokens. The pipeline for constructing our dataset is illustrated in the following figure.

An overview of the dataset statistics in the English and Chinese part of LongWanjuan is as follows.

The distribution of token and document counts across different lengths is as follows. In LongWanjuan, over 99.9% of the data exceeds the truncation length used in pre-training.
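A length distribution of this kind can be reproduced from per-document token counts. The following is a minimal sketch, assuming the token counts are already available as a list of integers. The bucket edges and the threshold in the usage example are placeholders, not the values used for LongWanjuan, and the helper names are hypothetical.

```python
from bisect import bisect_right


def length_buckets(token_counts, edges):
    """Group documents by length.

    `edges` are ascending lower bounds of the buckets; the first edge should be 0.
    Returns {lower_edge: [document_count, token_sum]}.
    """
    buckets = {edge: [0, 0] for edge in edges}
    for n in token_counts:
        # Index of the last edge that is <= n.
        edge = edges[bisect_right(edges, n) - 1]
        buckets[edge][0] += 1
        buckets[edge][1] += n
    return buckets


def fraction_exceeding(token_counts, threshold):
    """Fraction of documents whose token count is strictly above `threshold`."""
    if not token_counts:
        return 0.0
    return sum(n > threshold for n in token_counts) / len(token_counts)


if __name__ == "__main__":
    counts = [1000, 5000, 40000, 4096]   # placeholder data
    print(length_buckets(counts, [0, 4096, 32768]))
    print(fraction_exceeding(counts, 4096))
```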

LongWanjuan is available at https://opendatalab.com/OpenLMLab/LongWanjuan under CC BY 4.0.

