CEREC is a corpus of 6001 Enron email threads weakly annotated for the entity coreference resolution task. The actual emails can be downloaded from here.
More details are available in our paper (which should be cited if you use or discuss CEREC in your work). An updated copy has been published here.
@inproceedings{dakle-moldovan-2020-cerec,
  title = "{CEREC}: A Corpus for Entity Resolution in Email Conversations",
  author = "Dakle, Parag Pravin and Moldovan, Dan",
  booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
  month = dec,
  year = "2020",
  address = "Barcelona, Spain (Online)",
  publisher = "International Committee on Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.coling-main.30",
  pages = "339--349",
  abstract = "We present the first large scale corpus for entity resolution in email conversations (CEREC). The corpus consists of 6001 email threads from the Enron Email Corpus containing 36,448 email messages and 38,996 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort. Experiments are carried out for evaluating different features and performance of four baselines on the created corpus. For the task of mention identification and coreference resolution, a best performance of 54.1 F1 is reported, highlighting the room for improvement. An in-depth qualitative and quantitative error analysis is presented to understand the limitations of the baselines considered.",
}
For the updated copy of the paper with corrected numbers, use the following citation:
@article{dakle-moldovan-2020-cerec,
  title={CEREC: A Corpus for Entity Resolution in Email Conversations},
  url={http://dx.doi.org/10.18653/v1/2020.coling-main.30},
  DOI={10.18653/v1/2020.coling-main.30},
  journal={Proceedings of the 28th International Conference on Computational Linguistics},
  publisher={International Committee on Computational Linguistics},
  author={Dakle, Parag Pravin and Moldovan, Dan},
  year={2020}
}
CEREC contains 6001 email threads from the Enron Email Corpus, comprising 36,448 email messages and 38,996 entity coreference chains.
To use only the seed corpus, follow the instructions provided here.
An email thread annotation is saved in the CoNLL format with the following naming convention:
username_directory_email_no.conll
where:
username - Name of the user directory in the Enron Email Corpus.
directory - Name of the directory for the specific user.
email_no - Filename in the specific directory.
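The naming convention above can be taken apart programmatically. The following is a minimal sketch, not part of the CEREC code: it assumes that neither the username nor the email filename contains an underscore, so any additional underscores are attributed to the directory name. The function name `parse_annotation_filename` and the example filename are hypothetical.

```python
from typing import NamedTuple


class ThreadId(NamedTuple):
    username: str   # user directory in the Enron Email Corpus
    directory: str  # directory for that user
    email_no: str   # filename inside that directory


def parse_annotation_filename(filename: str) -> ThreadId:
    """Split 'username_directory_email_no.conll' into its three parts.

    Assumption (not stated in this README): the username and the email
    number contain no underscores, so extra underscores belong to the
    directory name.
    """
    suffix = ".conll"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .conll file: {filename!r}")
    stem = filename[: -len(suffix)]

    # Everything before the first underscore is taken as the username.
    username, sep1, rest = stem.partition("_")
    # Everything after the last underscore is taken as the email number.
    directory, sep2, email_no = rest.rpartition("_")

    if not (sep1 and sep2 and username and directory and email_no):
        raise ValueError(f"unexpected filename layout: {filename!r}")
    return ThreadId(username, directory, email_no)


# Hypothetical filename, for illustration only.
example = parse_annotation_filename("allen-p_sent_items_12.conll")
print(example)
```

If a username in your copy of the data does contain an underscore, this split is ambiguous and the result should be checked against the Enron directory listing.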
Each annotation file is an eight-column file whose columns are separated by double tabs, and it contains mention and coreference annotations. The columns, in the order in which they appear, are:
Column | Type | Description |
---|---|---|
1 | Token | The actual token as found in the email thread |
2 | MI | The value of message identifier feature for the token |
3 | SI | The value of section identifier feature for the token |
4 | Speaker | The speaker of the token |
5 | Entity Type | The type of the entity this token represents. This column also carries two additional annotations: the coreference chain for the entity type, encoded in a parenthesis structure, and whether the token is part of the antecedent, marked by `*`. For example, in `(P0*`, `(` means the token starts a mention span, `P` means the token is of the PER entity type, `0` means the token belongs to the coreference chain with id 0 for the PER entity type, and `*` means it is part of the antecedent of that chain. |
6 | Mention | Mention information encoded in a parenthesis structure. |
7 | Coreference | Coreference chain information encoded in a parenthesis structure. |
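As an illustration of the column layout, the sketch below splits one annotation line on double tabs and decodes an entity-type tag of the documented form `(P0*`. It is not the authors' reader. It assumes that the opening parenthesis and the `*` marker are both optional, it keeps the type letter as-is because only `P` (PER) is documented here, and it returns `None` for any tag shape it does not recognise. The helper names and the sample row are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class EntityTag(NamedTuple):
    opens_span: bool     # True if the tag starts with "("
    entity_type: str     # type letter(s), e.g. "P" for PER
    chain_id: int        # coreference chain id within that entity type
    in_antecedent: bool  # True if the tag ends with "*"


# Assumed shape: optional "(", letters, digits, optional "*".
_TAG_RE = re.compile(r"^(\()?([A-Za-z]+)(\d+)(\*)?$")


def split_columns(line: str) -> List[str]:
    """Split one annotation line on the double-tab delimiter."""
    return line.rstrip("\n").split("\t\t")


def decode_entity_tag(tag: str) -> Optional[EntityTag]:
    """Decode a tag such as '(P0*'; return None for unrecognised shapes."""
    match = _TAG_RE.match(tag.strip())
    if match is None:
        return None
    paren, letters, digits, star = match.groups()
    return EntityTag(
        opens_span=paren is not None,
        entity_type=letters,
        chain_id=int(digits),
        in_antecedent=star is not None,
    )


# Hypothetical row, for illustration only; real rows may differ.
row = "Ken\t\t1\t\t0\t\tken\t\t(P0*\t\t(0\t\t(0\n"
columns = split_columns(row)
print(columns[0], decode_entity_tag(columns[4]))
```

Returning `None` instead of raising keeps the reader usable on files where column 5 is blank or system generated.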
The corpus is available as a zip file in the following directory:
data/COLING/
The zip file contains the following files:
- cerec.conll - The CEREC corpus containing 6001 email threads, and their mention and coreference annotations.
- cerec.validation.XX.conll - Email threads used in the validation set for CEREC experiments. These email threads were also used as validation and test sets for feature evaluation.
- mention.corrected.XX.conll - Email threads that were manually corrected for mention annotations and then used to train a model for annotation quality evaluation.
- seed.conll - The email threads from the seed corpus containing feature annotations and separated mention annotations.
Note: Annotations for columns 2-5 are provided only for the files in points 2 and 4 above (cerec.validation.XX.conll and seed.conll). For the other files, these columns are either blank or contain system-generated values.
The code used to generate the results can be found here. Evaluation scripts for all metrics can be found here.