This is the code to reproduce the results in our EMNLP 2021 paper *Levenshtein Training for Word-level Quality Estimation*.
We ran all of our experiments with a large ducttape workflow, which also includes a lot of system-specific setup for our environment. We therefore provide a concise bash script that reproduces the best result from our LevT checkpoints, along with the raw workflow for you to adapt to your own environment or to dig into aspects the bash script does not cover. If you have any questions about the workflow, feel free to open an issue and I'll do my best to answer.
We are going to assume that you have the following binaries reachable from `$PATH`:

- `spm_encode` and `spm_decode`, installed from https://github.com/google/sentencepiece (we used v0.1.5, but any version should work)
- `teralign`, installed from https://github.com/marian-nmt/moses-scorers (compile the binary following the `README.md` of that repo)
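Before going further, it may help to confirm that those binaries are actually visible on `$PATH`. A quick check such as the following (just a convenience sketch, not part of the repo) will flag anything missing:

```shell
# Report which of the required binaries are reachable from $PATH.
for bin in spm_encode spm_decode teralign; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "found: $bin"
  else
    echo "MISSING: $bin"
  fi
done
```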
Note that instead of using the more popular `tercom`, we use our own implementation, `teralign`, to compute TER (which, in our opinion, is easier to use). We do, however, see slight mismatches between the edit tags generated by `tercom` and `teralign`, due to ambiguities in beam search. Hence, to produce results comparable with previous WMT submissions, please do not generate your own edit tags on the test set with `teralign` and evaluate against them.
We assume that `$BASE` is the path of this repository on your system:

```bash
BASE=/path/to/repo
```
```bash
# untar some data
cd $BASE/data/data/post-editing/train
tar -zxvf en-de-train.tar.gz
tar -zxvf en-zh-train.tar.gz
cd $BASE/data/data/post-editing/dev
tar -zxvf en-de-dev.tar.gz
tar -zxvf en-zh-dev.tar.gz
cd $BASE/data/data/post-editing/test
tar -zxvf en-de-test.tar.gz
tar -zxvf en-zh-test.tar.gz

# download BPE model and vocabulary
mkdir -p $BASE/models
cd $BASE/models
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model
wget https://dl.fbaipublicfiles.com/m2m_100/model_dict.128k.txt
```
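If you want to make sure the downloads completed, a small sanity check like the one below verifies that the files exist and are non-empty (a sketch; the `check_nonempty` helper is ours, not something shipped with the repo):

```shell
# Sketch: flag any expected file that is missing or zero-length.
check_nonempty() {
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f"
    fi
  done
}
check_nonempty "$BASE/models/spm.128k.model" "$BASE/models/model_dict.128k.txt"
```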
You'll also need to download our intermediate checkpoints:
```bash
cd $BASE/models
# the run.sh script is set up to reproduce the best en-de result (MCC=0.589)
# we have a few other checkpoints for download:
# en-de M2M w/o synthetic pre-training: https://www.cs.jhu.edu/~sding/downloads/emnlp2021/emnlp2021-en-de-nat.pt (MCC=0.583)
# en-zh M2M w/o synthetic pre-training: https://www.cs.jhu.edu/~sding/downloads/emnlp2021/emnlp2021-en-zh-nat.pt (MCC=0.633)
# en-zh M2M w synthetic pre-training: https://www.cs.jhu.edu/~sding/downloads/emnlp2021/emnlp2021-en-zh-best.pt (MCC=0.646)
wget https://www.cs.jhu.edu/~sding/downloads/emnlp2021/emnlp2021-en-de-best.pt
```
Open `run.sh` and update the value of `BASE` to the path where you stored the repo. Then simply running `bash run.sh` should reproduce the best en-de result for you.
You can configure `checkpoint`, `src`, and `tgt` to reproduce the other results built from the M2M model.
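For example, switching to the en-zh checkpoint with synthetic pre-training would look roughly like the following (a sketch only: the exact variable spellings and expected values in your copy of `run.sh` are authoritative, and `en`/`zh` language codes are our assumption):

```shell
# near the top of run.sh (check your copy for the exact variable names)
checkpoint=$BASE/models/emnlp2021-en-zh-best.pt  # downloaded via the wget commands above
src=en
tgt=zh
```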
If you use this codebase, or the `teralign` binary from the `moses-scorers` repo, please cite the following paper:
```bibtex
@inproceedings{ding-etal-2021-levenshtein,
    title = "{L}evenshtein Training for Word-level Quality Estimation",
    author = "Ding, Shuoyang and
      Junczys-Dowmunt, Marcin and
      Post, Matt and
      Koehn, Philipp",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.539",
    pages = "6724--6733",
    abstract = "We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we propose a two-stage transfer learning procedure on both augmented data and human post-editing data. We also propose heuristics to construct reference labels that are compatible with subword-level finetuning and inference. Results on WMT 2020 QE shared task dataset show that our proposed method has superior data efficiency under the data-constrained setting and competitive performance under the unconstrained setting.",
}
```