Paper: [Neural Machine Translation with Byte-Level Subwords](https://arxiv.org/abs/1909.03341)
We provide an implementation of byte-level byte-pair encoding (BBPE), taking IWSLT 2017 Fr-En translation as an example.
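As a rough illustration of what "byte-level" means here (a sketch of the idea only; fairseq's actual `bytes`/`byte_bpe` encoders use their own byte-to-character mapping, and the helper names below are hypothetical), the base symbols that BPE merges are the UTF-8 bytes of the text rather than its characters:

```python
# Sketch only: represent text as a sequence of UTF-8 byte symbols (at most 256 of them).
# The token format and helper names are illustrative, not the fairseq API.

def to_byte_tokens(text: str) -> list:
    """One token per UTF-8 byte of the input string."""
    return [f"<{b:02x}>" for b in text.encode("utf-8")]

def from_byte_tokens(tokens: list) -> str:
    """Invert the mapping; undecodable byte sequences are dropped."""
    data = bytes(int(t[1:-1], 16) for t in tokens)
    return data.decode("utf-8", errors="ignore")

print(to_byte_tokens("déjà"))                     # ['<64>', '<c3>', '<a9>', '<6a>', '<c3>', '<a0>']
print(from_byte_tokens(to_byte_tokens("déjà")))   # déjà
```

With this representation, a vocabulary of a few thousand byte-level merges (e.g. `bbpe2048` or `bbpe4096` below) can cover any input, since unseen characters decompose into known bytes.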
Get data and generate the fairseq binary dataset:

```bash
bash ./get_data.sh
```
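The script produces the binarized datasets (`data/bin_*`) and SentencePiece models (`data/spm_*`) referenced below. If you want to learn a byte-level BPE vocabulary on your own data, a minimal sketch (placeholder file names, not the script's exact code) is to train an ordinary SentencePiece BPE model on the byte-level rendering of the text:

```python
# Sketch: learn a 2048-symbol BPE vocabulary over byte-transformed text.
# Assumes `sentencepiece` is installed and corpus.bytes.txt (placeholder path)
# already contains the byte-level rendering of the training corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.bytes.txt",      # placeholder path
    model_prefix="spm_bbpe2048",   # produces spm_bbpe2048.model / .vocab
    vocab_size=2048,
    model_type="bpe",
    character_coverage=1.0,        # keep all byte symbols in the vocabulary
)
```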
Train a Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`; a sketch of the idea is shown after the command below):
```bash
# VOCAB=bytes
# VOCAB=chars
VOCAB=bbpe2048
# VOCAB=bpe2048
# VOCAB=bbpe4096
# VOCAB=bpe4096
# VOCAB=bpe16384

fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
    --max-sentences 100 --max-update 100000 --update-freq 2
```
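The Bi-GRU embedding contextualization amounts to running the token embeddings through a bidirectional GRU and projecting the concatenated states back to the model dimension before the Transformer encoder layers see them. A minimal PyTorch sketch of that idea (illustrative only, not the code in `gru_transformer.py`):

```python
# Sketch of Bi-GRU embedding contextualization (illustrative, not gru_transformer.py itself).
import torch
import torch.nn as nn

class BiGRUContextualizer(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        # Bidirectional GRU over the embedded sequence, then a linear layer to map
        # the concatenated forward/backward states back to embed_dim.
        self.gru = nn.GRU(embed_dim, embed_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim)
        out, _ = self.gru(embeddings)
        return self.proj(out)

x = torch.randn(2, 7, 512)                 # toy batch of embedded tokens
print(BiGRUContextualizer(512)(x).shape)   # torch.Size([2, 7, 512])
```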
`fairseq-generate` requires the bytes (BBPE) decoder to convert the byte-level representation back into characters:
```bash
# BPE="--bpe bytes"
# BPE="--bpe characters"
BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model"
# BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model"

fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
    --tokenizer moses --moses-target-lang en ${BPE}
```
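When inspecting outputs yourself, keep in mind that a predicted byte sequence is not guaranteed to form valid UTF-8, so the final decode has to be tolerant (the `--bpe byte_bpe` decoder above handles this conversion for you). A crude sketch, reusing the hypothetical byte-token format from the first snippet:

```python
# Sketch: turn byte-level output pieces back into text.
hypothesis_pieces = ["<6c>", "<65>", "<20>", "<63>", "<68>", "<61>", "<74>"]
raw = bytes(int(p[1:-1], 16) for p in hypothesis_pieces)
# Decode leniently: the model may emit byte sequences that are not valid UTF-8.
print(raw.decode("utf-8", errors="ignore"))  # le chat
```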
When using `fairseq-interactive`, the bytes (BBPE) encoder/decoder is required to tokenize input data and detokenize model predictions:
```bash
fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
    --moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
```
Results on the IWSLT 2017 Fr-En test set:

| Vocabulary | Model | BLEU |
|---|---|---|
| Joint BPE 16k (Kudo, 2018) | 512d LSTM 2+2 | 33.81 |
| Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
| Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
| Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
| Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
| Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
| Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
| Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |
Citation:

```bibtex
@misc{wang2019neural,
    title={Neural Machine Translation with Byte-Level Subwords},
    author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
    year={2019},
    eprint={1909.03341},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
Contact: Changhan Wang ([email protected]), Kyunghyun Cho ([email protected]), Jiatao Gu ([email protected])