Subword Encoding in Lattice LSTM for Chinese Word Segmentation

jiesutd/SubwordEncoding-CWS

Subword encoding for Word Segmentation using Lattice LSTM.

Models and results are described in our paper, Subword Encoding in Lattice LSTM for Chinese Word Segmentation.

Requirements:

Python: 2.7
PyTorch: 0.3.0

Input format:

CoNLL format (BMES tag scheme preferred), with one character and its label per line. Sentences are separated by a blank line.

中 B-SEG
国 E-SEG
最 B-SEG
大 E-SEG
氨 B-SEG
纶 M-SEG
丝 E-SEG
生 B-SEG
产 E-SEG
基 B-SEG
地 E-SEG
在 S-SEG
连 B-SEG
云 M-SEG
港 E-SEG
建 B-SEG
成 E-SEG

新 B-SEG
华 M-SEG
社 E-SEG
北 B-SEG
京 E-SEG
十 B-SEG
二 M-SEG
月 E-SEG
二 B-SEG
十 M-SEG
六 M-SEG
日 E-SEG
电 S-SEG
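
As an illustration of the format above, here is a minimal sketch in Python 3 of reading such a file and recovering the segmented words from the BMES tags. The function names `read_conll` and `bmes_to_words` are hypothetical and are not part of this repository, and the sketch assumes each non-blank line holds exactly one character and one tag separated by whitespace.

```python
def read_conll(lines):
    """Group 'char tag' lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        char, tag = line.split()
        current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences


def bmes_to_words(sentence):
    """Join characters into words: a word ends at an E-* or S-* tag."""
    words, buf = [], ""
    for char, tag in sentence:
        buf += char
        if tag.startswith("E") or tag.startswith("S"):
            words.append(buf)
            buf = ""
    if buf:
        # Keep any trailing characters from an unterminated B/M run.
        words.append(buf)
    return words


# A short excerpt in the same format as the example above.
sample = """中 B-SEG
国 E-SEG
在 S-SEG

新 B-SEG
华 M-SEG
社 E-SEG
""".splitlines()

sentences = read_conll(sample)
segmented = [bmes_to_words(s) for s in sentences]
print(segmented)
```

To read a real file, pass an opened file object (for example `io.open(path, encoding="utf-8")`) to `read_conll` in place of `sample`.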

Pretrained Embeddings:

The pretrained character and word embeddings are the same as the embeddings used in the baseline of RichWordSegmentor.

How to run the code?

  1. Download the character embeddings, character bigram embeddings, and BPE (or word) embeddings, and set their directories in main.py.
  2. Modify run_seg.py by adding your train/dev/test file directory.
  3. sh run_seg.py

Cite:

Cite our paper as:

@inproceedings{yang2019subword,
 title={Subword Encoding in Lattice LSTM for Chinese Word Segmentation},
 author={Yang, Jie and Zhang, Yue and Liang, Shuailong},
 booktitle={NAACL},
 year={2019}
}
