./sejong/c2d.sh error #36

Open
YeopIn opened this issue Jul 10, 2018 · 5 comments


@YeopIn

YeopIn commented Jul 10, 2018

I ran into a problem while following [training parser from Sejong treebank corpus].

  1. ./sejong/split.sh -v -v runs without errors.
    [screenshot: 1]

However, ./sejong/c2d.sh -v -v fails with an error.
[screenshot: 2]

What should I do?

@dsindex
Owner

dsindex commented Jul 10, 2018

@YeopIn

  1. You need to place the constituent parse tree corpus (sejong_treebank.txt.v1) in the sejong directory.
$ ls
align.py  align_r.py  c2d.py  c2d.sh  context.pbtxt_p  env.sh  eval.py  log  sejong_treebank.sample  sejong_treebank.txt.v1  split.py  split.sh  tagged_input.sample  tagger.py  wdir
$ more sejong_treebank.txt.v1
; 1993/06/08 19
(NP	(NP 1993/SN + //SP + 06/SN + //SP + 08/SN)
	(NP 19/SN))

; 엠마누엘 웅가로 /
(NP	(NP	(NP 엠마누엘/NNP)
		(NP 웅가로/NNP))
	(X //SP))

; 의상서 실내 장식품으로…
(NP_AJT	(NP_AJT 의상/NNG + 서/JKB)
	(NP_AJT	(NP 실내/NNG)
		(NP_AJT 장식품/NNG + 으로/JKB + …/SE)))

; 디자인 세계 넓혀
(VP	(NP_OBJ	(NP 디자인/NNG)
		(NP_OBJ 세계/NNG))
	(VP 넓히/VV + 어/EC))
...
  2. Run split.sh. You will then have:
$ ls wdir
sejong_treebank.txt.v1.test
sejong_treebank.txt.v1.training
sejong_treebank.txt.v1.tuning
  3. Run c2d.sh.
  • As you can see, this script generates the .v2 and .v3 files:
for SET in training tuning test; do
    ${python} ${CDIR}/c2d.py --mode=0 < ${WDIR}/sejong_treebank.txt.v1.${SET} > ${WDIR}/sejong_treebank.txt.v2.${SET} 2> ${WDIR}/sejong_treebank.txt.v2.${SET}.err
    ${python} ${CDIR}/c2d.py --mode=1 < ${WDIR}/sejong_treebank.txt.v2.${SET} > ${WDIR}/deptree.txt.v2.${SET}         2> ${WDIR}/deptree.txt.v2.${SET}.err
    [ "${SET}" == "training" ] && extend=1 || extend=0
    ${python} ${CDIR}/align.py --extend=${extend} < ${WDIR}/deptree.txt.v2.${SET} > ${WDIR}/deptree.txt.v3.${SET}
done
  • If you run into trouble, run a single step by hand, like this:
$ python c2d.py --mode=0 < wdir/sejong_treebank.txt.v1.training > wdir/sejong_treebank.txt.v2.training
  • That should show you where the problem occurs.
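
Because c2d.sh redirects the stderr of each c2d.py step to a `.err` file in wdir, one quick way to find the failing step is to list the `.err` files that are not empty. This is a minimal sketch that assumes the wdir layout shown above; the function name `nonempty_err_files` is hypothetical and not part of the repository.

```python
from pathlib import Path


def nonempty_err_files(wdir):
    """Return (file name, first line) for every non-empty *.err file in wdir."""
    results = []
    for path in sorted(Path(wdir).glob("*.err")):
        if path.stat().st_size == 0:
            continue  # this step wrote nothing to stderr
        with path.open(encoding="utf-8", errors="replace") as handle:
            first_line = handle.readline().rstrip("\n")
        results.append((path.name, first_line))
    return results
```

Calling `nonempty_err_files("wdir")` from the sejong directory returns one entry per step that printed something to stderr, which narrows down which `--mode` and which data set to rerun by hand.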

@YeopIn
Author

YeopIn commented Jul 13, 2018

I solved this problem. Thank you.

How do I train a Korean POS tagger?
Is it right to use train_dragnn.sh for Korean POS tagging, with UD_Korean (universal_dependencies-2.0-ud_treebans-v2.0tgz) as the data?
Do I need sejong_treebank.v1? As far as I know, sejong_treebank.v1 is for the Korean parser.

I downloaded version 2.0 of UD_Korean:
screenshot from 2018-07-13 14-26-36

In train_dragnn.sh, I set SRC_CORPUS_DIR = UD_Korean, TRAIN_FILE = kr-ud-train.conllu, and DEV_FILE = kr-ud-dev.conllu:
screenshot from 2018-07-13 14-32-47

But I get an out-of-range error. What should I do?
screenshot from 2018-07-13 14-37-06

@dsindex
Owner

dsindex commented Jul 13, 2018

@YeopIn

  1. Is that true for Korean pos tagging using train_dragnn.sh?

-> No, train_dragnn.sh trains a dependency parser only. It is basically the same as train_dragnn_sejong.sh.

  2. data using UD_Korean(universal_dependencies-2.0-ud_treebans-v2.0tgz)?
    Is it need sejong_treebank.v1? I knew sejong_treebank.v1 is for Korean parser ...

-> I think you need to check the *.conllu.conv files. convert.py generates the .conv files, and those files are used as the training and tuning corpora, so TRAIN_FILE and DEV_FILE should point to them:

TRAIN_FILE=${DATA_DIR}/en-ud-train.conllu.conv
DEV_FILE=${DATA_DIR}/en-ud-dev.conllu.conv
CHECKPOINT_FILE=${DATA_DIR}/checkpoint.model

function convert_corpus {
    local _corpus_dir=$1
    for corpus in $(ls ${_corpus_dir}/*.conllu); do
        ${python} ${CDIR}/convert.py < ${corpus} > ${corpus}.conv
    done
}

...
--training_corpus_path=${TRAIN_FILE} 
--tune_corpus_path=${DEV_FILE}
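
The excerpt above implies that training reads the `.conv` files, so one thing worth checking is whether every `.conllu` file in the corpus directory has a converted counterpart. The sketch below is only a sanity check under that assumption: the function name `conllu_without_conv` is hypothetical, and this thread does not confirm that a missing `.conv` file is what causes the out-of-range error.

```python
from pathlib import Path


def conllu_without_conv(corpus_dir):
    """Return names of *.conllu files that lack a non-empty .conv sibling."""
    missing = []
    for corpus in sorted(Path(corpus_dir).glob("*.conllu")):
        # convert_corpus writes its output next to the input as <name>.conv
        conv = corpus.with_name(corpus.name + ".conv")
        if not conv.is_file() or conv.stat().st_size == 0:
            missing.append(corpus.name)
    return missing
```

An empty list from `conllu_without_conv("UD_Korean")` means every corpus file has been converted; any names it returns are files for which convert.py did not produce usable output.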

@YeopIn
Author

YeopIn commented Jul 16, 2018

Thank you so much.
My final goal is to train both a Korean POS tagger and a parser on the Sejong corpus data. Is there a way to do that?

@dsindex
Owner

dsindex commented Jul 16, 2018

There was a similar discussion before:
#4 (comment)

However, I couldn't find a proper way to train a Korean POS tagger.
I think it would be worthwhile either to use another Korean POS tagger (KoNLPy) or to implement a character-based POS tagger for Korean and then reconstruct the morphemes from the inflected forms.
For example:

tagging : '하늘을 나는 새를 본다' -> '하/b-ncn 늘/i-ncn 을/b-jks 나/b-vv 는/b-etm 새/b-ncn 를/b-jko 본/b-vv 다/b-ec'
reconstruct : '하늘/ncn 을/jks 날/vv 는/etm 새/ncn 를/jko 보/vv ㄴ다/ec'

Of course, you need some extra resources to convert '본/b-vv 다/b-ec' -> '보/vv ㄴ다/ec'.
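
The two steps in the example can be sketched in Python. This is an illustration only, not code from the repository: `merge_char_tags` joins characters by their b-/i- labels, and `restore_morphs` applies a toy lookup table (`RESTORE_TABLE`, a hypothetical stand-in for the "extra resources") that contains just the two rewrites shown in this thread. A real system would need a context-aware resource, since a form such as '나는' is ambiguous.

```python
from typing import Dict, List, Tuple

# Toy lookup table holding only the two rewrites that appear in this thread.
RESTORE_TABLE: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("나/vv", "는/etm"): ("날/vv", "는/etm"),
    ("본/vv", "다/ec"): ("보/vv", "ㄴ다/ec"),
}


def merge_char_tags(tagged: str) -> List[str]:
    """Join characters into surface chunks using their b-/i- labels."""
    chunks: List[List[str]] = []
    for token in tagged.split():
        char, _, label = token.rpartition("/")
        prefix, _, tag = label.partition("-")
        if prefix == "i" and chunks and chunks[-1][1] == tag:
            chunks[-1][0] += char  # continue the current chunk
        else:
            chunks.append([char, tag])  # a b- label starts a new chunk
    return [f"{surface}/{tag}" for surface, tag in chunks]


def restore_morphs(chunks: List[str]) -> List[str]:
    """Rewrite adjacent chunk pairs that appear in RESTORE_TABLE."""
    out: List[str] = []
    i = 0
    while i < len(chunks):
        pair = tuple(chunks[i:i + 2])
        if pair in RESTORE_TABLE:
            out.extend(RESTORE_TABLE[pair])
            i += 2
        else:
            out.append(chunks[i])
            i += 1
    return out


tagged = "하/b-ncn 늘/i-ncn 을/b-jks 나/b-vv 는/b-etm 새/b-ncn 를/b-jko 본/b-vv 다/b-ec"
print(" ".join(restore_morphs(merge_char_tags(tagged))))
```

With the table above, the printed result matches the 'reconstruct' line in the example. Note that the spaces between the original words are not recoverable from the character tags alone.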
