taiwan-mandarin-corpus-annotation

An annotation of the NCCU Taiwanese Mandarin Corpus, carried out in Rezonator and based on discourse-functional principles. This is a collaborative project involving students at the University of California, Santa Barbara.

Workflow

Tokenisation

Step 1: Use 1_auto_split.R to automatically split the text into intonation units (IUs). Store the result in 1_auto_split.
Step 2: Use 2_auto_tokenise.R to automatically tokenise the text. Store the result in 2_auto_tokenised.
Step 3: Copy the file to 3_manual_tokenised and add your name to the end of the filename (e.g. NCCU-TM009-CN-FFF_Ryan.csv). Open the CSV in a program like Pulsar, then listen to the recording and correct the tokenisation. While you tokenise, also listen for errors in the transcription and correct them. Two people should do this step separately. Instructions can be found in the 'Tokenisation ...' ppt.
Step 4: Check with your partner and agree on a final tokenisation and transcription. A diff checker (https://www.diffchecker.com/) is useful for making sure you catch all the differences. Put the final tokenisation in 4_final_tokenised.
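
The comparison in Step 4 can also be done locally. The following is a minimal sketch in Python, not part of the repository's R scripts, that uses the standard-library difflib module to list the lines on which two annotators' files disagree. The function name, the annotator labels and the example lines are hypothetical, and the sketch assumes only that each file can be compared line by line; it makes no assumption about the CSV column layout.

```python
import difflib


def tokenisation_diff(lines_a, lines_b, name_a="annotator_A", name_b="annotator_B"):
    """Return unified-diff lines showing where two tokenisations disagree."""
    return list(
        difflib.unified_diff(
            lines_a,
            lines_b,
            fromfile=name_a,
            tofile=name_b,
            lineterm="",  # the input lines carry no trailing newline
            n=0,          # no context lines: show only the differences
        )
    )


# Hypothetical example data: one line per unit, tokens separated by spaces.
version_a = ["我 覺得 很 好", "對 啊"]
version_b = ["我 覺得 很好", "對 啊"]

diff_lines = tokenisation_diff(version_a, version_b)
for line in diff_lines:
    print(line)
```

For real files, the two lists could be read with `open(path, encoding="utf-8").read().splitlines()`. Lines starting with `-` come from the first annotator and lines starting with `+` from the second; an empty result means the two files are identical.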

CSV > Rez conversion

Step 5: Use 5_dft_convert.R to automatically convert the final tokenisation to DFT format. Put the result in the 5_dft_converted folder.
Step 6: Use 6_to_rez.R to convert the DFT-formatted file in 5_dft_converted into owpl (one-word-per-line) .csv format, stored in the 6_to_rez folder.
Step 7: Import the owpl .csv from the 6_to_rez folder into Rezonator using the owpl_mandarin.json schema file, and save the .rez file as e.g. NCCU-TM009-CN-FFF.rez.
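
To illustrate what "one word per line" means, here is a minimal Python sketch that expands units of space-separated tokens into one row per word. It is not the logic of 6_to_rez.R and does not reproduce the DFT format or the owpl_mandarin.json schema, neither of which is described here. The column names `unit_id`, `token_id` and `word` and the example units are assumptions made for illustration only.

```python
import csv
import io


def to_owpl_rows(units):
    """Expand units (strings of space-separated tokens) into one row per word."""
    rows = []
    for unit_id, unit in enumerate(units, start=1):
        for token_id, word in enumerate(unit.split(), start=1):
            rows.append({"unit_id": unit_id, "token_id": token_id, "word": word})
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["unit_id", "token_id", "word"],  # hypothetical column names
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example data: two units, tokens separated by spaces.
units = ["我 覺得 很好", "對 啊"]
owpl_rows = to_owpl_rows(units)
print(rows_to_csv(owpl_rows))
```

Each word keeps the number of the unit it came from, so the unit grouping can be recovered from the flat table.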

IU segmentation

Step 8: Copy the .rez file to 8_manual_split and add your name, e.g. NCCU-TM009-CN-FFF-Ryan.rez. Then listen to the recording, correct the IU splits and add endnotes ('punctuation'). Instructions can be found in the 'IU splitting ...' ppt.
Step 9: A third person should take the .rez files done by the two people in Step 8, give tie-breaking votes on the differences, and then save the file in the 9_final_split folder.
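
As a rough illustration of what the third person in Step 9 has to adjudicate, the Python sketch below finds the positions where two annotators placed IU boundaries differently. It does not read .rez files; it assumes, hypothetically, that each annotator's split has already been extracted as a list of units, each a list of tokens, and that both annotators worked on the same token sequence. The function names and example data are illustrative.

```python
def boundary_positions(units):
    """Return the set of token counts after which an IU boundary falls."""
    positions = set()
    count = 0
    for unit in units:
        count += len(unit)
        positions.add(count)
    return positions


def boundary_disagreements(units_a, units_b):
    """Return the sorted positions where only one annotator placed a boundary."""
    return sorted(boundary_positions(units_a) ^ boundary_positions(units_b))


# Hypothetical example data: the same five tokens split in two different ways.
split_a = [["我", "覺得"], ["很好"], ["對", "啊"]]
split_b = [["我", "覺得", "很好"], ["對", "啊"]]

disputed = boundary_disagreements(split_a, split_b)
print(disputed)
```

Each reported position is the number of tokens preceding a disputed boundary, so the third annotator only needs to listen to those points in the recording.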

Postprocessing

Step 10: Use 10_final_csv.R to convert the .rez file into two new formats: a .txt file in the 10_final_csv_unit folder and a .csv file in the 10_final_csv_owpl folder.
Step 11: Import the .csv file into Rezonator using the import schema 11_owpl_end.json, and save the result in the 11_final_rez folder.

Folders and files

0_raw
10_final_csv
10_final_csv_owpl
10_final_csv_unit
11_final_rez
12_textgrid
13_pred_rez
1_auto_split
2_auto_tokenised
3_manual_tokenised
4_final_tokenised
5_dft_converted
6_rez_input
7_rez_file
8_manual_split
9_final_split
old_5_manual_split
old_6_final_split
old_7_dft_converted
old_8_rez_input
old_9_rez_file
old_workflow_1
temp
.DS_Store
.gitignore
0_installations.R
10_final_csv.R
10_rez.R
11_owpl_end.json
1_auto_split.R
20220129_iaa.RData
2_auto_tokenise.R
5_dft_convert.R
6_sim_score.R
6_to_rez.R
Adding dependency trees to the Taiwan Mandarin corpus.pptx
Annotating entity types in the Taiwan Mandarin corpus.pptx
Annotating predicates for the Mandarin conversation project.pptx
IU splitting for the Taiwan Mandarin Corpus.pptx
Predicting the choice of referential expressions in mandarin.pptx
README.md
Some random file.txt.txt
Taiwan Mandarin accessibility project – final step.pptx
Tokenisation for the Taiwan Mandarin Corpus.pptx
Working with the Taiwan Mandarin Corpus GitHub repository.pptx
accessibility_coding.csv
check_text.R
instructions_for_accessibility_anno.docx
inter-annotated.R
owpl_mandarin.json
responsibilities.csv
rezR_import.R
tag_step1.json
tag_step2.json
~$Predicting the choice of referential expressions in mandarin.pptx
~$structions_for_accessibility_anno.docx

kayaulai/taiwan-mandarin-corpus-annotation
