The Vuk'uzenzele South African Multilingual Corpus

GitHub: https://github.com/dsfsi/vukuzenzele-nlp/

Zenodo: https://doi.org/10.5281/zenodo.7598539

arXiv Preprint:

Give Feedback 📑: DSFSI Resource Feedback Form

About dataset

The dataset contains editions of the South African government magazine Vuk'uzenzele, created by the Government Communication and Information System (GCIS). The data was scraped from PDFs, which are stored in the data/raw folder. The PDFs were obtained from the Vuk'uzenzele website.

The datasets contain government magazine editions in 11 languages, namely:

Language	Code	Language	Code
English	(eng)	Sepedi	(sep)
Afrikaans	(afr)	Setswana	(tsn)
isiNdebele	(nbl)	Siswati	(ssw)
isiXhosa	(xho)	Tshivenda	(ven)
isiZulu	(zul)	Xitsonga	(tso)
Sesotho	(nso)
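
As a small sketch of how these codes can be used, the snippet below copies the names and codes verbatim from the table above and enumerates every unordered language pair. With 11 languages this gives 55 pairs, which matches the number of rows in the aligned-pairs table, where the src_lang code always sorts alphabetically before the trg_lang code. The names `LANGUAGES` and `language_pairs` are illustrative and are not taken from the repo.

```python
from itertools import combinations

# Codes and names copied as-is from the language table in this README.
LANGUAGES = {
    "eng": "English",
    "afr": "Afrikaans",
    "nbl": "isiNdebele",
    "xho": "isiXhosa",
    "zul": "isiZulu",
    "nso": "Sesotho",
    "sep": "Sepedi",
    "tsn": "Setswana",
    "ssw": "Siswati",
    "ven": "Tshivenda",
    "tso": "Xitsonga",
}


def language_pairs(codes):
    """Return every unordered pair of codes as (src, trg) with src < trg."""
    return list(combinations(sorted(codes), 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))
```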

Number of Aligned Pairs with Cosine Similarity Score >= 0.65

src_lang	trg_lang	num_aligned_pairs
ssw	xho	2202
ssw	zul	2183
xho	zul	2102
nso	xho	2081
nso	tso	2071
ssw	tso	2034
nso	ssw	2021
tsn	tso	2020
tsn	xho	2009
tso	xho	2009
nso	tsn	2002
ssw	tsn	1987
tso	zul	1957
nso	zul	1953
tsn	zul	1933
eng	zul	1923
eng	tso	1923
eng	nso	1867
eng	ssw	1821
afr	xho	1816
eng	xho	1801
nbl	sep	1795
sep	ven	1794
afr	ssw	1783
eng	tsn	1772
afr	zul	1769
afr	nso	1746
nbl	ven	1699
afr	eng	1661
afr	tsn	1631
afr	tso	1617
afr	sep	551
afr	ven	498
afr	nbl	491
nso	sep	410
nso	ven	352
sep	tso	326
sep	tsn	319
tso	ven	307
sep	ssw	305
sep	xho	300
ssw	ven	290
tsn	ven	285
nbl	ssw	282
nbl	nso	266
ven	xho	260
eng	sep	258
nbl	xho	250
sep	zul	249
nbl	tso	238
eng	ven	234
nbl	tsn	230
nbl	zul	226
ven	zul	225
eng	nbl	184
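
The counts above come from keeping only sentence pairs whose cosine similarity is at least 0.65. The following is a minimal sketch of that filtering step, assuming each sentence has already been turned into an embedding vector. The toy 3-dimensional vectors and the function names are hypothetical, and this is not the repo's actual alignment code.

```python
import math

THRESHOLD = 0.65  # cut-off quoted in the table heading


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def keep_aligned(candidates, threshold=THRESHOLD):
    """Keep (src, trg, score) triples whose score meets the threshold.

    `candidates` holds (src_sentence, src_vec, trg_sentence, trg_vec) tuples.
    """
    kept = []
    for src, src_vec, trg, trg_vec in candidates:
        score = cosine_similarity(src_vec, trg_vec)
        if score >= threshold:
            kept.append((src, trg, score))
    return kept


# Toy vectors stand in for real sentence embeddings.
toy = [
    ("sentence A", [1.0, 0.0, 0.0], "sentence A'", [1.0, 0.0, 0.0]),
    ("sentence B", [1.0, 0.0, 0.0], "sentence B'", [0.0, 1.0, 0.0]),
]
aligned = keep_aligned(toy)
```

Only the first toy pair survives here, because identical vectors have similarity 1 and orthogonal vectors have similarity 0.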

The dataset is available in several forms in the repo. It is generally split by edition, e.g. 2020-01-ed1.
The data directory is broken down as follows:

./data
├── external                # Data external to this repo
├── interim                 # Intermediate data, apparently a staging step before processed (purpose not documented)
├── processed               # The data from scraping the raw PDFs
├── raw                     # The raw PDFs of the Vuk'uzenzele magazine
├── sentence_align_output   # The output (CSV) of the sentence alignment with LASER language encoders
└── simple_align_output     # The output (CSV) of a simple one-to-one sentence alignment

The dataset is split by edition in the data/processed folder.
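
A minimal sketch for working with edition names follows. It assumes that every edition name has the same shape as the single example given (2020-01-ed1) and that this encodes year, month, and edition number. It also assumes that entries directly under data/processed are named by edition. Neither assumption is confirmed by this README, and `parse_edition` and `list_editions` are hypothetical helpers.

```python
import re
from pathlib import Path

# Assumed pattern: <year>-<month>-ed<number>, e.g. 2020-01-ed1.
EDITION_RE = re.compile(r"^(\d{4})-(\d{2})-ed(\d+)$")


def parse_edition(name):
    """Return (year, month, edition_number), or None if the name does not match."""
    match = EDITION_RE.match(name)
    if match is None:
        return None
    year, month, number = (int(group) for group in match.groups())
    return year, month, number


def list_editions(processed_dir):
    """Return edition-named entries under processed_dir in chronological order."""
    root = Path(processed_dir)
    if not root.is_dir():
        return []
    found = []
    for entry in root.iterdir():
        parsed = parse_edition(entry.name)
        if parsed is not None:
            found.append((parsed, entry))
    # Sort on the parsed tuple only, so paths are never compared.
    return [path for _, path in sorted(found, key=lambda item: item[0])]


print(parse_edition("2020-01-ed1"))
```

Calling `list_editions("data/processed")` from the repo root would then return the matching entries, oldest first, or an empty list if the folder is missing.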

Disclaimer

This dataset contains machine-readable data extracted from PDF documents, from https://www.vukuzenzele.gov.za/, provided by the Government Communication and Information System (GCIS). While efforts were made to ensure the accuracy and completeness of this data, there may be errors or discrepancies between the original publications and this dataset. No warranties, guarantees or representations are given in relation to the information contained in the dataset. The members of the Data Science for Societal Impact Research Group bear no responsibility and/or liability for any such errors or discrepancies in this dataset. The Government Communication and Information System (GCIS) bears no responsibility and/or liability for any such errors or discrepancies in this dataset. It is recommended that users verify all information contained herein before making decisions based upon this information.

Authors

Vukosi Marivate - @vukosi
Andani Madodonga
Daniel Njini
Richard Lastrucci
Isheanesu Dzingirai
Jenalea Rajab

Citation

Paper

Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

@inproceedings{lastrucci-etal-2023-preparing,
  title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
  author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
  booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.rail-1.3",
  pages = "18--25"
}

Dataset

Vukosi Marivate, Andani Madodonga, Daniel Njini, Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab. The Vuk'uzenzele South African Multilingual Corpus, 2023

@dataset{marivate_vukosi_2023_7598540,
  author = {Marivate, Vukosi and Njini, Daniel and Madodonga, Andani and Lastrucci, Richard and Dzingirai, Isheanesu and Rajab, Jenalea},
  title = {The Vuk'uzenzele South African Multilingual Corpus},
  month = feb,
  year = 2023,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.7598539},
  url = {https://doi.org/10.5281/zenodo.7598539}
}

Licences

Licence for Data - CC BY 4.0
Licence for Code - MIT License
