Faroese corpus taken from Wikipedia dumps.
This repository will contain a corpus of the Faroese language taken from the content dump of the Faroese Wikipedia.
This project uses pipenv (see "How to install pipenv"). Install the project dependencies with:
pipenv install
In order to read 7zip archives (used by Wikia's XML dumps), you need to install libarchive:
sudo apt install libarchive-dev
Run pipenv shell before running the scripts.
Shows the longest words taken from the dump:
1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
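A ranking like the one above could be produced along the following lines. This is only a minimal sketch, not the repository's actual script: the function name `longest_words`, the tokenization rule (a word is a run of Unicode letters), the lowercasing, and the alphabetical tie-break are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: a "word" is a maximal run of Unicode letters
# (no digits, underscores, or punctuation).
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(lines: Iterable[str], top: int = 10) -> List[Tuple[str, int]]:
    """Return the `top` longest distinct words as (word, length) pairs."""
    words = set()
    for line in lines:
        # Lowercase so that "Orð" and "orð" count as the same word.
        words.update(w.lower() for w in WORD_RE.findall(line))
    # Longest first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(words, key=lambda w: (-len(w), w))
    return [(w, len(w)) for w in ranked[:top]]


if __name__ == "__main__":
    sample = [
        "Krabbameinsgranskingarstovnurin er stovnur.",
        "Føroyar eru oyggjar.",
    ]
    for rank, (word, length) in enumerate(longest_words(sample, top=3), start=1):
        print(f"{rank} {word} - {length}")
```

Collecting words into a set keeps memory proportional to the vocabulary rather than the full text, which matters when the lines are streamed from a large dump.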