Two sequence tagging systems (one rule-based and one neural) specifically made for tagging terms in the swimming domain, but which can be easily adapted for other purposes.

Dodo-s95/the-term-crawler

The Term Crawler

This repository provides the code for two sequence tagging systems that tag the terms in a text corpus using the BIO tag set. Both can easily be adapted to another notation or to other entity types.

As provided, the code retrieves and tags a corpus about swimming, then trains and tests models on that BIO-tagged data.
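For readers new to the BIO tag set, the sketch below shows how a tag sequence maps back to terms: `B` opens a term, `I` continues it, and `O` marks tokens outside any term. It assumes bare `B`/`I`/`O` labels; the labels used in this repository may carry a suffix. The function name `bio_to_spans` and the example sentence are hypothetical and not part of the repository.

```python
from typing import List


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the terms marked by a BIO tag sequence (illustrative helper)."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new term begins; close any term that was still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the term that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open term) closes any open term.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented example sentence, assumed to be already tokenized.
tokens = ["She", "won", "the", "200", "m", "individual", "medley", "."]
tags = ["O", "O", "O", "O", "O", "B", "I", "O"]
print(bio_to_spans(tokens, tags))
```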

Installation

After cloning this repository, you should install the required packages:

pip install -r requirements.txt

The project was tested with Python 3.8.0.

Running the code

Dataset Extraction

The dataset can be extracted by running scraper.py, and converted into usable txt files by running converter.py.

Terms Extraction

To extract the terms from the corpus, we used TermSuite. We cleaned the first extraction with term_cleaner.ipynb and manually validated the result, adding and removing some entries along the way. This is the final list of terms.

Rule-Based Tagging

The rule-based tagger tags the corpus you extracted previously using the BIO notation. It also splits the dataset into training, validation and test sets. This silver-tagged data is then used to train the neural sequence tagger.
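The idea of rule-based BIO tagging followed by a dataset split can be sketched as below. This is a minimal illustration, not the repository's implementation: it assumes a greedy longest-match lookup against a lowercased term list, bare `B`/`I`/`O` labels, and an 80/10/10 split. The names `bio_tag` and `split_dataset`, the split ratios, and the sample terms are all hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def bio_tag(tokens: Sequence[str], terms: Set[Tuple[str, ...]]) -> List[str]:
    """Tag tokens with B/I/O using greedy longest-match against a term list."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(t) for t in terms), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in terms:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def split_dataset(sentences, train=0.8, dev=0.1, seed=13):
    """Shuffle with a fixed seed and split into train/validation/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


# Hypothetical term list; the real one comes from the term extraction step.
terms = {("freestyle",), ("freestyle", "relay"), ("flip", "turn")}
tokens = ["The", "freestyle", "relay", "ended", "with", "a", "flip", "turn"]
print(bio_tag(tokens, terms))
```

Matching the longest candidate first keeps "freestyle relay" as one term instead of tagging "freestyle" alone.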

Neural Tagging

The neural term tagger is based on flair. To train the model, you only need a BIO-tagged corpus; the code can then be run as it is. The best model is saved automatically in a folder of your choosing.
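Sequence taggers such as flair usually read a BIO-tagged corpus in column format: one token and its tag per line, with a blank line between sentences (flair's `ColumnCorpus` reads files of this kind). The sketch below serializes tagged sentences into that layout. The helper name `to_column_format`, the space separator, and the sample sentences are assumptions made for illustration; the exact file layout this repository expects may differ.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def to_column_format(sentences: Iterable[Sentence]) -> str:
    """Serialize tagged sentences as 'token tag' lines, blank line between sentences."""
    blocks = [
        "\n".join(f"{token} {tag}" for token, tag in sentence)
        for sentence in sentences
    ]
    return "\n\n".join(blocks) + "\n"


# Invented sample data, not taken from the swimming corpus.
sample = [
    [("Flip", "B"), ("turn", "I"), ("drills", "O")],
    [("Breathe", "O"), ("bilaterally", "O")],
]
text = to_column_format(sample)
print(text)
```

The resulting string can be written to the train, validation and test files that the training code reads.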

Inspection of Results

To inspect the results and compare the outputs of the two taggers, we wrapped the rule-based tagger and created testing.ipynb. In that notebook, you only need to input a new text to get the output of both taggers, ready to be compared.
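Once both taggers have produced tags for the same tokens, a comparison can be as simple as listing the positions where they disagree. The sketch below is one possible way to do that and is not taken from testing.ipynb; the function name `disagreements` is hypothetical, and the tag sequences are invented for illustration rather than real tagger output.

```python
from typing import List, Sequence, Tuple


def disagreements(
    tokens: Sequence[str],
    rule_tags: Sequence[str],
    neural_tags: Sequence[str],
) -> List[Tuple[int, str, str, str]]:
    """Return (index, token, rule_tag, neural_tag) wherever the two taggers differ."""
    if not (len(tokens) == len(rule_tags) == len(neural_tags)):
        raise ValueError("tokens and both tag sequences must have the same length")
    return [
        (i, token, rule, neural)
        for i, (token, rule, neural) in enumerate(zip(tokens, rule_tags, neural_tags))
        if rule != neural
    ]


# Invented tags for illustration only.
tokens = ["Her", "open", "water", "split", "improved"]
rule_tags = ["O", "B", "I", "O", "O"]
neural_tags = ["O", "B", "I", "B", "O"]

for index, token, rule, neural in disagreements(tokens, rule_tags, neural_tags):
    print(f"{index}\t{token}\trule={rule}\tneural={neural}")
```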
