The Term Crawler

This repository provides the code for two sequence tagging systems, built to tag the terms present in a textual corpus using the BIO tag set. It can easily be adapted to use another notation or to tag other kinds of entities.

The code is written to retrieve and tag a corpus, and to train and test models with the BIO notation, in the specific domain of swimming.

Installation

After cloning this repository, you should install the required packages:

pip install -r requirements.txt

The project was tested with Python 3.8.0.

Running the code

Dataset Extraction

The dataset can be extracted by running scraper.py, and converted into usable txt files by running converter.py.

Terms Extraction

To extract the terms from the corpus, we used TermSuite. We cleaned the first extraction with term_cleaner.ipynb and then validated the result manually, adding and removing some entries. This is the final list of terms.

Rule-Based Tagging

The rule-based tagger tags the corpus you extracted previously using the BIO notation. It also splits the dataset into train, validation and test sets. This silver-tagged data is then used to train the neural sequence tagger.
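
As an illustration of the idea, here is a minimal sketch of dictionary-based BIO tagging followed by a train/validation/test split. It is not the repository's tagger: the function names (bio_tag, split_dataset), the greedy longest-match strategy, the plain B/I/O labels and the 80/10/10 split are all assumptions made for the example.

```python
import random


def bio_tag(tokens, terms):
    """Tag tokens with B/I/O by greedy longest match against a term list.

    Hypothetical helper: `terms` is a list of strings such as "front crawl".
    Matching is case-insensitive and whitespace-tokenized (an assumption).
    """
    term_set = {tuple(term.lower().split()) for term in terms}
    max_len = max((len(term) for term in term_set), default=0)
    lowered = [token.lower() for token in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest span first so "front crawl" wins over "crawl".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in term_set:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def split_dataset(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle sentences and split them into train, validation and test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


tokens = ["The", "front", "crawl", "kick", "is", "fast"]
print(list(zip(tokens, bio_tag(tokens, ["front crawl", "kick", "crawl"]))))
```

Trying the longest span first keeps multi-word terms together instead of tagging only their last word.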

Neural Tagging

The neural term-tagger is based on Flair. To train the model, you only need a BIO-tagged corpus; the code can be run as is. The best model is saved automatically in a folder of your choosing.
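
Flair can read tagged corpora from column files, with one token and its tag per line and a blank line between sentences. The sketch below shows one way such a file could be written from tagged sentences. The function name write_column_file, the space separator and the file name are assumptions for the example; check them against the format the training code in this repository expects.

```python
def write_column_file(sentences, path):
    """Write (tokens, tags) pairs as a two-column file, one token per line.

    Hypothetical helper: sentences are separated by a blank line, which is
    the usual CoNLL-style layout for column corpora.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for tokens, tags in sentences:
            if len(tokens) != len(tags):
                raise ValueError("each sentence needs one tag per token")
            for token, tag in zip(tokens, tags):
                handle.write(f"{token} {tag}\n")
            handle.write("\n")


def read_column_file(path):
    """Read the file back into a list of (tokens, tags) pairs."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # A blank line closes the current sentence.
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split(" ")
            tokens.append(token)
            tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences
```

Writing the train, validation and test sets to three such files gives the layout that a column-based corpus reader typically expects.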

Inspection of Results

To inspect the results and compare the outputs of the two taggers, we wrapped the rule-based tagger and created testing.ipynb. In that notebook, you only need to input a new text to get the output of both taggers, ready to be compared.
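
One simple way to compare two tag sequences for the same text is to list the tokens on which they disagree. The sketch below is only an illustration and does not reproduce what testing.ipynb does; the function name compare_taggers and the example tags are made up.

```python
def compare_taggers(tokens, rule_tags, neural_tags):
    """Return the token-level agreement rate and the list of disagreements.

    Hypothetical helper: each disagreement is a tuple
    (index, token, rule_tag, neural_tag).
    """
    if not (len(tokens) == len(rule_tags) == len(neural_tags)):
        raise ValueError("tokens and both tag sequences must have equal length")
    disagreements = [
        (i, token, rule_tag, neural_tag)
        for i, (token, rule_tag, neural_tag)
        in enumerate(zip(tokens, rule_tags, neural_tags))
        if rule_tag != neural_tag
    ]
    agreement = 1 - len(disagreements) / len(tokens) if tokens else 1.0
    return agreement, disagreements


tokens = ["The", "flip", "turn", "helps"]
rule = ["O", "B", "I", "O"]      # made-up rule-based output
neural = ["O", "B", "I", "B"]    # made-up neural output
agreement, diffs = compare_taggers(tokens, rule, neural)
print(agreement, diffs)
```

Because the rule-based tags are silver data rather than gold annotations, a disagreement marks a token worth inspecting and not necessarily an error of the neural tagger.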
