# 🪐 Weasel Project: Comparing SpanCat and NER using a corpus of biomedical literature (GENIA)

This project demonstrates how spaCy's Span Categorization (SpanCat) and Named-Entity Recognition (NER) perform on different types of entities. Here, we used a dataset of biomedical literature containing both overlapping and non-overlapping spans.

## 📋 project.yml

The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

### ⏯ Commands

The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `download` | Download model-related assets |
| `convert` | Convert IOB file into the spaCy format |
| `create-ner` | Split corpus into separate NER datasets for each GENIA label |
| `train-ner` | Train an NER model for each label |
| `train-spancat` | Train a SpanCat model |
| `evaluate-ner` | Evaluate all NER models |
| `assemble-ner` | Assemble all NER models into a single pipeline |
| `evaluate-spancat` | Evaluate the SpanCat model |

### ⏭ Workflows

The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `download` → `convert` → `create-ner` → `train-ner` → `assemble-ner` → `train-spancat` → `evaluate-ner` → `evaluate-spancat` |
| `spancat` | `download` → `convert` → `train-spancat` → `evaluate-spancat` |
| `ner` | `download` → `convert` → `create-ner` → `train-ner` → `evaluate-ner` → `assemble-ner` |

### 🗂 Assets

The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/train.iob2` | URL | The training dataset for GENIA in IOB format. |
| `assets/dev.iob2` | URL | The evaluation dataset for GENIA in IOB format. |
| `assets/test.iob2` | URL | The test dataset for GENIA in IOB format. |

## About the dataset

GENIA is a dataset containing biomedical literature from 1,999 Medline abstracts. It contains a collection of overlapping and hierarchical spans. To make parsing easier, we will be using the pre-constructed IOB tags from the Boundary Aware Nested NER paper repository. Running `spacy debug data` gives us the following span characteristics (SD = Span Distinctiveness, BD = Boundary Distinctiveness):

| Span Type | Avg. Span Length | SD | BD |
| --- | --- | --- | --- |
| DNA | 2.81 | 1.45 | 0.80 |
| protein | 2.19 | 1.19 | 0.57 |
| cell_type | 2.09 | 2.35 | 1.05 |
| cell_line | 3.29 | 1.91 | 1.04 |
| RNA | 2.73 | 2.68 | 1.28 |

The table above shows the average span length for each span type, together with its distinctiveness characteristics. The latter are computed as the KL-divergence of the span's token distribution with respect to the token distribution of the overall corpus. The higher the value, the more distinct the span's tokens are from the rest of the corpus.
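To make this concrete, here is a minimal sketch of the idea behind these scores. `spacy debug data` computes them internally, so this is an illustration rather than its exact implementation; the choice of which tokens to pass in (span-internal tokens for SD, boundary-adjacent tokens for BD) is left to the caller.

```python
import math
from collections import Counter

def distinctiveness(span_tokens, corpus_tokens):
    """KL(P_span || P_corpus): how far the span token distribution
    diverges from the overall corpus token distribution."""
    p, q = Counter(span_tokens), Counter(corpus_tokens)
    p_total, q_total = sum(p.values()), sum(q.values())
    return sum(
        (count / p_total) * math.log((count / p_total) / (q[token] / q_total))
        for token, count in p.items()
        if q[token]  # this sketch skips tokens unseen in the corpus
    )
```

For SD, `span_tokens` would be all tokens inside annotated spans; for BD, the tokens adjacent to span boundaries.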

These characteristics give us a good intuition for how well the SpanCat model can identify the correct spans. In the case of GENIA, the entities tend to be technical terms, which makes them more distinct and easier to classify. Again, we measure distinctiveness not only within the entities themselves (SD) but also at their boundaries (BD).


## Experiments

Given what we know from the dataset, we will create the following pipelines:

| Pipeline | Description | Workflow Name |
| --- | --- | --- |
| SpanCat | Pure span categorization for all entity types. Serves as an illustration of suggester functions and as a comparison to NER. | `spancat` |
| NER | Named-entity recognition for all entity types. Serves as an illustration to compare against the pure SpanCat implementation. | `ner` |

### SpanCat Results

Below are the results for SpanCat. Overall, span categorization appears to be biased towards precision: a large proportion of the suggested spans belong to the correct class. We can always tune this trade-off: make the suggester more lenient and we get many irrelevant hits, make it stricter and we might miss out on some true positives (see the configuration sketch after the table).

| Span Type | Precision | Recall | F-score |
| --- | --- | --- | --- |
| DNA | 0.70 | 0.36 | 0.47 |
| protein | 0.77 | 0.52 | 0.62 |
| cell_line | 0.77 | 0.30 | 0.44 |
| cell_type | 0.76 | 0.62 | 0.68 |
| RNA | 0.77 | 0.25 | 0.38 |
| **Overall** | 0.76 | 0.47 | 0.58 |
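The precision/recall trade-off is largely driven by the suggester. As a sketch, this is how an n-gram suggester can be attached to a SpanCat component; whether `sizes=[1, 2, 3]` matches this project's configuration is an assumption, and wider sizes generally buy recall at the cost of precision and speed.

```python
import spacy

nlp = spacy.blank("en")
# The n-gram suggester proposes every 1-, 2- and 3-token window as a
# candidate span; the SpanCat classifier then scores each candidate.
spancat = nlp.add_pipe(
    "spancat",
    config={"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]}},
)
```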

### NER Results

NER outperforms SpanCat for all entity types, but note that this setup entails training a separate model for each of the five entity types. This works for a small number of entity types, but can become computationally heavy if you have many.

| Span Type | Precision | Recall | F-score |
| --- | --- | --- | --- |
| DNA | 0.74 | 0.63 | 0.68 |
| protein | 0.76 | 0.72 | 0.74 |
| cell_line | 0.74 | 0.57 | 0.64 |
| cell_type | 0.78 | 0.72 | 0.75 |
| RNA | 0.86 | 0.65 | 0.74 |

Since we end up with five separate NER models, we can afterwards combine their predictions on a single `Doc` by transferring each model's `doc.ents` into `doc.spans`. Since all pipelines share the same tokenizer, we don't need to worry about misalignment and the like.
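A minimal sketch of that merging step is below. The model paths are hypothetical placeholders for the per-label pipelines produced by `train-ner`; in this project, the `assemble-ner` command handles this step.

```python
import spacy

# Hypothetical output paths: one single-label NER pipeline per GENIA type.
MODEL_PATHS = [
    "training/ner_DNA/model-best",
    "training/ner_protein/model-best",
    "training/ner_cell_type/model-best",
    "training/ner_cell_line/model-best",
    "training/ner_RNA/model-best",
]

def merge_ner_predictions(text: str):
    """Run each single-label NER model and collect all doc.ents into
    one doc.spans group, where overlapping spans are allowed."""
    pipelines = [spacy.load(path) for path in MODEL_PATHS]
    doc = pipelines[0](text)
    spans = list(doc.ents)
    for nlp in pipelines[1:]:
        for ent in nlp(text).ents:
            # All pipelines share the same tokenizer, so character
            # offsets map cleanly back onto `doc`.
            span = doc.char_span(ent.start_char, ent.end_char, label=ent.label_)
            if span is not None:
                spans.append(span)
    doc.spans["sc"] = spans
    return doc
```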