This project demonstrates how spaCy's Span Categorization (SpanCat) and Named-Entity Recognition (NER) perform on different types of entities. Here, we use a dataset of biomedical literature (GENIA) containing both overlapping and non-overlapping spans.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
Command | Description |
---|---|
`download` | Download model-related assets |
`convert` | Convert IOB file into the spaCy format |
`create-ner` | Split corpus into separate NER datasets for each GENIA label |
`train-ner` | Train an NER model for each label |
`train-spancat` | Train a SpanCat model |
`evaluate-ner` | Evaluate all NER models |
`assemble-ner` | Assemble all NER models into a single pipeline |
`evaluate-spancat` | Evaluate SpanCat model |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
Workflow | Steps |
---|---|
`all` | `download` → `convert` → `create-ner` → `train-ner` → `assemble-ner` → `train-spancat` → `evaluate-ner` → `evaluate-spancat` |
`spancat` | `download` → `convert` → `train-spancat` → `evaluate-spancat` |
`ner` | `download` → `convert` → `create-ner` → `train-ner` → `evaluate-ner` → `assemble-ner` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
File | Source | Description |
---|---|---|
`assets/train.iob2` | URL | The training dataset for GENIA in IOB format. |
`assets/dev.iob2` | URL | The evaluation dataset for GENIA in IOB format. |
`assets/test.iob2` | URL | The test dataset for GENIA in IOB format. |
GENIA is a dataset containing biomedical literature from 1,999 Medline abstracts. It contains a collection of overlapping and hierarchical spans. To make parsing easier, we will be using the pre-constructed IOB tags from the Boundary Aware Nested NER paper repository. Running `spacy debug data` gives us the following span characteristics (SD = Span Distinctiveness, BD = Boundary Distinctiveness):
Span Type | Avg. Span Length (tokens) | SD | BD |
---|---|---|---|
DNA | 2.81 | 1.45 | 0.80 |
protein | 2.19 | 1.19 | 0.57 |
cell_type | 2.09 | 2.35 | 1.05 |
cell_line | 3.29 | 1.91 | 1.04 |
RNA | 2.73 | 2.68 | 1.28 |
The table above shows the average span length for each span type, together with its distinctiveness characteristics. The latter are computed as the KL-divergence of the span's token distribution with respect to the overall corpus's: the higher the value, the more distinct the span's tokens are compared to the rest of the corpus.
These characteristics give us a good intuition as to how well the SpanCat model will identify the correct spans. In the case of GENIA, the entities tend to be technical terms, which makes them more distinct and easier to classify. Note that we measure distinctiveness not only within the entities themselves (SD), but also at their boundaries (BD).
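As a rough sketch of how such a distinctiveness score can be computed (a toy reimplementation of the idea, not spaCy's internal code; the token lists are made up for illustration):

```python
import math
from collections import Counter

def kl_divergence(p_counts: Counter, q_counts: Counter) -> float:
    """KL(P || Q), summed over the tokens observed under P."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    return sum(
        (c / p_total) * math.log((c / p_total) / (q_counts[tok] / q_total))
        for tok, c in p_counts.items()
        if q_counts[tok]  # span tokens always occur in the corpus as well
    )

# Toy example: tokens inside "protein" spans vs. the whole corpus
corpus_tokens = "the NF-kappa B pathway regulates the IL-2 gene in the cell".split()
span_tokens = ["NF-kappa", "B", "IL-2"]  # made-up span contents

sd = kl_divergence(Counter(span_tokens), Counter(corpus_tokens))
print(f"span distinctiveness: {sd:.2f}")
```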
Here's some example data:
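A GENIA-style sentence with nested annotations might look like this when loaded into spaCy (the sentence, offsets, and labels are illustrative, not taken verbatim from the corpus):

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
# Pre-tokenized so the token offsets below are unambiguous
words = ["IL-2", "gene", "expression", "requires", "reactive",
         "oxygen", "production", "by", "5-lipoxygenase", "."]
doc = Doc(nlp.vocab, words=words)

# Overlapping spans: "IL-2" (protein) is nested inside "IL-2 gene" (DNA).
# doc.ents could not hold both, since entities must not overlap —
# this is exactly the case doc.spans is designed for.
doc.spans["sc"] = [
    Span(doc, 0, 1, label="protein"),
    Span(doc, 0, 2, label="DNA"),
    Span(doc, 8, 9, label="protein"),
]
for span in doc.spans["sc"]:
    print(span.text, "->", span.label_)
```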
Given what we know from the dataset, we will create the following pipelines:
Pipeline | Description | Workflow Name |
---|---|---|
SpanCat | Pure Span Categorization for all types of entities. Serves as an illustration of suggester functions and as a comparison point for NER. | `spancat` |
NER | Named-Entity Recognition for all types of entities. Serves as an illustration to compare against the pure SpanCat implementation. | `ner` |
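SpanCat first proposes candidate spans via a suggester function and then classifies them. A minimal sketch using spaCy's built-in n-gram suggester follows; the `sizes` value is an assumption chosen because GENIA spans average two to three tokens, and the project's actual settings live in its training config:

```python
import spacy

nlp = spacy.blank("en")
spancat = nlp.add_pipe(
    "spancat",
    config={
        "spans_key": "sc",
        # Propose all 1- to 3-token windows as candidate spans
        "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
    },
)

doc = nlp("NF-kappa B activation requires IL-2 gene expression")
candidates = spancat.suggester([doc])  # Ragged array of (start, end) token offsets
print(candidates.lengths)  # candidates per doc; larger sizes → more recall, lower precision
```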
Below are the results for SpanCat. Overall, span categorization seems to be biased towards precision, meaning that a large proportion of the suggested spans belong to the correct class. We can tune how precise we want it to be: make the suggester more lenient and we might get a lot of irrelevant hits; make it stricter and we might miss out on some true positives.
Label | Precision | Recall | F-score |
---|---|---|---|
DNA | 0.70 | 0.36 | 0.47 |
protein | 0.77 | 0.52 | 0.62 |
cell_line | 0.77 | 0.30 | 0.44 |
cell_type | 0.76 | 0.62 | 0.68 |
RNA | 0.77 | 0.25 | 0.38 |
Overall | 0.76 | 0.47 | 0.58 |
NER compares favorably with SpanCat for all entity types, but note that this approach entails training a separate model for each of the five (5) entity types. This might work if you have a small number of entity types, but it can become computationally heavy if you have many.
Label | Precision | Recall | F-score |
---|---|---|---|
DNA | 0.74 | 0.63 | 0.68 |
protein | 0.76 | 0.72 | 0.74 |
cell_line | 0.74 | 0.57 | 0.64 |
cell_type | 0.78 | 0.72 | 0.75 |
RNA | 0.86 | 0.65 | 0.74 |
Since we have five (5) separate NER models, we can then combine their outputs into a single `Doc` by transferring each model's `doc.ents` into `doc.spans`. Because the tokens are the same across models, we don't need to worry about misalignments and the like.
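A minimal sketch of that transfer, assuming the per-label models were saved under paths like `training/ner-DNA/model-best` (hypothetical; the real paths are defined in `project.yml`):

```python
import spacy

LABELS = ["DNA", "protein", "cell_line", "cell_type", "RNA"]

def merge_ner_outputs(text: str):
    """Run each per-label NER model and collect its entities into doc.spans."""
    base_doc = None
    spans = []
    for label in LABELS:
        nlp = spacy.load(f"training/ner-{label}/model-best")  # hypothetical path
        doc = nlp(text)
        if base_doc is None:
            base_doc = doc  # every pipeline shares the tokenizer, so any doc can be the base
        # Re-anchor each entity onto the base doc via character offsets;
        # identical tokenization guarantees the offsets line up.
        for ent in doc.ents:
            spans.append(base_doc.char_span(ent.start_char, ent.end_char, label=ent.label_))
    base_doc.spans["sc"] = spans
    return base_doc
```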