
Improving Korean NER model accuracy by including particles in entity annotations #13705

Open

peacefulluo opened this issue Dec 4, 2024 · 1 comment

peacefulluo commented Dec 4, 2024

Body:

When training a Named Entity Recognition (NER) model for Korean using spaCy, I've encountered an interesting phenomenon that significantly affects the model's accuracy. I'm using the updated NER pipeline to create the configuration file.

Issue:
The model's accuracy improves substantially when I include particles (조사, josa) along with the annotated nouns in the training data.

Example:

Lower accuracy: <PERSON>김철수</PERSON>가 왔다.
Higher accuracy: <PERSON>김철수가</PERSON> 왔다.

In the second example, the particle "가" is included within the entity annotation.
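
For concreteness, the two variants could be written out as training data along these lines (a rough sketch, not the actual preprocessing script; the file name is illustrative and ko_core_news_lg is assumed to be installed):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("ko_core_news_lg")
db = DocBin()

text = "김철수가 왔다."
doc = nlp.make_doc(text)

# Variant 1 (noun only):       characters 0-3 -> "김철수"
# Variant 2 (noun + particle): characters 0-4 -> "김철수가"
span = doc.char_span(0, 4, label="PERSON")  # use (0, 3) for variant 1
if span is None:
    # char_span() returns None when the offsets do not line up with token
    # boundaries, in which case the annotation is effectively lost.
    print("annotation does not align with token boundaries")
else:
    doc.ents = [span]
    db.add(doc)

db.to_disk("train.spacy")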

Questions:

Is this a known behavior for Korean NER models in spaCy?
Are there any best practices or recommendations for handling particles in Korean NER annotations?
How might this affect the model's performance on texts where particles may vary or be omitted?
Are there any potential drawbacks to this approach that I should be aware of?

I would appreciate any insights, explanations, or suggestions on how to best approach this issue while maintaining the integrity and flexibility of the NER model for Korean language processing.

How to reproduce the behaviour

config.cfg:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "ko"
pipeline = ["tok2vec","tagger","morphologizer","parser","lemmatizer","senter","attribute_ruler","ner"]
disabled = ["senter"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.attribute_ruler]
source = "ko_core_news_lg"

[components.lemmatizer]
source = "ko_core_news_lg"

[components.morphologizer]
source = "ko_core_news_lg"

[components.ner]
source = "ko_core_news_lg"
replace_listeners = ["model.tok2vec"]

[components.parser]
source = "ko_core_news_lg"

[components.senter]
source = "ko_core_news_lg"

[components.tagger]
source = "ko_core_news_lg"

[components.tok2vec]
source = "ko_core_news_lg"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 1000
frozen_components = ["tok2vec","tagger","morphologizer","parser","lemmatizer","senter","attribute_ruler"]
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v3"
console_output = true
output_file = "/media//trainer_log.jsonl"

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.1
pos_acc = 0.1
morph_acc = 0.09
morph_per_feat = null
dep_uas = 0.0
dep_las = 0.29
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.04
lemma_acc = 0.1
ents_f = 0.29
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
speed = 0.0

[pretraining]

[initialize]
vocab_data = null
vectors = null
init_tok2vec = ${paths.init_tok2vec}
after_init = null
lookups = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "ko_core_news_lg"
vocab = "ko_core_news_lg"

[initialize.components]

[initialize.components.lemmatizer]

[initialize.components.lemmatizer.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/trainable_lemmatizer.json"
require = false

[initialize.components.morphologizer]

[initialize.components.morphologizer.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/morphologizer.json"
require = false

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
require = false

[initialize.components.parser]

[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/parser.json"
require = false

[initialize.components.tagger]

[initialize.components.tagger.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false

[initialize.tokenizer]
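
For reference, a training run with this config is started with the standard spaCy CLI; the output and corpus paths below are placeholders:

python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy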

Your Environment

  • Operating System: Ubuntu 22.04
  • Python Version Used: 3.10
  • spaCy Version Used: 3.8
  • Environment Information:

Info about spaCy

  • spaCy version: 3.8.2
  • Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • Pipelines: ko_core_news_lg (3.8.0)
adrianeboyd (Contributor) commented:

You can only train using entities that align with token boundaries. Here are more details about the tokenizer used in the provided Korean pipelines: #10624. It's not a great fit for NER for Korean, but the provided pipeline needed all components to use the same tokenization.
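
For illustration, this alignment constraint can be checked directly with Doc.char_span, which returns None when a character span does not match token boundaries. The sketch below assumes ko_core_news_lg is installed and that its tokenizer keeps the whitespace-delimited word "김철수가" as a single token:

import spacy

nlp = spacy.load("ko_core_news_lg")
doc = nlp("김철수가 왔다.")
print([t.text for t in doc])

# Noun only ("김철수", chars 0-3): ends mid-token, so this returns None
# with the default alignment_mode="strict".
print(doc.char_span(0, 3, label="PERSON"))

# Noun + particle ("김철수가", chars 0-4): matches a token boundary.
print(doc.char_span(0, 4, label="PERSON"))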

If you're only doing NER, you can consider using the default mecab-ko-based Korean tokenizer (or use your own custom tokenizer): https://spacy.io/usage/models#korean
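
For illustration, a blank "ko" pipeline picks up the default mecab-ko-based tokenizer (this assumes mecab-ko, mecab-ko-dic, and natto-py are installed as described at the link above). Morpheme-level tokenization typically splits particles such as "가" into their own tokens, so noun-only entity spans can align with token boundaries:

import spacy

nlp = spacy.blank("ko")            # default Korean tokenizer (mecab-ko)
doc = nlp("김철수가 왔다.")
print([t.text for t in doc])       # particles are typically separate tokens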
