
Improving Korean NER model accuracy by including particles in entity annotations #13705

Open

peacefulluo opened this issue Dec 4, 2024 · 1 comment

peacefulluo commented Dec 4, 2024

Body:

When training a Named Entity Recognition (NER) model for Korean using spaCy, I've encountered an interesting phenomenon that significantly affects the model's accuracy. I'm using the updated NER pipeline to create the configuration file.

Issue:
The model's accuracy improves substantially when I include particles (조사, josa) along with the annotated nouns in the training data.

Example:

Lower accuracy: <PERSON>김철수</PERSON>가 왔다.
Higher accuracy: <PERSON>김철수가</PERSON> 왔다.

In the second example, the particle "가" is included within the entity annotation.
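
For concreteness, the two variants could be written out as training data along these lines (a rough sketch, not the actual preprocessing script; the file name is illustrative and ko_core_news_lg is assumed to be installed):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("ko_core_news_lg")
db = DocBin()

text = "김철수가 왔다."
doc = nlp.make_doc(text)

# Variant 1 (noun only):       characters 0-3 -> "김철수"
# Variant 2 (noun + particle): characters 0-4 -> "김철수가"
span = doc.char_span(0, 4, label="PERSON")  # use (0, 3) for variant 1
if span is None:
    # char_span() returns None when the offsets do not line up with token
    # boundaries, in which case the annotation is effectively lost.
    print("annotation does not align with token boundaries")
else:
    doc.ents = [span]
    db.add(doc)

db.to_disk("train.spacy")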

Questions:

Is this a known behavior for Korean NER models in spaCy?
Are there any best practices or recommendations for handling particles in Korean NER annotations?
How might this affect the model's performance on texts where particles may vary or be omitted?
Are there any potential drawbacks to this approach that I should be aware of?

I would appreciate any insights, explanations, or suggestions on how to best approach this issue while maintaining the integrity and flexibility of the NER model for Korean language processing.

How to reproduce the behaviour

config.cfg:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "ko"
pipeline = ["tok2vec","tagger","morphologizer","parser","lemmatizer","senter","attribute_ruler","ner"]
disabled = ["senter"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.attribute_ruler]
source = "ko_core_news_lg"

[components.lemmatizer]
source = "ko_core_news_lg"

[components.morphologizer]
source = "ko_core_news_lg"

[components.ner]
source = "ko_core_news_lg"
replace_listeners = ["model.tok2vec"]

[components.parser]
source = "ko_core_news_lg"

[components.senter]
source = "ko_core_news_lg"

[components.tagger]
source = "ko_core_news_lg"

[components.tok2vec]
source = "ko_core_news_lg"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 1000
frozen_components = ["tok2vec","tagger","morphologizer","parser","lemmatizer","senter","attribute_ruler"]
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v3"
console_output = true
output_file = "/media//trainer_log.jsonl"

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.1
pos_acc = 0.1
morph_acc = 0.09
morph_per_feat = null
dep_uas = 0.0
dep_las = 0.29
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.04
lemma_acc = 0.1
ents_f = 0.29
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
speed = 0.0

[pretraining]

[initialize]
vocab_data = null
vectors = null
init_tok2vec = ${paths.init_tok2vec}
after_init = null
lookups = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "ko_core_news_lg"
vocab = "ko_core_news_lg"

[initialize.components]

[initialize.components.lemmatizer]

[initialize.components.lemmatizer.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/trainable_lemmatizer.json"
require = false

[initialize.components.morphologizer]

[initialize.components.morphologizer.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/morphologizer.json"
require = false

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
require = false

[initialize.components.parser]

[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/parser.json"
require = false

[initialize.components.tagger]

[initialize.components.tagger.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false

[initialize.tokenizer]
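
For reference, a training run with this config is started with the standard spaCy CLI; the output and corpus paths below are placeholders:

python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy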

Your Environment

  • Operating System: Ubuntu 22.04
  • Python Version Used: 3.10
  • spaCy Version Used: 3.8
  • Environment Information:

Info about spaCy

  • spaCy version: 3.8.2
  • Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • Pipelines: ko_core_news_lg (3.8.0)
adrianeboyd (Contributor) commented:

You can only train using entities that align with token boundaries. Here are more details about the tokenizer used in the provided Korean pipelines: #10624. It's not a great fit for NER for Korean, but the provided pipeline needed all components to use the same tokenization.
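
For illustration, this alignment constraint can be checked directly with Doc.char_span, which returns None when a character span does not match token boundaries. The sketch below assumes ko_core_news_lg is installed and that its tokenizer keeps the whitespace-delimited word "김철수가" as a single token:

import spacy

nlp = spacy.load("ko_core_news_lg")
doc = nlp("김철수가 왔다.")
print([t.text for t in doc])

# Noun only ("김철수", chars 0-3): ends mid-token, so this returns None
# with the default alignment_mode="strict".
print(doc.char_span(0, 3, label="PERSON"))

# Noun + particle ("김철수가", chars 0-4): matches a token boundary.
print(doc.char_span(0, 4, label="PERSON"))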

If you're only doing NER, you can consider using the default mecab-ko-based Korean tokenizer (or use your own custom tokenizer): https://spacy.io/usage/models#korean
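
For illustration, a blank "ko" pipeline picks up the default mecab-ko-based tokenizer (this assumes mecab-ko, mecab-ko-dic, and natto-py are installed as described at the link above). Morpheme-level tokenization typically splits particles such as "가" into their own tokens, so noun-only entity spans can align with token boundaries:

import spacy

nlp = spacy.blank("ko")            # default Korean tokenizer (mecab-ko)
doc = nlp("김철수가 왔다.")
print([t.text for t in doc])       # particles are typically separate tokens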
