When training a Named Entity Recognition (NER) model for Korean using spaCy, I've encountered an interesting phenomenon that significantly affects the model's accuracy. I'm using the updated NER pipeline to create the configuration file.
Issue:
The model's accuracy improves substantially when I include particles (조사, josa) along with the annotated nouns in the training data.
Example:
Lower accuracy: <PERSON>김철수</PERSON>가 왔다.
Higher accuracy: <PERSON>김철수가</PERSON> 왔다.
In the second example, the particle "가" is included within the entity annotation.
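For reference, here is how I build the training data for the second variant (a sketch; the character offsets and the output path are specific to this example sentence):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("ko_core_news_lg")
db = DocBin()

doc = nlp.make_doc("김철수가 왔다.")
# PERSON over 김철수가, i.e. characters 0-4, with the particle included.
# With the particle excluded (characters 0-3), char_span returns None for me.
span = doc.char_span(0, 4, label="PERSON")
doc.ents = [span]
db.add(doc)
db.to_disk("./train.spacy")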
Questions:
Is this a known behavior for Korean NER models in spaCy?
Are there any best practices or recommendations for handling particles in Korean NER annotations?
How might this affect the model's performance on texts where particles may vary or be omitted?
Are there any potential drawbacks to this approach that I should be aware of?
I would appreciate any insights, explanations, or suggestions on how to best approach this issue while maintaining the integrity and flexibility of the NER model for Korean language processing.
How to reproduce the behaviour
config.cfg:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "ko"
pipeline = ["tok2vec","tagger","morphologizer","parser","lemmatizer","senter","attribute_ruler","ner"]
disabled = ["senter"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}
[components]
[components.attribute_ruler]
source = "ko_core_news_lg"
[components.lemmatizer]
source = "ko_core_news_lg"
[components.morphologizer]
source = "ko_core_news_lg"
[components.ner]
source = "ko_core_news_lg"
replace_listeners = ["model.tok2vec"]
[components.parser]
source = "ko_core_news_lg"
[components.senter]
source = "ko_core_news_lg"
[components.tagger]
source = "ko_core_news_lg"
[components.tok2vec]
source = "ko_core_news_lg"
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 1000
frozen_components = ["tok2vec","tagger","morphologizer","parser","lemmatizer","senter","attribute_ruler"]
before_to_disk = null
annotating_components = []
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v3"
console_output = true
output_file = "/media//trainer_log.jsonl"
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
tag_acc = 0.1
pos_acc = 0.1
morph_acc = 0.09
morph_per_feat = null
dep_uas = 0.0
dep_las = 0.29
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.04
lemma_acc = 0.1
ents_f = 0.29
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
speed = 0.0
[pretraining]
[initialize]
vocab_data = null
vectors = null
init_tok2vec = ${paths.init_tok2vec}
after_init = null
lookups = null
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "ko_core_news_lg"
vocab = "ko_core_news_lg"
[initialize.components]
[initialize.components.lemmatizer]
[initialize.components.lemmatizer.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/trainable_lemmatizer.json"
require = false
[initialize.components.morphologizer]
[initialize.components.morphologizer.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/morphologizer.json"
require = false
[initialize.components.ner]
[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
require = false
[initialize.components.parser]
[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/parser.json"
require = false
[initialize.components.tagger]
[initialize.components.tagger.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false
[initialize.tokenizer]
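I train with the standard CLI; a minimal sketch (the output directory and corpus paths are placeholders, and the corpora are assumed to already be in .spacy format):

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy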
You can only train on entities that align with token boundaries. There are more details about the tokenizer used in the provided Korean pipelines in #10624: it's not a great fit for NER for Korean, but the provided pipelines needed all components to use the same tokenization.
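This is easy to check; a minimal sketch, assuming the ko_core_news_lg pipeline is installed (the exact results depend on how its tokenizer segments this sentence):

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.load("ko_core_news_lg")
doc = nlp("김철수가 왔다.")

# char_span returns None when the character offsets don't match token
# boundaries.
print(doc.char_span(0, 3, label="PERSON"))  # 김철수 without the particle
print(doc.char_span(0, 4, label="PERSON"))  # 김철수가 with the particle

# Misaligned entities come out as "-" in BILUO tags and contribute nothing
# during NER training.
print(offsets_to_biluo_tags(doc, [(0, 3, "PERSON")]))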
If you're only doing NER, you can consider using the default mecab-ko-based Korean tokenizer (or use your own custom tokenizer): https://spacy.io/usage/models#korean
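For example, a sketch of the default tokenizer's behavior (this assumes mecab-ko, mecab-ko-dic, and natto-py are installed, which the default Korean tokenizer requires):

import spacy

# spacy.blank("ko") uses the default mecab-ko-based Korean tokenizer,
# which segments at the morpheme level and splits particles off.
nlp = spacy.blank("ko")
print([t.text for t in nlp("김철수가 왔다.")])

# With morpheme-level tokens, a PERSON annotation over just 김철수 can
# align with token boundaries, leaving the particle 가 outside the entity.
# In a training config, the equivalent setting (registry name taken from
# spaCy v3's Korean language module) would be:
#
#   [nlp.tokenizer]
#   @tokenizers = "spacy.ko.KoreanTokenizer"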