Optimal language tokenization strategy for multilingual NER with transformers #13699
didmar asked this question in Help: Other Questions
I'm developing a multilingual NER model using spaCy/Prodigy and need advice on spaCy's language tokenization.
Our use case includes OCR-generated text (e.g., with missing spaces) and annotated spans such as "Allemagne" in "l'Allemagne", "123" in "$123", etc.
We are currently using spaCy's xx language, but it does not handle these constraints well. For example, the model will detect a MONEY span covering "456" in "$ 456", but the whole "$123" in "$123" (we do want to exclude the currency symbol, but that is not always possible, depending on the tokenization). This could be fixed through post-processing, but it would be much better to have the model learn consistent boundaries in the first place!
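To make the alignment issue concrete, here is a minimal check (a sketch with made-up example text, not our actual data) showing that whether a gold span like "123" in "$123" can be represented at all depends on where the tokenizer puts its boundaries; `doc.char_span` returns `None` when the character offsets don't land on token boundaries:

```python
import spacy

# Minimal repro of the alignment problem
nlp = spacy.blank("xx")                      # multi-language tokenizer, as in our pipeline
doc = nlp("Le prix est $123")                # made-up example text
print([t.text for t in doc])                 # inspect how "$123" gets tokenized

# Gold annotation: only the digits, not the currency symbol
start = doc.text.index("123")
span = doc.char_span(start, start + 3, label="MONEY")
print(span)  # None whenever the character offsets don't land on token boundaries
```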
I'm considering a few approaches (rough sketches of each follow the list):
Custom tokenizer with manual rules
✓ Seems to align with spaCy's design philosophy
✗ Labor-intensive
✗ May fail on unforeseen cases (e.g. new examples that we add later on)
Transformer-aligned language tokenizer
✓ Already matches training token granularity, which seems to fit all our constraints
✗ Locks us into a specific transformer's tokenization scheme (a problem if we ever change the base model)
Character-level language tokenizer
✓ Maximum flexibility for span boundaries
✓ Future-proof (no assumptions made)
✗ Open question: can spaCy map the spans onto the transformer tokens?
Whitespace padding pre-processing
✓ Simple implementation
✗ Adds a processing step at inference, plus mapping the predicted spans back to the original text
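For the first option, the manual rules might look roughly like this (the currency regexes are placeholders for illustration, not something I've validated against our OCR data): extend the xx tokenizer's prefix and infix patterns so currency symbols get split off even when OCR has dropped the surrounding spaces.

```python
import spacy
from spacy.util import compile_infix_regex, compile_prefix_regex

nlp = spacy.blank("xx")

# Split "$", "€", etc. off the front of a token ("$123" -> "$", "123") ...
prefixes = list(nlp.Defaults.prefixes) + [r"[$€£¥]"]
# ... and also when the symbol shows up mid-token, e.g. "US$123" or OCR run-ons.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])[$€£¥]|[$€£¥](?=[0-9])"]

nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("Total: $123 / US$456")])
```

The downside is exactly the one listed above: every new OCR quirk means another rule.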
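For the second option, this is roughly what I have in mind: a custom tokenizer whose tokens mirror the transformer's word pieces, built from the fast tokenizer's offset mapping. This is a sketch under assumptions (model name xlm-roberta-base, at most a single space between consecutive pieces), not a definitive implementation, and it would still need registration/serialization hooks to be usable in a training config.

```python
import spacy
from spacy.tokens import Doc
from transformers import AutoTokenizer   # requires the transformers package

class HFAlignedTokenizer:
    """spaCy tokens that mirror a Hugging Face fast tokenizer's word pieces."""

    def __init__(self, vocab, model_name="xlm-roberta-base"):   # model name is an assumption
        self.vocab = vocab
        self.hf = AutoTokenizer.from_pretrained(model_name)

    def __call__(self, text):
        enc = self.hf(text, return_offsets_mapping=True, add_special_tokens=False)
        # Keep only pieces that cover at least one character of the original text
        offsets = [(s, e) for s, e in enc["offset_mapping"] if e > s]
        words, spaces = [], []
        for i, (start, end) in enumerate(offsets):
            words.append(text[start:end])
            next_start = offsets[i + 1][0] if i + 1 < len(offsets) else len(text)
            # Caveat: assumes consecutive pieces are separated by "" or a single
            # space; anything else breaks doc.text == text and the char offsets.
            spaces.append(text[end:next_start] == " ")
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("xx")
nlp.tokenizer = HFAlignedTokenizer(nlp.vocab)
print([t.text for t in nlp("$123 en Allemagne")])
```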
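For the third option, the tokenizer itself is trivial: every character becomes a token, so any character span is representable. What I can't judge is the part flagged above, i.e. whether the alignment between these tokens and the transformer's word pieces behaves well in training. A sketch:

```python
import spacy
from spacy.tokens import Doc

class CharTokenizer:
    """One spaCy token per character, so any character span aligns exactly."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = list(text)
        # Spaces become their own tokens, so the original text is preserved
        # exactly and no token carries a trailing space.
        return Doc(self.vocab, words=words, spaces=[False] * len(words))

nlp = spacy.blank("xx")
nlp.tokenizer = CharTokenizer(nlp.vocab)
doc = nlp("$123")
print([t.text for t in doc])   # ['$', '1', '2', '3']
print(doc.char_span(1, 4))     # "123" now lines up with token boundaries
```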
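For the fourth option, the padding itself is easy; the real work is the offset map needed to project predicted spans back onto the original text. A toy version (the symbol set and the mapping convention for inserted characters are just illustrative):

```python
def pad_currency(text):
    """Insert a space between a currency symbol and the digits glued to it,
    and record, for every position in the padded text, the corresponding
    position in the original text."""
    out, mapping = [], []
    for i, ch in enumerate(text):
        out.append(ch)
        mapping.append(i)
        if ch in "$€£¥" and i + 1 < len(text) and text[i + 1].isdigit():
            out.append(" ")      # inserted character: point it back at the symbol
            mapping.append(i)
    return "".join(out), mapping

padded, mapping = pad_currency("Le total est $123.")
print(padded)                    # "Le total est $ 123."

# Project a span predicted on the padded text back onto the original text
start = padded.index("123")
end = start + 3
orig_start, orig_end = mapping[start], mapping[end - 1] + 1
print(orig_start, orig_end)      # character offsets of "123" in the original
```

This would have to run both at training and at inference time, which is the extra step listed above.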
Has anyone faced similar challenges or can suggest alternative approaches that better align with spaCy's architecture?