Optimal language tokenization strategy for multilingual NER with transformers #13699
didmar asked this question in Help: Other Questions
I'm developing a multilingual NER model using spaCy/Prodigy and need advice on spaCy's language tokenization.
Our use case includes OCR-generated text (e.g., with missing spaces) and annotated spans such as "Allemagne" in "l'Allemagne", "123" in "$123", etc.
We are currently using spaCy's xx language, but it does not handle these constraints well. For example, the model will detect a MONEY span covering "456" in "$ 456", but the whole "$123" in "$123" (we do want to exclude the currency symbol, but that is not always possible, depending on the tokenization). This could be fixed through post-processing, but it would be much better to have the model learn consistent boundaries in the first place!
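To make the alignment issue concrete, here is a minimal check (a sketch with made-up example text, not our actual data) showing that whether a gold span like "123" in "$123" can be represented at all depends on where the tokenizer puts its boundaries; `doc.char_span` returns `None` when the character offsets don't land on token boundaries:

```python
import spacy

# Minimal repro of the alignment problem
nlp = spacy.blank("xx")                      # multi-language tokenizer, as in our pipeline
doc = nlp("Le prix est $123")                # made-up example text
print([t.text for t in doc])                 # inspect how "$123" gets tokenized

# Gold annotation: only the digits, not the currency symbol
start = doc.text.index("123")
span = doc.char_span(start, start + 3, label="MONEY")
print(span)  # None whenever the character offsets don't land on token boundaries
```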
I'm considering a few approaches (rough sketches of each follow the list):
Custom tokenizer with manual rules
✓ Seems to align with spaCy's design philosophy
✗ Labor-intensive
✗ May fail on unforeseen cases (e.g. new examples that we add later on)
Transformer-aligned language tokenizer
✓ Already matches training token granularity, which seems to fit all our constraints
✗ Locks us into a specific transformer's tokenization scheme (a problem if we ever change the base model)
Character-level language tokenizer
✓ Maximum flexibility for span boundaries
✓ Future-proof (no assumptions made)
✗ Open question: can spaCy map the spans onto the transformer tokens?
Whitespace padding pre-processing
✓ Simple implementation
✗ Adds a processing step at inference, plus mapping the predicted spans back to the original text
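For the first option, the manual rules might look roughly like this (the currency regexes are placeholders for illustration, not something I've validated against our OCR data): extend the xx tokenizer's prefix and infix patterns so currency symbols get split off even when OCR has dropped the surrounding spaces.

```python
import spacy
from spacy.util import compile_infix_regex, compile_prefix_regex

nlp = spacy.blank("xx")

# Split "$", "€", etc. off the front of a token ("$123" -> "$", "123") ...
prefixes = list(nlp.Defaults.prefixes) + [r"[$€£¥]"]
# ... and also when the symbol shows up mid-token, e.g. "US$123" or OCR run-ons.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])[$€£¥]|[$€£¥](?=[0-9])"]

nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("Total: $123 / US$456")])
```

The downside is exactly the one listed above: every new OCR quirk means another rule.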
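For the second option, this is roughly what I have in mind: a custom tokenizer whose tokens mirror the transformer's word pieces, built from the fast tokenizer's offset mapping. This is a sketch under assumptions (model name xlm-roberta-base, at most a single space between consecutive pieces), not a definitive implementation, and it would still need registration/serialization hooks to be usable in a training config.

```python
import spacy
from spacy.tokens import Doc
from transformers import AutoTokenizer   # requires the transformers package

class HFAlignedTokenizer:
    """spaCy tokens that mirror a Hugging Face fast tokenizer's word pieces."""

    def __init__(self, vocab, model_name="xlm-roberta-base"):   # model name is an assumption
        self.vocab = vocab
        self.hf = AutoTokenizer.from_pretrained(model_name)

    def __call__(self, text):
        enc = self.hf(text, return_offsets_mapping=True, add_special_tokens=False)
        # Keep only pieces that cover at least one character of the original text
        offsets = [(s, e) for s, e in enc["offset_mapping"] if e > s]
        words, spaces = [], []
        for i, (start, end) in enumerate(offsets):
            words.append(text[start:end])
            next_start = offsets[i + 1][0] if i + 1 < len(offsets) else len(text)
            # Caveat: assumes consecutive pieces are separated by "" or a single
            # space; anything else breaks doc.text == text and the char offsets.
            spaces.append(text[end:next_start] == " ")
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("xx")
nlp.tokenizer = HFAlignedTokenizer(nlp.vocab)
print([t.text for t in nlp("$123 en Allemagne")])
```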
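For the third option, the tokenizer itself is trivial: every character becomes a token, so any character span is representable. What I can't judge is the part flagged above, i.e. whether the alignment between these tokens and the transformer's word pieces behaves well in training. A sketch:

```python
import spacy
from spacy.tokens import Doc

class CharTokenizer:
    """One spaCy token per character, so any character span aligns exactly."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = list(text)
        # Spaces become their own tokens, so the original text is preserved
        # exactly and no token carries a trailing space.
        return Doc(self.vocab, words=words, spaces=[False] * len(words))

nlp = spacy.blank("xx")
nlp.tokenizer = CharTokenizer(nlp.vocab)
doc = nlp("$123")
print([t.text for t in doc])   # ['$', '1', '2', '3']
print(doc.char_span(1, 4))     # "123" now lines up with token boundaries
```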
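For the fourth option, the padding itself is easy; the real work is the offset map needed to project predicted spans back onto the original text. A toy version (the symbol set and the mapping convention for inserted characters are just illustrative):

```python
def pad_currency(text):
    """Insert a space between a currency symbol and the digits glued to it,
    and record, for every position in the padded text, the corresponding
    position in the original text."""
    out, mapping = [], []
    for i, ch in enumerate(text):
        out.append(ch)
        mapping.append(i)
        if ch in "$€£¥" and i + 1 < len(text) and text[i + 1].isdigit():
            out.append(" ")      # inserted character: point it back at the symbol
            mapping.append(i)
    return "".join(out), mapping

padded, mapping = pad_currency("Le total est $123.")
print(padded)                    # "Le total est $ 123."

# Project a span predicted on the padded text back onto the original text
start = padded.index("123")
end = start + 3
orig_start, orig_end = mapping[start], mapping[end - 1] + 1
print(orig_start, orig_end)      # character offsets of "123" in the original
```

This would have to run both at training and at inference time, which is the extra step listed above.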
Has anyone faced similar challenges or can suggest alternative approaches that better align with spaCy's architecture?