
Offset misalignment in NER using the Stanza tokenizer for French #32

Open
vitojph opened this issue Apr 15, 2020 · 5 comments

@vitojph

vitojph commented Apr 15, 2020

Hi everyone,

I just found a problem when trying to analyze a French sentence. When I run the following code:

import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="fr", verbose=False)
stanzanlp = StanzaLanguage(snlp)

text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
doc = stanzanlp(text)

I get this warning:

/home/victor/miniconda3/envs/nlp/lib/python3.7/site-packages/ipykernel_launcher.py:4: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages', 'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin', 'sur', 'RTL.']
Entities: [('Bruno Le Maire', 'PER', 42, 56), ('RTL.', 'ORG', 71, 75)]

Analyzing the same text with the default French model in spaCy, I get almost the same tokens; note how the final full stop is split off as its own token:

import spacy

spacynlp = spacy.load("fr_core_news_sm")  # the default French model
doc = spacynlp(text)

for token in doc:
    print(token.text, token.idx)
    
for ent in doc.ents:
    print(ent.text, ent.label_)

This prints:
C' 0
est 2
l' 6
un 8
des 11
grands 15
messages 22
passés 31
par 38
Bruno 42
Le 48
Maire 51
, 56
ce 58
matin 61
sur 67
RTL 71
. 74
Bruno Le Maire PER
RTL ORG

Is anyone having the same issues?

@adrianeboyd
Contributor

The issue is the multi-word token expansion of des to de les, which throws off the character-based entity spans. A spaCy Doc is only able to represent one layer of token segmentation (not both des and de les in the same Doc), so to prioritize the POS tags and dependency annotation, the Doc returned here modifies the original text to use the expanded tokens instead of the original words. (To be clear, this goes against spaCy's normal non-destructive tokenization principle, but it makes things simpler for the purposes of this wrapper.)

The entity annotation returned by stanza is based on character offsets in the original text, which can't be aligned with the expanded tokens, at least not without a lot of effort.
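For illustration, here is a minimal sketch (assuming the French stanza models are downloaded) of how a single token keeps its character offsets while expanding into two words:

import stanza

nlp = stanza.Pipeline(lang="fr", processors="tokenize,mwt", verbose=False)
doc = nlp("C'est l'un des grands messages passés par Bruno Le Maire.")

for token in doc.sentences[0].tokens:
    # Each *token* carries offsets into the original text, but "des"
    # expands into two *words* ("de", "les") that share those offsets.
    print(token.text, token.start_char, token.end_char,
          [word.text for word in token.words])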

We've added some more informative warnings in #27, which should be in the next release (v0.2.3, I think).

@bablf

bablf commented Dec 30, 2021

Hey, I got the same error message when working with spaCy/spacy_stanza/CoreNLP, and I found a possible solution. I'll post it here since this issue is the first result when googling the error.

The problem between stanza/CoreNLP and spaCy is the mismatch in tokenization, and it's really difficult to map the two tokenizations onto each other. The trick is to call the stanza tokenization first (via the CoreNLPClient) and extract the words and the start of each sentence (when working with documents containing several sentences).

Then you can create a spaCy Doc object and hand it to the spaCy pipeline, like this: nlp(Doc(nlp.vocab, words=words, sent_starts=sent_starts, ents=entities)) (see the sketch below).

I haven't tried this yet, but I think you can also extract the entities from the stanza/CoreNLP result and pass them to the Doc object (see above). You have to create the spans for the entities yourself, though.
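A minimal sketch of that workaround, using stanza directly for the tokenization step and assuming spaCy v3 (where a pipeline can be called on a pre-built Doc); the fr_core_news_sm model and the variable names are illustrative:

import stanza
import spacy
from spacy.tokens import Doc

snlp = stanza.Pipeline(lang="fr", processors="tokenize", verbose=False)
nlp = spacy.load("fr_core_news_sm")

text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
sdoc = snlp(text)

# Flatten stanza's tokens and mark where each sentence starts.
words, sent_starts = [], []
for sent in sdoc.sentences:
    for i, token in enumerate(sent.tokens):
        words.append(token.text)
        sent_starts.append(i == 0)

# Build a pre-tokenized Doc and run the pipeline on it; spaCy skips its
# own tokenizer when it is handed a Doc instead of a string.
doc = nlp(Doc(nlp.vocab, words=words, sent_starts=sent_starts))

for ent in doc.ents:
    print(ent.text, ent.label_)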

Edit: Alternatively, you can create rules for the spaCy tokenizer, but that would be really tedious.

@bablf

bablf commented Jan 18, 2022

My solution above works for most languages (German, English, etc.), but when using a language that spaCy does not have a vocab for, it skips named entity recognition and the other processing steps (see issue #82).

I found another workaround that seems to work just fine: use the CoreNLPClient to tokenize as described before, but this time just join the words and call the pipeline like this:

import spacy_stanza

# `lang` is the stanza language code for your text; `words` is the token
# list from the CoreNLPClient step above
nlp = spacy_stanza.load_pipeline("xx", lang=lang,
                                 processors="tokenize,pos,lemma,depparse,ner",
                                 use_gpu=True)
result = nlp(" ".join(words))
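Note that joining on single spaces produces a string that differs from the original text wherever the tokenizer split off punctuation, so character offsets in the resulting Doc refer to the joined string rather than the original; this trades offset fidelity for tokenizer alignment.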

@AlexanderPoone

> The issue is the multi-word token expansion of des to de les, which throws off the character-based entity spans. […]

Indeed: du -> de le in French, del -> de el in Spanish, and so on!

@AlexanderPoone

This needs a workaround for Arabic as well; it still occasionally fails with all of the workarounds mentioned in the issues.
