
Offset misalignment in NER using the Stanza tokenizer for French #32

Open
vitojph opened this issue Apr 15, 2020 · 5 comments

@vitojph

vitojph commented Apr 15, 2020

Hi everyone,

I just found a problem when trying to analyze a French sentence. When I run the following code:

import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="fr", verbose=False)
stanzanlp = StanzaLanguage(snlp)

text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
doc = stanzanlp(text)

I get this warning:

/home/victor/miniconda3/envs/nlp/lib/python3.7/site-packages/ipykernel_launcher.py:4: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages', 'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin', 'sur', 'RTL.']
Entities: [('Bruno Le Maire', 'PER', 42, 56), ('RTL.', 'ORG', 71, 75)]

Analyzing the same text with the default French model in spaCy, I get almost the same tokens; note how the final full stop is split off as its own token:

import spacy

spacynlp = spacy.load("fr_core_news_sm")  # the default French model
doc = spacynlp(text)

for token in doc:
    print(token.text, token.idx)
    
for ent in doc.ents:
    print(ent.text, ent.label_)

This prints:
C' 0
est 2
l' 6
un 8
des 11
grands 15
messages 22
passés 31
par 38
Bruno 42
Le 48
Maire 51
, 56
ce 58
matin 61
sur 67
RTL 71
. 74
Bruno Le Maire PER
RTL ORG

Is anyone having the same issues?

@adrianeboyd
Contributor

The issue is the multi-word token expansion of des to de les, which throws off the character-based entity spans. A spaCy Doc is only able to represent one layer of token segmentation (not both des and de les in the same Doc), so to prioritize the POS tags and dependency annotation, the Doc returned here modifies the original text to use the expanded tokens instead of the original words. (To be clear, this goes against spaCy's normal non-destructive tokenization principle, but it makes things simpler for the purposes of this wrapper.)

The entity annotation returned by stanza is based on character offsets in the original text, which can't be aligned with the expanded tokens, at least not without a lot of effort.
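For illustration, here is a minimal sketch (assuming the French stanza models are downloaded) of how a single token keeps its character offsets while expanding into two words:

import stanza

nlp = stanza.Pipeline(lang="fr", processors="tokenize,mwt", verbose=False)
doc = nlp("C'est l'un des grands messages passés par Bruno Le Maire.")

for token in doc.sentences[0].tokens:
    # Each *token* carries offsets into the original text, but "des"
    # expands into two *words* ("de", "les") that share those offsets.
    print(token.text, token.start_char, token.end_char,
          [word.text for word in token.words])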

We've added some more informative warnings in #27, which should be in the next release (v0.2.3, I think).

@bablf

bablf commented Dec 30, 2021

Hey, I got the same error message when working with spaCy/spacy_stanza/CoreNLP, and I found a possible solution. I'll post it here since this issue is the first result when googling the error.

The problem between stanza/CoreNLP and spaCy is the mismatch in tokenization, and it's really difficult to map the two tokenizations onto each other. The trick is to call the stanza tokenization first (via the CoreNLPClient) and extract the words and the start of each sentence (when working with documents containing several sentences).

Then you can create a spaCy Doc object and hand it to the spaCy pipeline, like this: nlp(Doc(nlp.vocab, words=words, sent_starts=sent_starts, ents=entities)) (see the sketch below).

I haven't tried this yet, but I think you can also extract the entities from the stanza/CoreNLP result and pass them to the Doc object (see above). You have to create the spans for the entities yourself, though.
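A minimal sketch of that workaround, using stanza directly for the tokenization step and assuming spaCy v3 (where a pipeline can be called on a pre-built Doc); the fr_core_news_sm model and the variable names are illustrative:

import stanza
import spacy
from spacy.tokens import Doc

snlp = stanza.Pipeline(lang="fr", processors="tokenize", verbose=False)
nlp = spacy.load("fr_core_news_sm")

text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
sdoc = snlp(text)

# Flatten stanza's tokens and mark where each sentence starts.
words, sent_starts = [], []
for sent in sdoc.sentences:
    for i, token in enumerate(sent.tokens):
        words.append(token.text)
        sent_starts.append(i == 0)

# Build a pre-tokenized Doc and run the pipeline on it; spaCy skips its
# own tokenizer when it is handed a Doc instead of a string.
doc = nlp(Doc(nlp.vocab, words=words, sent_starts=sent_starts))

for ent in doc.ents:
    print(ent.text, ent.label_)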

Edit: Alternatively, you can create rules for the spaCy tokenizer, but that would be really tedious.

@bablf

bablf commented Jan 18, 2022

My solution above works for most languages (German, English, etc.), but when using a language that spaCy does not have a vocab for, it skips named entity recognition and the other processing steps (see issue #82).

I found another workaround that seems to work just fine: use the CoreNLPClient to tokenize as described before, but this time just join the words and call the pipeline like this:

import spacy_stanza

# `lang` is the stanza language code for your text; `words` is the token
# list from the CoreNLPClient step above
nlp = spacy_stanza.load_pipeline("xx", lang=lang,
                                 processors="tokenize,pos,lemma,depparse,ner",
                                 use_gpu=True)
result = nlp(" ".join(words))
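Note that joining on single spaces produces a string that differs from the original text wherever the tokenizer split off punctuation, so character offsets in the resulting Doc refer to the joined string rather than the original; this trades offset fidelity for tokenizer alignment.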

@AlexanderPoone

> The issue is the multi-word token expansion of des to de les, which throws off the character-based entity spans. […]

Indeed: du -> de le in French, del -> de el in Spanish, and so on!

@AlexanderPoone

This needs a workaround for Arabic as well; it still occasionally fails with all of the workarounds mentioned in the issues.
