Offset misalignment in NER using the Stanza tokenizer for French #32
The issue is the multi-word token expansion. The entity annotation returned by stanza is based on character offsets in the original text, which can't be aligned with the expanded tokens, at least not without a lot of effort. We've added some more informative warnings in #27, which should be in the next release (v0.2.3, I think).
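For anyone unfamiliar with the expansion, here is a small stanza-only sketch of the mismatch (the French sentence is illustrative, not from this issue): a contraction such as "au" is one surface token with character offsets, but it is expanded into two syntactic words that carry no offsets of their own, so character-offset spans no longer map cleanly onto the expanded word sequence.

```python
# Illustrative only: how stanza's multi-word token (MWT) expansion produces
# words that no longer line up with character offsets in the original text.
import stanza

# stanza.download("fr")  # run once to fetch the French models
nlp = stanza.Pipeline("fr", processors="tokenize,mwt")

doc = nlp("Elle travaille au Ministère de la Culture.")
for token in doc.sentences[0].tokens:
    # token.text keeps its character offsets; the expanded words
    # (e.g. "au" -> "à" + "le") exist only at the word level.
    print(token.text, token.start_char, token.end_char,
          [word.text for word in token.words])
```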
Hey, I got the same error message when working with spaCy/spacy_stanza/CoreNLP, and I found a possible solution. I will post it here since this is the first result when googling the error.

The problem between stanza/CoreNLP and spaCy is the mismatch in tokenization; it's really difficult to map the different tokenizations onto each other. The trick is to run the stanza tokenization first (via CoreNLPClient), extract the words and the start of each sentence (when working with documents containing several sentences), and then create a spaCy Doc object from those words and give it to the spaCy pipeline, as in the sketch below.

I haven't tried this yet, but I think you can also extract the entities from the stanza/CoreNLP result and pass them to the Doc object (see above); you have to create the Spans for the entities yourself, though.

Edit: Alternatively, you can write rules for the spaCy tokenizer, but that would be really tedious.
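The code from this comment was not preserved in this copy of the thread, so the following is only a sketch of the approach described above. It assumes spaCy v3, a local CoreNLP installation with its default English models reachable through stanza's CoreNLPClient, and the en_core_web_sm spaCy model; the model choices and the example text are illustrative, not taken from the original comment.

```python
# Sketch: tokenize with CoreNLP, then build a spaCy Doc from that tokenization
# and run the remaining spaCy pipeline components on it.
import spacy
from spacy.tokens import Doc
from stanza.server import CoreNLPClient

nlp = spacy.load("en_core_web_sm")
text = "Angela Merkel visited Paris. She met Emmanuel Macron."

# 1) Let CoreNLP do the tokenization and sentence splitting.
with CoreNLPClient(annotators=["tokenize", "ssplit"], be_quiet=True) as client:
    ann = client.annotate(text)

words, sent_starts = [], []
for sentence in ann.sentence:
    for i, token in enumerate(sentence.token):
        words.append(token.word)
        sent_starts.append(i == 0)  # mark the first token of each sentence

# 2) Build a Doc from the external tokenization and pass it through the pipeline.
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
for _, component in nlp.pipeline:
    doc = component(doc)

print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the Doc is created from CoreNLP's words, there is no second tokenization that could drift out of alignment with the CoreNLP output.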
My solution above works for most languages (German, English, etc.), but for a language that spaCy does not have a vocab for, it essentially refuses to run named entity recognition and the other processing steps (see issue #82). I found another workaround that seems to work just fine: use CoreNLPClient to tokenize as described before, but this time just join the words and then call the pipeline, like this:
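The snippet referred to here was not preserved either, so this is only a guess at the shape of the workaround. It assumes the current spacy_stanza API and Arabic as the language without a spaCy vocab; the pipeline choice, the CoreNLP properties, and the placeholder text are illustrative assumptions.

```python
# Guess at the "join the words" workaround: tokenize with CoreNLP first, then
# feed the pipeline a space-joined string so its tokenization falls back onto
# the same whitespace boundaries.
import spacy_stanza
from stanza.server import CoreNLPClient

nlp = spacy_stanza.load_pipeline("ar")
text = "..."  # placeholder for the document to analyze

with CoreNLPClient(annotators=["tokenize", "ssplit"],
                   properties="arabic", be_quiet=True) as client:
    ann = client.annotate(text)

words = [token.word for sentence in ann.sentence for token in sentence.token]
doc = nlp(" ".join(words))
print([(ent.text, ent.label_) for ent in doc.ents])
```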
Indeed.
This still needs a workaround for Arabic; it occasionally fails with all of the 'workarounds' mentioned in the issues.
Hi everyone,
I just found a problem when trying to analyze a French sentence. When I run the following code:
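The snippet from the original report was not preserved in this copy of the thread; the sketch below only illustrates the kind of setup being described, a spacy_stanza pipeline backed by the French stanza models run on a sentence containing a contraction that triggers multi-word token expansion. The sentence, the use of the current spacy_stanza API, and the printed fields are illustrative assumptions.

```python
# Illustrative reconstruction only; not the reporter's original code.
import stanza
import spacy_stanza

stanza.download("fr")                    # fetch the French stanza models once
nlp = spacy_stanza.load_pipeline("fr")   # spaCy pipeline backed by stanza

# "au" is expanded by stanza into "à" + "le", which is where the character
# offsets of the entity annotations can stop lining up with the tokens.
doc = nlp("Emmanuel Macron a parlé au Parlement européen à Strasbourg.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```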
I get this error:
Analyzing the same text with the default French model in spaCy, I get almost the same tokens; take a look at the final full stop.
Is anyone having the same issues?