Multi-word token expansion issue, misaligned tokens --> failed NER (German) #70
Comments
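The background is that we originally developed spacy-stanza before NER components were added, so we focused on providing access to the morpho-syntactic annotation, which is annotated on the expanded multi-word tokens rather than on the original text tokens. Since a spacy Doc can only represent one layer of tokenization, we use the expanded multi-word tokens in the returned Doc.
We can't "simply proceed" because the code currently uses the character offsets to add the NER annotation, so if they don't align with the text anymore, it's not trivial to add the annotation to the doc. I think it should be possible to use information from the Document to align the annotations, but it would require some updates to the alignment in spacy-stanza. (If this is something you'd like to work on, PRs are welcome!)
I'm not sure there's currently a good workaround involving preprocessing. If you only need NER annotation, you could try a pipeline with only tokenize and ner, but I'm not sure whether the ner component depends on the mwt output or not. It's possible it would fail to run, it would run with degraded performance, or it would be totally fine. From a quick look at the docs and the code, I'm not sure which one it would be.
The stanza Document objects do support both layers of annotation, so for now you might consider using stanza directly? |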
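To illustrate why the entities cannot simply be attached once the offsets stop matching, here is a minimal pure-Python sketch. It is a toy model, not spacy-stanza's actual code: the helper names `spans_from_words` and `char_span_to_tokens` are hypothetical, and the token offsets are computed as if the words were joined with single spaces. The entity offsets refer to the original text, while the token offsets in the second case refer to the text after "zum" has been expanded to "zu dem".

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def spans_from_words(words: List[str]) -> List[Span]:
    """Character offsets of each word, assuming single-space separation."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word) + 1
    return spans


def char_span_to_tokens(token_spans: List[Span], ent: Span) -> Optional[Tuple[int, int]]:
    """Map an entity's character span to (first_token, end_token_exclusive).

    Returns None when the entity boundaries do not coincide with token
    boundaries, which is the situation the warning describes.
    """
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    if ent[0] in starts and ent[1] in ends and starts[ent[0]] <= ends[ent[1]]:
        return starts[ent[0]], ends[ent[1]] + 1
    return None


text = "Sie fährt zum Bodensee"
start = text.index("Bodensee")
entity = (start, start + len("Bodensee"))  # offsets in the ORIGINAL text

# Tokenization that matches the original text.
original = spans_from_words(["Sie", "fährt", "zum", "Bodensee"])
# Tokenization after multi-word token expansion ("zum" -> "zu" + "dem").
expanded = spans_from_words(["Sie", "fährt", "zu", "dem", "Bodensee"])

aligned = char_span_to_tokens(original, entity)
misaligned = char_span_to_tokens(expanded, entity)
print(aligned, misaligned)
```

With the original tokens the entity maps cleanly onto one token; with the expanded tokens every offset after "zum" is shifted, so no token boundary matches and the lookup returns None.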
Thanks a lot for the explanations! It seems this can't be addressed well at the moment.
Thanks also for the proposed solutions; I will try both.
Should I close this issue?
On 2021-05-03 10:16, Adriane Boyd wrote:
… The background is that we originally developed spacy-stanza before NER
components were added, so we focused on providing access to the
morpho-syntactic annotation, which is annotated on the expanded
multi-word tokens rather than on the original text tokens. Since a
spacy Doc can only represent one layer of tokenization, we use the
expanded multi-word tokens in the returned Doc.
We can't "simply proceed" because the code currently uses the
character offsets to add the NER annotation, so if they don't align
with the text anymore, it's not trivial to add the annotation to the
doc. I think it should be possible to use information from the
Document to align the annotations, but it would require some updates
to the alignment in spacy-stanza. (If this is something
you'd like to work on, PRs are welcome!)
I'm not sure there's currently a good workaround involving
preprocessing. If you only need NER annotation, you could try a
pipeline with only tokenize and ner, but I'm not sure whether the ner
component depends on the mwt output or not. It's possible it would
fail to run, it would run with degraded performance, or it would be
totally fine. From a quick look at the docs and the code, I'm not sure
which one it would be.
The stanza Document objects do support both layers of annotation, so
for now you might consider using stanza directly?
|
I think it's fine to leave it open. It's not going to be a high priority for us to work on right now, but since I think it should be possible to improve this part of the alignment, this will remind us in the future. |
If you still have an issue with this, or for anyone coming after me with the same issue: I was able to fix this in a "quick and dirty" way.
My fix generates only one word for each multi-word token. The text is the actual text of the token and the lemma is the conjunction of the words' lemmas. I set the type to the type of the first word in the multi-word token. This solution mainly has in mind cases such as "am, vom, ins, zum", which are always contractions of a preposition and an article. Since I consider the article to carry less information than the preposition, I "copied" the preposition's content and overwrote the "text" and "lemma" attributes. |
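The following is a minimal sketch of the idea described above, not the commenter's actual code. It works on plain dictionaries whose layout is only loosely modeled on a per-sentence token list in which a multi-word token has a two-element range as its `id` and is followed by its expanded words; that layout, the function name `collapse_mwt`, joining the lemmas with a space, and the use of `upos` as the "type" are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def collapse_mwt(sentence: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse each multi-word token and its expanded words into one word.

    The merged word copies the first expanded word (e.g. the preposition),
    then overwrites "text" with the surface token and "lemma" with the
    joined lemmas of all expanded words.
    """
    collapsed: List[Dict[str, Any]] = []
    i = 0
    while i < len(sentence):
        entry = sentence[i]
        if isinstance(entry["id"], (tuple, list)):
            # Assumed layout: a range id (first, last) followed by the words.
            first_id, last_id = entry["id"]
            n_words = last_id - first_id + 1
            words = sentence[i + 1 : i + 1 + n_words]
            merged = dict(words[0])
            merged["text"] = entry["text"]
            merged["lemma"] = " ".join(w["lemma"] for w in words)
            collapsed.append(merged)
            i += 1 + n_words
        else:
            collapsed.append(dict(entry))
            i += 1
    # Renumber so ids are contiguous again. Dependency heads, if present,
    # are NOT remapped in this sketch.
    for new_id, word in enumerate(collapsed, start=1):
        word["id"] = new_id
    return collapsed


sentence = [
    {"id": 1, "text": "Sie", "lemma": "sie", "upos": "PRON"},
    {"id": 2, "text": "fährt", "lemma": "fahren", "upos": "VERB"},
    {"id": (3, 4), "text": "zum"},
    {"id": 3, "text": "zu", "lemma": "zu", "upos": "ADP"},
    {"id": 4, "text": "dem", "lemma": "der", "upos": "DET"},
    {"id": 5, "text": "Bodensee", "lemma": "Bodensee", "upos": "PROPN"},
]

result = collapse_mwt(sentence)
for word in result:
    print(word)
```

Because the collapsed words match the surface text again, character offsets computed on the original text line up with them. The trade-off is that the article's own annotation is discarded.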
Hi,
thanks for the great project! It seems like stanza performs some preprocessing on the text, which results in misalignments and failed NER.
UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer...
Wouldn't it be a good solution to inform the user with a warning ("stanza performs extra preprocessing on the text... input: xxxx, output: yyyy, char indices may be altered") and then simply proceed? I can imagine that many users, myself included, do not really need char offset n to remain char offset n after processing.
Or is there some way to run the "stanza-custom" preprocessing before creating a doc with nlp(...)? This would also prevent the misalignments and give the user more control. Or is there some other fix that I'm not aware of?
spacy version: 3.0.6
stanza-spacy version: 1.2