
Chunking seems to not be working properly #326

Open
elinantonsson opened this issue Feb 25, 2021 · 4 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@elinantonsson

Hello and thank you for creating this tool!

I have been trying to use noun_chunks with your pipelines, but it does not seem to be working correctly. I have tried en_core_sci_sm, en_core_sci_md, and en_core_sci_lg. For example, when I input the sentence "CCR5(+) and CXCR3(+) T cells are increased in multiple sclerosis and their ligands MIP-1alpha and IP-10 are expressed in demyelinating brain lesions." I only get "CCR5(+", "CXCR3(+" and "T cells" as chunks, and I would expect more. For comparison, using spaCy's en_core_web_trf I get "CCR5(+) and CXCR3(+) T cells", "multiple sclerosis", "their ligands", "MIP-1alpha", "IP-10" and "brain lesions".

Is the chunking supposed to work in a similar way to spaCy's pipelines, or have I misinterpreted something?

Thank you in advance!

Best regards

@dakinggg
Collaborator

So, I think this is due to differences in the dependency parse. Our dependency parser is more accurate on biomedical data (but produces different output from spaCy's), and spaCy's noun chunker is defined here (https://github.com/explosion/spaCy/blob/a59f3fcf5dab3acf5570483cc314b47cc5833f39/spacy/lang/en/syntax_iterators.py#L8) in terms of specific dependency relations, so when the relations differ, the chunks differ. See an example of the difference for your sentence below. Perhaps we should write our own noun chunker based on our dependency parser, but I am really not an expert in linguistics. You might get some mileage out of adapting spaCy's noun chunker based on the patterns you observe in our dependency parser's output. Also, @DeNeutoy, do you have any thoughts about this?

In [14]: [(t.text, t.pos_, t.dep_) for t in sci_doc]
Out[14]: 
[('CCR5(+', 'NOUN', 'nsubjpass'),
 (')', 'PUNCT', 'punct'),
 ('and', 'CCONJ', 'cc'),
 ('CXCR3(+', 'NOUN', 'compound'),
 (')', 'PUNCT', 'punct'),
 ('T', 'NOUN', 'compound'),
 ('cells', 'NOUN', 'conj'),
 ('are', 'VERB', 'auxpass'),
 ('increased', 'VERB', 'ROOT'),
 ('in', 'ADP', 'case'),
 ('multiple', 'ADJ', 'amod'),
 ('sclerosis', 'NOUN', 'nmod'),
 ('and', 'CCONJ', 'cc'),
 ('their', 'PRON', 'nmod:poss'),
 ('ligands', 'NOUN', 'conj'),
 ('MIP-1alpha', 'NOUN', 'dep'),
 ('and', 'CCONJ', 'cc'),
 ('IP-10', 'NOUN', 'conj'),
 ('are', 'VERB', 'auxpass'),
 ('expressed', 'VERB', 'conj'),
 ('in', 'ADP', 'case'),
 ('demyelinating', 'VERB', 'amod'),
 ('brain', 'NOUN', 'compound'),
 ('lesions', 'NOUN', 'nmod'),
 ('.', 'PUNCT', 'punct')]

In [15]: [(t.text, t.pos_, t.dep_) for t in web_doc]
Out[15]: 
[('CCR5(+', 'NOUN', 'ROOT'),
 (')', 'PUNCT', 'punct'),
 ('and', 'CCONJ', 'cc'),
 ('CXCR3(+', 'PROPN', 'npadvmod'),
 (')', 'PUNCT', 'punct'),
 ('T', 'NOUN', 'compound'),
 ('cells', 'NOUN', 'nsubjpass'),
 ('are', 'AUX', 'auxpass'),
 ('increased', 'VERB', 'conj'),
 ('in', 'ADP', 'prep'),
 ('multiple', 'ADJ', 'amod'),
 ('sclerosis', 'NOUN', 'pobj'),
 ('and', 'CCONJ', 'cc'),
 ('their', 'PRON', 'poss'),
 ('ligands', 'NOUN', 'conj'),
 ('MIP-1alpha', 'PROPN', 'appos'),
 ('and', 'CCONJ', 'cc'),
 ('IP-10', 'NUM', 'conj'),
 ('are', 'AUX', 'auxpass'),
 ('expressed', 'VERB', 'conj'),
 ('in', 'ADP', 'prep'),
 ('demyelinating', 'VERB', 'amod'),
 ('brain', 'NOUN', 'compound'),
 ('lesions', 'NOUN', 'pobj'),
 ('.', 'PUNCT', 'punct')]
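To make the "adapt spaCy's noun chunker" suggestion concrete, here is a rough sketch in plain Python that applies the same idea (pick NP head tokens by dependency label, then absorb left-hand modifiers) to the (text, pos_, dep_) triples from the scispacy parse above. The label sets are my guesses from inspecting this one parse, not a vetted rule set; a real implementation would follow spaCy's syntax_iterators.py and walk the actual parse tree (e.g. via token.left_edge) rather than scanning a flat list.

```python
# Pattern-based noun-chunk sketch over flat (text, pos, dep) triples.
# The label sets below are guessed from the single en_core_sci parse
# shown above (CLEAR-style labels such as nmod and nmod:poss); they
# are illustrative only, not a complete rule set.

HEAD_DEPS = {"nsubj", "nsubjpass", "dobj", "iobj", "nmod",
             "conj", "dep", "appos", "ROOT"}
MODIFIER_DEPS = {"compound", "amod", "nmod:poss"}  # left modifiers to absorb

def noun_chunks(tokens):
    """Return noun chunks as strings from a flat [(text, pos, dep), ...] list."""
    chunks = []
    for i, (text, pos, dep) in enumerate(tokens):
        # A chunk head is a nominal token in an NP-like dependency relation.
        if pos not in ("NOUN", "PROPN", "PRON") or dep not in HEAD_DEPS:
            continue
        # Extend left over contiguous compound/amod/possessive modifiers.
        start = i
        while start > 0 and tokens[start - 1][2] in MODIFIER_DEPS:
            start -= 1
        chunks.append(" ".join(t[0] for t in tokens[start : i + 1]))
    return chunks

# (text, pos_, dep_) triples from the en_core_sci parse shown above.
sci_tokens = [
    ("CCR5(+", "NOUN", "nsubjpass"), (")", "PUNCT", "punct"),
    ("and", "CCONJ", "cc"), ("CXCR3(+", "NOUN", "compound"),
    (")", "PUNCT", "punct"), ("T", "NOUN", "compound"),
    ("cells", "NOUN", "conj"), ("are", "VERB", "auxpass"),
    ("increased", "VERB", "ROOT"), ("in", "ADP", "case"),
    ("multiple", "ADJ", "amod"), ("sclerosis", "NOUN", "nmod"),
    ("and", "CCONJ", "cc"), ("their", "PRON", "nmod:poss"),
    ("ligands", "NOUN", "conj"), ("MIP-1alpha", "NOUN", "dep"),
    ("and", "CCONJ", "cc"), ("IP-10", "NOUN", "conj"),
    ("are", "VERB", "auxpass"), ("expressed", "VERB", "conj"),
    ("in", "ADP", "case"), ("demyelinating", "VERB", "amod"),
    ("brain", "NOUN", "compound"), ("lesions", "NOUN", "nmod"),
    (".", "PUNCT", "punct"),
]

print(noun_chunks(sci_tokens))
# → ['CCR5(+', 'T cells', 'multiple sclerosis', 'their ligands',
#    'MIP-1alpha', 'IP-10', 'demyelinating brain lesions']
```

Even this crude flat scan recovers most of the chunks the original poster expected; the remaining gaps (e.g. "CCR5(+" not merging across the parenthesis with "CXCR3(+) T cells") come from the parse itself, which is the point of the comment above.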

@dakinggg
Collaborator

Did you have any luck adapting the noun chunker?

@elinantonsson
Author

Sorry for my late response. Your answer was very helpful! I decided to try a different approach, since I did not have enough time in my project to look for these patterns. Thank you very much!

@dakinggg added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Mar 24, 2021
@annahmrichardson

Has anyone worked on a scispacy noun chunker? Thanks!
