
Chunking seems to not be working properly #326

Open
elinantonsson opened this issue Feb 25, 2021 · 4 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@elinantonsson

Hello and thank you for creating this tool!

I have been trying to use noun_chunks with your pipelines, but it does not seem to be working correctly. I have tried en_core_sci_sm, en_core_sci_md, and en_core_sci_lg. For example, when I input the sentence "CCR5(+) and CXCR3(+) T cells are increased in multiple sclerosis and their ligands MIP-1alpha and IP-10 are expressed in demyelinating brain lesions." I only get "CCR5(+", "CXCR3(+" and "T cells" as chunks, and I would expect more. For comparison, using spaCy's en_core_web_trf I get "CCR5(+) and CXCR3(+) T cells", "multiple sclerosis", "their ligands", "MIP-1alpha", "IP-10" and "brain lesions".

Is the chunking supposed to work in a similar way to spaCy's pipelines, or have I misinterpreted something?

Thank you in advance!

Best regards

@dakinggg
Collaborator

So, I think this is due to differences in the dependency parse. Our dependency parser is more accurate on biomedical data (but produces different output from spaCy's), and spaCy's noun chunker is defined here (https://github.com/explosion/spaCy/blob/a59f3fcf5dab3acf5570483cc314b47cc5833f39/spacy/lang/en/syntax_iterators.py#L8) in terms of specific dependency relations, so when the relations differ, the chunks differ. See an example of the difference for your sentence below. Perhaps we should write our own noun chunker based on our dependency parser, but I am really not an expert in linguistics. You might get some mileage out of adapting spaCy's noun chunker based on the patterns you observe in our dependency parser's output. Also, @DeNeutoy, do you have any thoughts about this?

In [14]: [(t.text, t.pos_, t.dep_) for t in sci_doc]
Out[14]: 
[('CCR5(+', 'NOUN', 'nsubjpass'),
 (')', 'PUNCT', 'punct'),
 ('and', 'CCONJ', 'cc'),
 ('CXCR3(+', 'NOUN', 'compound'),
 (')', 'PUNCT', 'punct'),
 ('T', 'NOUN', 'compound'),
 ('cells', 'NOUN', 'conj'),
 ('are', 'VERB', 'auxpass'),
 ('increased', 'VERB', 'ROOT'),
 ('in', 'ADP', 'case'),
 ('multiple', 'ADJ', 'amod'),
 ('sclerosis', 'NOUN', 'nmod'),
 ('and', 'CCONJ', 'cc'),
 ('their', 'PRON', 'nmod:poss'),
 ('ligands', 'NOUN', 'conj'),
 ('MIP-1alpha', 'NOUN', 'dep'),
 ('and', 'CCONJ', 'cc'),
 ('IP-10', 'NOUN', 'conj'),
 ('are', 'VERB', 'auxpass'),
 ('expressed', 'VERB', 'conj'),
 ('in', 'ADP', 'case'),
 ('demyelinating', 'VERB', 'amod'),
 ('brain', 'NOUN', 'compound'),
 ('lesions', 'NOUN', 'nmod'),
 ('.', 'PUNCT', 'punct')]

In [15]: [(t.text, t.pos_, t.dep_) for t in web_doc]
Out[15]: 
[('CCR5(+', 'NOUN', 'ROOT'),
 (')', 'PUNCT', 'punct'),
 ('and', 'CCONJ', 'cc'),
 ('CXCR3(+', 'PROPN', 'npadvmod'),
 (')', 'PUNCT', 'punct'),
 ('T', 'NOUN', 'compound'),
 ('cells', 'NOUN', 'nsubjpass'),
 ('are', 'AUX', 'auxpass'),
 ('increased', 'VERB', 'conj'),
 ('in', 'ADP', 'prep'),
 ('multiple', 'ADJ', 'amod'),
 ('sclerosis', 'NOUN', 'pobj'),
 ('and', 'CCONJ', 'cc'),
 ('their', 'PRON', 'poss'),
 ('ligands', 'NOUN', 'conj'),
 ('MIP-1alpha', 'PROPN', 'appos'),
 ('and', 'CCONJ', 'cc'),
 ('IP-10', 'NUM', 'conj'),
 ('are', 'AUX', 'auxpass'),
 ('expressed', 'VERB', 'conj'),
 ('in', 'ADP', 'prep'),
 ('demyelinating', 'VERB', 'amod'),
 ('brain', 'NOUN', 'compound'),
 ('lesions', 'NOUN', 'pobj'),
 ('.', 'PUNCT', 'punct')]
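To make the "adapt spaCy's noun chunker" suggestion concrete, here is a rough sketch in plain Python that applies the same idea (pick NP head tokens by dependency label, then absorb left-hand modifiers) to the (text, pos_, dep_) triples from the scispacy parse above. The label sets are my guesses from inspecting this one parse, not a vetted rule set; a real implementation would follow spaCy's syntax_iterators.py and walk the actual parse tree (e.g. via token.left_edge) rather than scanning a flat list.

```python
# Pattern-based noun-chunk sketch over flat (text, pos, dep) triples.
# The label sets below are guessed from the single en_core_sci parse
# shown above (CLEAR-style labels such as nmod and nmod:poss); they
# are illustrative only, not a complete rule set.

HEAD_DEPS = {"nsubj", "nsubjpass", "dobj", "iobj", "nmod",
             "conj", "dep", "appos", "ROOT"}
MODIFIER_DEPS = {"compound", "amod", "nmod:poss"}  # left modifiers to absorb

def noun_chunks(tokens):
    """Return noun chunks as strings from a flat [(text, pos, dep), ...] list."""
    chunks = []
    for i, (text, pos, dep) in enumerate(tokens):
        # A chunk head is a nominal token in an NP-like dependency relation.
        if pos not in ("NOUN", "PROPN", "PRON") or dep not in HEAD_DEPS:
            continue
        # Extend left over contiguous compound/amod/possessive modifiers.
        start = i
        while start > 0 and tokens[start - 1][2] in MODIFIER_DEPS:
            start -= 1
        chunks.append(" ".join(t[0] for t in tokens[start : i + 1]))
    return chunks

# (text, pos_, dep_) triples from the en_core_sci parse shown above.
sci_tokens = [
    ("CCR5(+", "NOUN", "nsubjpass"), (")", "PUNCT", "punct"),
    ("and", "CCONJ", "cc"), ("CXCR3(+", "NOUN", "compound"),
    (")", "PUNCT", "punct"), ("T", "NOUN", "compound"),
    ("cells", "NOUN", "conj"), ("are", "VERB", "auxpass"),
    ("increased", "VERB", "ROOT"), ("in", "ADP", "case"),
    ("multiple", "ADJ", "amod"), ("sclerosis", "NOUN", "nmod"),
    ("and", "CCONJ", "cc"), ("their", "PRON", "nmod:poss"),
    ("ligands", "NOUN", "conj"), ("MIP-1alpha", "NOUN", "dep"),
    ("and", "CCONJ", "cc"), ("IP-10", "NOUN", "conj"),
    ("are", "VERB", "auxpass"), ("expressed", "VERB", "conj"),
    ("in", "ADP", "case"), ("demyelinating", "VERB", "amod"),
    ("brain", "NOUN", "compound"), ("lesions", "NOUN", "nmod"),
    (".", "PUNCT", "punct"),
]

print(noun_chunks(sci_tokens))
# → ['CCR5(+', 'T cells', 'multiple sclerosis', 'their ligands',
#    'MIP-1alpha', 'IP-10', 'demyelinating brain lesions']
```

Even this crude flat scan recovers most of the chunks the original poster expected; the remaining gaps (e.g. "CCR5(+" not merging across the parenthesis with "CXCR3(+) T cells") come from the parse itself, which is the point of the comment above.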

@dakinggg
Collaborator

Did you have any luck adapting the noun chunker?

@elinantonsson
Author

Sorry for my late response. Your answer was very helpful! I decided to try a different approach, since I did not have enough time in my project to look for these patterns. Thank you very much!

@dakinggg added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Mar 24, 2021
@annahmrichardson

Has anyone worked on a scispacy noun chunker? Thanks!
