
Issues with stopwords when working with make_doc_from_text_chunks #230

Open
stuartspotlight opened this issue Feb 1, 2019 · 1 comment

@stuartspotlight
I'm having numerous issues with stopwords when working with textacy's make_doc_from_text_chunks functionality.

Expected Behavior

I want to be able to load a model and then fire documents at it in order to find keywords, in a way that lets me reset the stopwords I'm using from document to document.

Current Behavior

Setting stopwords for the first document works fine, but when I attempt to reset the stopwords for the next document, the stopwords appear to revert to the default and I can't apply a new, custom set. It also seems to miss some stopwords on the first pass.

Possible Solution

I think a flag is being set somewhere in textacy when I call make_doc_from_text_chunks to set the stopwords, and I can't for the life of me find a way to unset it. I would say this is a bug somewhere.
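For what it's worth, this looks consistent with how spaCy caches the stop flag: `is_stop` is stored on each `Lexeme` when it is first created from `Defaults.stop_words`, so mutating that set afterwards does not refresh lexemes that already exist in the vocab, whereas setting the flag on the lexeme directly takes effect immediately. A minimal sketch of that second path (using `spacy.blank('en')` so no model download is needed; this is my reading of the behaviour, not a confirmed diagnosis):

```python
import spacy

nlp = spacy.blank("en")

# Setting the flag on the shared vocab's lexeme takes effect immediately,
# independent of Defaults.stop_words:
nlp.vocab["apple"].is_stop = True

doc = nlp("I like apple pie")
print([t.text for t in doc if t.is_stop])  # "apple" is among the flagged tokens
```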

Steps to Reproduce (for bugs)

In order to ensure reproducibility I have provided both some example python code showing the bug and a Dockerfile (in the environment section) which should make it easy to reproduce the problem. Example code:-

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb  1 10:40:22 2019

@author: stuart
"""

#reporting textacy's stopwords bug

import spacy
import textacy

#This is an example document I've made to show this issue happening
t = '''Here is an example document. It has a number of words. It is a good document.

Documents are good. Document for Documents.

Apple is a company worth over $1tr. We have to ask how many documents can a person write in a week.
The word documents is being deliberately overused. Just document it! Apples are a fruit I'm interested in.
How do you feel about apples, I'm a big fan of Apples. Is it all Apples or just the ones at the end of a sentence?'''


#problem 1. Unable to change stopwords of initiated model

#load our model
model = spacy.load('en_core_web_sm')

#set the first set of stopwords
example_stops1 = ['Apples', 'Apple', 'apples', 'apple']

#add the stopwords to the model
model.Defaults.stop_words |= set(example_stops1)

#create a document using the make doc from text tool in order to avoid problems
#with massive documents
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)

#check that the stopwords have been correctly identified
for word in doc:
    
    if word.is_stop:
        print(word)

print("=====================")

#remove the first set of stopwords from the list of stopwords, this seems to work
#ok
model.Defaults.stop_words -= set(example_stops1)

#now set another set of stopwords 
example_stops2 = ['Document', 'Documents', 'document', 'documents']

#add the new set of stopwords to the model
model.Defaults.stop_words |= set(example_stops2)

#demonstrate that all the stopwords we want to be there are there
print("+++++++++++++++++++++++++++++++++++++")
print(model.Defaults.stop_words)
print("+++++++++++++++++++++++++++++++++++++")

#create a second document without our new set of stopwords
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model) 

#check that stopwords set 2 have been correctly identified
for word in doc:
    
    if word.is_stop:
        print(word)
    else:
        if word.text in example_stops2:
            print(word, word.is_stop)
        
print("+++++++++++++++++++++++")

Details of the docker container are given in the environment section.

Context

I want to create a tool that produces keywords from arbitrarily large documents, with stopwords chosen to suit their context, and that does not require a restart when switching to a different set of documents. For example, a series of financial reports should not return "fiscal" or "financial" among their keywords, and the tool should not have to restart in order to process a series of performance reviews with "performance" set as a stopword.

Your Environment

Run in a docker container using the following code:-

FROM python:3.6

RUN apt-get update
RUN apt-get install build-essential -y

#install basic requirements
ADD ./requirements.txt /
RUN pip install -r /requirements.txt

#add nltk model
RUN python -m nltk.downloader 'punkt'

#Install spacy model
RUN python -m spacy download en

ADD reporting_textacys_stopwords_bug.py /

CMD ["python", "reporting_textacys_stopwords_bug.py"]

and the requirements file is:-

numpy==1.16.1
nltk==3.3
rake-nltk==1.0.1
scipy==1.0.0
spacy==2.0.18
textacy==0.6.2

  • operating system: Ubuntu (dockerised)
  • python version: 3.6
  • spacy version: 2.0.18
  • installed spacy models: en, en_core_web_sm
  • textacy version: 0.6.2
@stuartspotlight (Author)
I've been experimenting with this and I think I've found a workaround, although it may be very computationally inefficient: I initialize the document, then set the stopwords, then re-initialize the document. This not only solves the issue of not being able to reset stopwords, it also seems to fix the issue of some stopwords not being picked up on the first pass. The need to do this is very odd behavior, however:

import spacy
import textacy



def add_stopwords_in(doc, stopwords):
    
    
    for word in stopwords:
        
        doc.vocab[word].is_stop=True
        
    #note: relies on the global `model` defined below
    doc = textacy.spacier.utils.make_doc_from_text_chunks(doc.text, lang=model)
    
    return doc
        

#This is an example document I've made to show this issue happening
t = '''Here is an example document. It has a number of words. It is a good document.

Documents are good. Document for Documents.

Apple is a company worth over $1tr. We have to ask how many documents can a person write in a week.
The word documents is being deliberately overused. Just document it! Apples are a fruit I'm interested in.
How do you feel about apples, I'm a big fan of Apples. Is it all Apples or just the ones at the end of a sentence?'''

#load our model
model = spacy.load('en_core_web_sm')

#set the first set of stopwords
example_stops1 = ['Apples', 'Apple', 'apples', 'apple']

#add the stopwords to the model
#model.Defaults.stop_words |= set(example_stops1)

#create a document using the make doc from text tool in order to avoid problems
#with massive documents
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)

doc = add_stopwords_in(doc, example_stops1)
#check that the stopwords have been correctly identified
for word in doc:
    
    if word.is_stop:
        print(word)
    else:
        if word.text in example_stops1:
            print(word, word.is_stop)

print("=====================")

del doc



doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)

#remove the old stopwords
for word in example_stops1:
    doc.vocab[word].is_stop=False


#now set another set of stopwords 
example_stops2 = ['Document', 'Documents', 'document', 'documents']

#add the new stopwords in
doc = add_stopwords_in(doc, example_stops2)
    
#check that the stopwords have been correctly identified
for word in doc:
    
    if word.is_stop:
        print(word)
    else:
        if word.text in example_stops2:
            print(word, word.is_stop)

print("=====================")
