A tool for iteratively building a COO matrix from a large corpus of text data
A COO matrix (aka 'ijv' or triplet format) is a way of storing a matrix as three arrays, which uses far less memory if the matrix is sparse.
For example, if you are tracking word occurrences in 100,000 (M) documents and your set of words of interest has 10,000 (N) entries, that is 1,000,000,000 (M x N) cells of information (integers) if stored as a dense MxN array.
However, if a typical document only contains about 200 (S) of the words of interest, you can exploit the fact that this matrix will be very sparse (most values equal to zero). Since only 20,000,000 (S x M) cells will be non-zero, you can store the ijv values (the row index, the column index and the cell value of each non-zero cell) in three arrays of size 20,000,000, meaning you only need to store 60,000,000 integers. This is a 94% reduction in memory usage: you keep only 3 x S/N = 6% of the original M x N integers.
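To make the arithmetic concrete, here is the same back-of-the-envelope calculation in plain Python (the variable names are just for illustration):

```python
M = 100_000   # documents (rows)
N = 10_000    # words of interest (columns)
S = 200       # words of interest appearing in a typical document

dense_cells = M * N        # 1,000,000,000 integers in a dense MxN array
triplet_cells = 3 * S * M  # 60,000,000 integers across the i, j, v arrays

print(f"{1 - triplet_cells / dense_cells:.0%}")  # -> 94%
```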
Instead of storing the matrix directly as a dense array, you can store only these triplets.
Suppose you're doing a bag-of-words analysis on some large number (M) of documents.
You want to build a matrix where:
- each row corresponds to a document
- each column corresponds to a word
- the value in the cell is the # of times that word occurred in that document
You want to build this matrix iteratively, which raises two questions:
- How many columns do you need?
- How do you efficiently build this object iteratively?
For small datasets this is fine: you can use a list of lists, and if you find a new word in a document and need to create a "new column" you just create a new list. For large datasets that becomes a huge memory issue, as the sketch below illustrates.
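For illustration, here is roughly what that naive approach looks like (a hypothetical sketch, not part of this module), including the all-rows padding that makes it so wasteful:

```python
# Naive dense approach: one list of counts per document, one column per word.
# A new word forces a new column in *every* document, so memory grows as
# M x N even though most cells stay zero.
columns = {}   # word -> column index
rows = []      # one list of counts per document

for doc_words in [["hello", "world"], ["hello", "hello"]]:
    counts = [0] * len(columns)
    for word in doc_words:
        if word not in columns:        # new word -> new column
            columns[word] = len(columns)
            for row in rows:           # pad every earlier document with a zero
                row.append(0)
            counts.append(0)
        counts[columns[word]] += 1
    rows.append(counts)

print(rows)  # [[1, 1], [2, 0]]
```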
The solution is to use a COO matrix.
The coo-builder module is based on this guide to building a COO matrix, but it solves the additional problem of how to build a COO matrix iteratively, i.e. how to build it when you don't know ex ante what shape it will be (which will often be the case if your list of "words of interest" grows as you iteratively read more documents).
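The core idea is small enough to sketch. The following is not the module's actual implementation, just a minimal illustration of how such a builder can work: keep a growing word-to-column dictionary alongside three parallel lists for the i, j, v triplets, and only fix the shape at the end.

```python
from scipy.sparse import coo_matrix

class MiniCOOBuilder:
    """Illustrative sketch only, not the real COOBuilder."""

    def __init__(self):
        self.terms = {}  # word -> column index, grows as new words appear
        self.rows, self.cols, self.vals = [], [], []

    def add_doc_counter(self, doc_ind, counter):
        for word, count in counter.items():
            # setdefault hands the next free column index to unseen words
            col = self.terms.setdefault(word, len(self.terms))
            self.rows.append(doc_ind)
            self.cols.append(col)
            self.vals.append(count)

    def to_coo(self):
        n_rows = max(self.rows) + 1 if self.rows else 0
        return coo_matrix((self.vals, (self.rows, self.cols)),
                          shape=(n_rows, len(self.terms)))
```

The actual module follows the same two-step pattern: create a builder, then feed it one document at a time.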
```python
from coo_builder import COOBuilder

builder = COOBuilder()
```
Iterate over each file/document in your dataset and for each one produce a `Counter` object that maps words to counts, e.g. `{"hello": 3, "world": 5}`. Then add this counter to the COO by using the `add_doc_counter` method:
```python
from collections import Counter

for file_ind, file in enumerate(some_iterable_of_files):
    # Read/process the file: apply any formatting, filtering or removal of
    # stop words, ending up with a mapping from words (str) to counts (int).
    # A bare-bones version just splits the raw text on whitespace:
    with open(file) as f:
        term_counter = Counter(f.read().split())
    # Add the document's counts to the builder
    builder.add_doc_counter(file_ind, term_counter)
```
The `file_ind` is assumed to be an incremental integer from 0 to M, so you can use `enumerate` over an iterable to generate it (but you should keep track of the `file_ind` yourself if you need to look up specific files/words later). The `file_ind` supplies the 'i' part of the ijv triplets. The words are added to a dictionary of terms in the COOBuilder, which maps each unique word to a unique integer (the 'j' part).
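If you will need to trace rows back to files later, one simple pattern (hypothetical, not part of the module) is to record the mapping as you go:

```python
file_lookup = {}  # row index -> file, so rows can be traced back later

for file_ind, file in enumerate(some_iterable_of_files):
    file_lookup[file_ind] = file
    # ... build term_counter and call builder.add_doc_counter as above ...
```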
When the build process is finished, the builder can generate a COO matrix (from SciPy's sparse matrix family) by using the `to_coo` method:

```python
coo_matrix = builder.to_coo()
```
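The returned object is a standard `scipy.sparse` matrix, so the usual conversions and operations apply. For example (assuming the `coo_matrix` variable from above), row slicing and column sums are typically done after converting to CSR:

```python
csr = coo_matrix.tocsr()   # compressed sparse row: supports row slicing
print(csr[0].toarray())    # word counts for document 0, as a dense row
print(csr.sum(axis=0))     # total count of each word across all documents
```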
To see the mapping from actual words to word indices (columns), you can inspect the `terms` dictionary:

```python
>>> builder.terms
{'apple': 0, 'banana': 1, 'pear': 2, ...}
```
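Since `terms` maps words to column indices, inverting it gives you the column-to-word lookup:

```python
# Invert the word -> column mapping to recover a word from its column index
index_to_word = {col: word for word, col in builder.terms.items()}
print(index_to_word[1])  # -> 'banana'
```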