A tool for iteratively building a COO matrix from a large corpus of text data
A COO matrix (aka 'ijv' or triplet format) is a way of storing a matrix as three arrays, which uses far less memory if the matrix is sparse.
For example, if you are tracking word occurrences in 100,000 (M) documents and your set of words of interest has 10,000 (N) entries, that is 1,000,000,000 (M x N) cells of information (integers) if stored as a dense MxN array.
However, if a typical document only contains about 200 (S) of the words of interest, you can exploit the fact that this matrix will be very sparse (most values equal to zero). Since only 20,000,000 (S x M) cells will be non-zero, you can store the ijv values (the row index, the column index and the cell value of each non-zero cell) in three arrays of size 20,000,000, meaning you only need to store 60,000,000 integers. This is a 94% reduction in memory usage: you keep only 3 x S/N = 6% of the original M x N integers.
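To make the arithmetic concrete, here is the same back-of-the-envelope calculation in plain Python (the variable names are just for illustration):

```python
M = 100_000   # documents (rows)
N = 10_000    # words of interest (columns)
S = 200       # words of interest appearing in a typical document

dense_cells = M * N        # 1,000,000,000 integers in a dense MxN array
triplet_cells = 3 * S * M  # 60,000,000 integers across the i, j, v arrays

print(f"{1 - triplet_cells / dense_cells:.0%}")  # -> 94%
```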
Instead of storing the matrix directly as a dense array, you can store only these triplets.
Suppose you're doing a bag-of-words analysis on some large number (M) of documents.
You want to build a matrix where:
- each row corresponds to a document
- each column corresponds to a word
- the value in the cell is the # of times that word occurred in that document
You want to build this matrix iteratively, which raises two questions:
- How many columns do you need?
- How do you efficiently build this object iteratively?
For small datasets this is fine: you can use a list of lists, and if you find a new word in a document and need to create a "new column" you just create a new list. For large datasets that becomes a huge memory issue, as the sketch below illustrates.
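For illustration, here is roughly what that naive approach looks like (a hypothetical sketch, not part of this module), including the all-rows padding that makes it so wasteful:

```python
# Naive dense approach: one list of counts per document, one column per word.
# A new word forces a new column in *every* document, so memory grows as
# M x N even though most cells stay zero.
columns = {}   # word -> column index
rows = []      # one list of counts per document

for doc_words in [["hello", "world"], ["hello", "hello"]]:
    counts = [0] * len(columns)
    for word in doc_words:
        if word not in columns:        # new word -> new column
            columns[word] = len(columns)
            for row in rows:           # pad every earlier document with a zero
                row.append(0)
            counts.append(0)
        counts[columns[word]] += 1
    rows.append(counts)

print(rows)  # [[1, 1], [2, 0]]
```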
The solution is to use a COO matrix.
The coo-builder module is based on this guide to building a COO matrix, but it solves the additional problem of how to build a COO matrix iteratively, i.e. how to build it when you don't know ex ante what shape it will be (which will often be the case if your list of "words of interest" grows as you iteratively read more documents).
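The core idea is small enough to sketch. The following is not the module's actual implementation, just a minimal illustration of how such a builder can work: keep a growing word-to-column dictionary alongside three parallel lists for the i, j, v triplets, and only fix the shape at the end.

```python
from scipy.sparse import coo_matrix

class MiniCOOBuilder:
    """Illustrative sketch only, not the real COOBuilder."""

    def __init__(self):
        self.terms = {}  # word -> column index, grows as new words appear
        self.rows, self.cols, self.vals = [], [], []

    def add_doc_counter(self, doc_ind, counter):
        for word, count in counter.items():
            # setdefault hands the next free column index to unseen words
            col = self.terms.setdefault(word, len(self.terms))
            self.rows.append(doc_ind)
            self.cols.append(col)
            self.vals.append(count)

    def to_coo(self):
        n_rows = max(self.rows) + 1 if self.rows else 0
        return coo_matrix((self.vals, (self.rows, self.cols)),
                          shape=(n_rows, len(self.terms)))
```

The actual module follows the same two-step pattern: create a builder, then feed it one document at a time.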
```python
from coo_builder import COOBuilder

builder = COOBuilder()
```
Iterate over each file/document in your dataset and for each one produce a `Counter` object that maps words to counts, e.g. `{"hello": 3, "world": 5}`. Then add this counter to the COO by using the `add_doc_counter` method:
```python
from collections import Counter

for file_ind, file in enumerate(some_iterable_of_files):
    # Read/process the file: apply any formatting, filtering or removal of
    # stop words, ending up with a mapping from words (str) to counts (int).
    # A bare-bones version just splits the raw text on whitespace:
    with open(file) as f:
        term_counter = Counter(f.read().split())
    # Add the document's counts to the builder
    builder.add_doc_counter(file_ind, term_counter)
```
The `file_ind` is assumed to be an incremental integer from 0 to M, so you can use `enumerate` over an iterable to generate it (but you should keep track of the `file_ind` yourself if you need to look up specific files/words later). The `file_ind` supplies the 'i' part of the ijv triplets. The words are added to a dictionary of terms in the COOBuilder, which maps each unique word to a unique integer (the 'j' part).
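If you will need to trace rows back to files later, one simple pattern (hypothetical, not part of the module) is to record the mapping as you go:

```python
file_lookup = {}  # row index -> file, so rows can be traced back later

for file_ind, file in enumerate(some_iterable_of_files):
    file_lookup[file_ind] = file
    # ... build term_counter and call builder.add_doc_counter as above ...
```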
When the build process is finished, the builder can generate a COO matrix (from SciPy's sparse matrix family) by using the `to_coo` method:

```python
coo_matrix = builder.to_coo()
```
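The returned object is a standard `scipy.sparse` matrix, so the usual conversions and operations apply. For example (assuming the `coo_matrix` variable from above), row slicing and column sums are typically done after converting to CSR:

```python
csr = coo_matrix.tocsr()   # compressed sparse row: supports row slicing
print(csr[0].toarray())    # word counts for document 0, as a dense row
print(csr.sum(axis=0))     # total count of each word across all documents
```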
To see the mapping from actual words to word indices (columns), you can inspect the `terms` dictionary:

```python
>>> builder.terms
{'apple': 0, 'banana': 1, 'pear': 2, ...}
```
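Since `terms` maps words to column indices, inverting it gives you the column-to-word lookup:

```python
# Invert the word -> column mapping to recover a word from its column index
index_to_word = {col: word for word, col in builder.terms.items()}
print(index_to_word[1])  # -> 'banana'
```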