This project classifies retail products into categories. Although in this example the categories are structured in a hierarchy, to keep it simple I treated all subcategories as top-level categories. The main packages used in this project are sklearn, nltk and dataset.
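As a rough illustration of that flattening decision (not the project's actual code), each product simply keeps its deepest subcategory as if it were a top-level label:

```python
# Illustrative only: how a hierarchical category path can be reduced to a
# single flat label. The real labels come from the scraped Amazon data.
def flatten_category(path, separator=" > "):
    """Keep only the deepest subcategory and treat it as a top-level label."""
    return path.split(separator)[-1]

print(flatten_category("Electronics > Audio > Headphones"))  # -> "Headphones"
```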
You can read the post explaining this project here.
You will need Python 3+ to use this project.
Now, you need the text-classification-python project files in your workspace:
$ git clone https://github.com/joaorafaelm/text-classification-python;
$ cd text-classification-python;
You should already know what virtualenv is at this stage. Simply create one for the project:
$ virtualenv venv;
$ source venv/bin/activate;
The dependencies are listed in requirements.txt. To install them, simply run:
$ pip install -r requirements.txt
To run the scraper you will need a CSV of ASINs (Amazon's product identifiers); just search the web for one. Then run:
python amazon_scrape.py
All data will be saved into an SQLite database (the file database.db), in the products table.
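The rows are presumably written through the dataset package listed above; a minimal sketch of that pattern (the column names below are assumptions, the real fields are whatever amazon_scrape.py saves) looks like this:

```python
import dataset

# Connect to the same SQLite file the scraper writes to.
db = dataset.connect("sqlite:///database.db")
products = db["products"]

# Hypothetical row: the actual columns are defined by amazon_scrape.py.
products.insert({
    "asin": "B000000000",
    "title": "Example product title",
    "category": "Headphones",
})

# Inspect what has been stored so far.
print(products.count(), "products in the table")
for row in products.find(category="Headphones"):
    print(row["asin"], row["title"])
```

With the table populated, export the scraped data to JSON: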
datafreeze .datafreeze.yaml
This will create a JSON file under the dumps/ directory. Next, prepare the data for classification:
python data_prep.py
The script will create a new file called products.json at the root of the project, and print out the category tree structure. Change the values of the variables default_depth, min_samples and domain if you need more data.
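A minimal sketch of the kind of filtering those variables control, assuming products.json holds a list of records with a category field (the field names here are assumptions):

```python
import json
from collections import Counter

min_samples = 50  # drop categories with fewer examples than this

with open("products.json") as f:
    products = json.load(f)  # assumed: list of {"title": ..., "category": ...}

# Keep only categories that have enough training examples.
counts = Counter(p["category"] for p in products)
kept = {category for category, n in counts.items() if n >= min_samples}
filtered = [p for p in products if p["category"] in kept]
print(f"kept {len(kept)} categories and {len(filtered)} products")
```

Finally, train and evaluate the classifier: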
python classify.py
It will print out the accuracy of each category, along with the confusion matrix.
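classify.py's exact setup lives in the repository; the sketch below shows the same general idea with sklearn (TF-IDF features plus a linear model, then per-category metrics and a confusion matrix), again assuming products.json has title and category fields:

```python
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

with open("products.json") as f:
    products = json.load(f)  # assumed fields: "title" and "category"

texts = [p["title"] for p in products]
labels = [p["category"] for p in products]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# TF-IDF features fed into a linear classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))  # per-category precision/recall/f1
print(confusion_matrix(y_test, predictions))
```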