Product Attribute Extraction

Extracting product attributes is a core task in e-commerce. Attribute labels and values extracted from free-text product descriptions support many downstream tasks, such as product matching, product categorization, faceted product search, and product recommendation.

The image below shows the attributes extracted from a product titled 'Spigen Samsung S9 Case Ultra Copper Gold'.

I treat this problem as a sequence labelling task and use the BERT for Token Classification model to extract multiple attributes from product offers on an Indonesian e-commerce platform. The dataset is obtained from this previous work. Its annotation scheme has 16 kinds of attributes. Please check the paper directly for more information about the dataset.
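To make the sequence labelling framing concrete, here is a minimal sketch of how per-token BIO tags can be turned back into attribute/value pairs. The function name `bio_to_attributes` and the label names in the usage example (BRAND, DEVICE, TYPE, COLOR) are illustrative assumptions; they are not taken from the dataset's 16-attribute scheme or from this repository's code.

```python
def bio_to_attributes(tokens, tags):
    """Group BIO-tagged tokens into (label, value) pairs.

    Hypothetical helper for illustration; not part of this repository.
    """
    attributes = []
    current_label, current_tokens = None, []

    def flush():
        # Emit the span collected so far, if there is one.
        if current_label is not None:
            attributes.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open span of the same label.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open span, closes it.
            flush()
            current_label, current_tokens = None, []
    flush()
    return attributes
```

For example, calling it with `"Spigen Samsung S9 Case Ultra Copper Gold".split()` and the assumed tags `["B-BRAND", "B-DEVICE", "I-DEVICE", "B-TYPE", "O", "B-COLOR", "I-COLOR"]` groups 'Samsung S9' and 'Copper Gold' into single multi-token values and skips the untagged word 'Ultra'.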

Prepare a Docker environment for the experiment

docker run --rm -it --name=attribute-extraction --gpus '"device=0"' --shm-size 32G --mount type=bind,src=<absolute path to product-attribute-extraction folder>,dst=/workspace/ pytorch/pytorch:1.5.1-cuda10.1-cudnn7-devel

pip install -r requirements.txt

Alternatively, you can use a virtual environment.

Pipeline

Data Preparation

The dataset is annotated in the Enamex format. Preparation takes two steps: converting Enamex to the Stanford format, then converting the Stanford format to the BIO format. A manual inspection of the dataset showed that some incorrect labels in the raw data cause the Enamex-to-Stanford conversion to fail. To handle this, I manually fixed the malformed Enamex annotations in the original file.

python preprocess.py
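The conversion idea can be sketched as follows. This is an illustration under stated assumptions, not the code in preprocess.py: it assumes annotations look like `<ENAMEX TYPE="LABEL">text</ENAMEX>`, it goes straight from Enamex to BIO without the intermediate Stanford file, and the names `enamex_to_bio` and `has_unbalanced_tags` are hypothetical.

```python
import re

# Assumed annotation shape: <ENAMEX TYPE="BRAND">Spigen</ENAMEX>
ENAMEX_SPAN = re.compile(r'<ENAMEX\s+TYPE="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)


def enamex_to_bio(text):
    """Return (token, BIO tag) pairs for one Enamex-annotated line."""
    pairs = []
    cursor = 0
    for match in ENAMEX_SPAN.finditer(text):
        # Text between annotated spans carries no attribute label.
        pairs.extend((tok, "O") for tok in text[cursor:match.start()].split())
        label, span = match.group(1), match.group(2)
        for i, tok in enumerate(span.split()):
            # First token of a span gets B-, the rest get I-.
            pairs.append((tok, ("B-" if i == 0 else "I-") + label))
        cursor = match.end()
    # Trailing text after the last annotated span.
    pairs.extend((tok, "O") for tok in text[cursor:].split())
    return pairs


def has_unbalanced_tags(text):
    """Flag lines whose opening and closing ENAMEX tags do not pair up."""
    return text.count("<ENAMEX") != text.count("</ENAMEX>")
```

A check like `has_unbalanced_tags` is one possible way to locate malformed lines before conversion, so they can be corrected by hand as described above.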

Sequence Labelling

Sequence labelling is a typical NLP task that assigns a label or class to each token in a sequence. In this context, a single word is a 'token'. The resulting tags can serve as token features in downstream models or otherwise improve them. The figure below illustrates fine-tuning BERT for text tagging applications.

To fine-tune the model, run this script:

python fine_tune.py
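One practical detail when fine-tuning BERT on word-level tags is that its tokenizer may split a word into several subword pieces, so the labels usually need to be aligned to the pieces. The sketch below shows one common convention, which is an assumption here and not necessarily what fine_tune.py does: only the first piece of each word keeps the label, and the other positions get -100, the default `ignore_index` of PyTorch's cross-entropy loss. The input `word_ids` is assumed to be a per-piece list of word indices (with `None` for special tokens), such as the one Hugging Face fast tokenizers expose through `word_ids()`; the name `align_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, label_to_id):
    """Map word-level labels onto subword positions.

    word_ids: one entry per subword piece, giving the index of the word it
              came from, or None for special tokens such as [CLS] and [SEP].
    word_labels: one BIO tag per original word.
    label_to_id: mapping from tag string to integer class id.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens do not contribute to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First piece of a word carries the word's label.
            aligned.append(label_to_id[word_labels[word_id]])
        else:
            # Later pieces of the same word are ignored by the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned
```

With this convention the model is trained and evaluated on exactly one prediction per original word, which keeps the output comparable to the word-level BIO annotations.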

mhilmiasyrofi/product-attribute-extraction

Folders and files

Latest commit

History

Repository files navigation

Product Attribute Extraction

Prepare a Docker environment for the experiment

Pipeline

Data Preparation

Sequence Labelling

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages