Address Element Extraction

Shopee Code League 2021 - Data Science Competition

In this competition, we built an AI solution to extract point of interest (POI) names and street names from unformatted Indonesian addresses collected by Shopee. We are happy to share our solution, which ranked 28th out of 1,034 teams. See the Kaggle private leaderboard for the final standings.

Problem Description

Given a raw_address, the model should produce two predictions, one for the POI and one for the street, separated by a “/” character with no spaces around it. In some cases the POI or street elements in the raw_address are incomplete (truncated). For these cases, the model also needs to predict the complete words before returning the result.

id	raw_address	POI/street
1	karang mulia bengkel mandiri motor raya bosnik 21 blak kota	bengkel mandiri motor/raya bosnik
2	primkob pabri adiwerna	primkob pabri/
3	jalan mh thamrin, sei rengas i kel. medan kota	/jalan mh thamrin
4	smk karya pemban, pon	smk karya pembangunan/pon

Explanation:

The POI is "bengkel mandiri motor" and the street name is "raya bosnik", so the returned POI/street should be:
- "bengkel mandiri motor/raya bosnik"
The POI is "primkob pabri" and no street name is found, so the returned POI/street should be:
- "primkob pabri/"
No POI is found and the street name is "jalan mh thamrin", so the returned POI/street should be:
- "/jalan mh thamrin"
The word "pembangunan" is truncated to "pemban" in the raw_address "smk karya pemban, pon". The correct POI is "smk karya pembangunan", so the returned result should be:
- "smk karya pembangunan/pon"

Solution

1. Data Preparation

1.1 Text Cleaning

- Drop rows whose POI/street label contains a dot. These are rare (0.3% of the data) and can add noise to the model.
- Clean raw_address: collapse repeated whitespace, remove dots, correct the punctuation, and remove brackets.

See Data-Cleaning.ipynb for the implementation.
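The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code from Data-Cleaning.ipynb: the function names `drop_dotted_labels` and `clean_raw_address`, the choice to replace dots and brackets with a space, and the comma-spacing rule are all assumptions.

```python
import re

def drop_dotted_labels(rows):
    """Keep only rows whose POI/street label has no dot (hypothetical helper)."""
    return [row for row in rows if "." not in row["POI/street"]]

def clean_raw_address(text: str) -> str:
    """Apply the cleaning steps in a fixed order (assumed order, for illustration)."""
    text = re.sub(r"[()\[\]{}]", " ", text)  # remove brackets, keep their contents
    text = text.replace(".", " ")            # remove dots
    text = re.sub(r"\s*,\s*", ", ", text)    # normalize spacing around commas
    text = re.sub(r"\s+", " ", text)         # collapse repeated whitespace
    return text.strip()

print(clean_raw_address("jalan mh thamrin ,sei rengas i  kel. medan (kota)"))
# jalan mh thamrin, sei rengas i kel medan kota
```

Replacing a dot with a space rather than deleting it keeps tokens such as "kel.medan" from being glued together; the notebook may handle this differently.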

1.2 Text Repair

We use a probabilistic model to repair truncated text in the raw address. The model relies on how often each n-gram in the training data is transformed into its complete form.

Examples of the transformation frequencies:
- transform_occurency["cak"] = {'cakung': 15, 'cakruk': 1, "cake's": 1, 'cakery': 1, 'cakrad': 1, 'cakrab': 1}
- transform_occurency["taman mer"] = {'taman meruya': 2}
In the first example, the token "cak" is transformed into "cakung" 15 times in the training data. For more accurate frequency information, we also use the transformation frequencies of bigrams, 3-grams, and 4-grams.

See Data-Formatting.ipynb for the implementation.
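One way such a frequency table could drive the repair is sketched below. It is an illustration under assumptions, not the notebook's code: the function `repair_tokens`, the greedy longest-n-gram-first matching, and the `min_prob` threshold are all hypothetical. Only the table entries come from the examples above.

```python
from collections import Counter

# Table entries copied from the examples: raw n-gram -> counts of complete forms.
transform_occurency = {
    "cak": Counter({"cakung": 15, "cakruk": 1, "cake's": 1,
                    "cakery": 1, "cakrad": 1, "cakrab": 1}),
    "taman mer": Counter({"taman meruya": 2}),
}

def repair_tokens(tokens, table, max_n=4, min_prob=0.5):
    """Greedy left-to-right repair that tries the longest n-gram first.

    A replacement is applied only if its relative frequency among all
    observed transformations of that n-gram is at least min_prob
    (an assumed threshold).
    """
    repaired, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n])
            candidates = table.get(key)
            if not candidates:
                continue
            best, count = candidates.most_common(1)[0]
            if count / sum(candidates.values()) >= min_prob:
                repaired.extend(best.split())
                i += n
                break
        else:
            # No n-gram starting here had a confident repair: keep the token.
            repaired.append(tokens[i])
            i += 1
    return repaired

print(repair_tokens("taman mer cak".split(), transform_occurency))
# ['taman', 'meruya', 'cakung']
```

Here "cak" is repaired because "cakung" accounts for 15 of its 20 observed transformations (0.75); a stricter threshold would leave it unchanged.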

2. Customized Named Entity Recognition

2.1 Create BIO Tag

- Treat POI and street as entities, and frame the problem as named entity recognition (NER), i.e. extract the entities (POI and street) from the text (raw_address).
- Construct the train and test data with BIO tags for the custom NER model.
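The tagging step could look roughly like the sketch below. The tag names (`B-POI`, `I-POI`, `B-STREET`, `I-STREET`, `O`), the helper names, and the prefix-matching rule for truncated words are assumptions made for illustration; the actual logic lives in create_train_label.py and may differ.

```python
def find_span(raw_tokens, entity_tokens):
    """Return (start, end) of the first window in raw_tokens where every raw
    token is a prefix of the matching entity token, or None if there is none.
    Prefix matching lets a truncated word such as "pemban" align with
    "pembangunan"."""
    n = len(entity_tokens)
    if n == 0:
        return None
    for start in range(len(raw_tokens) - n + 1):
        window = raw_tokens[start:start + n]
        if all(e.startswith(r) for r, e in zip(window, entity_tokens)):
            return start, start + n
    return None

def bio_tags(raw_address, poi, street):
    """Tokenize on whitespace (commas dropped) and assign one BIO tag per token."""
    tokens = raw_address.replace(",", " ").split()
    tags = ["O"] * len(tokens)
    for label, entity in (("POI", poi), ("STREET", street)):
        span = find_span(tokens, entity.split())
        if span is None:
            continue
        start, end = span
        for i in range(start, end):
            if tags[i] == "O":  # never overwrite an earlier entity
                tags[i] = ("B-" if i == start else "I-") + label
    return tokens, tags

tokens, tags = bio_tags("smk karya pemban, pon", "smk karya pembangunan", "pon")
print(list(zip(tokens, tags)))
# [('smk', 'B-POI'), ('karya', 'I-POI'), ('pemban', 'I-POI'), ('pon', 'B-STREET')]
```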

2.2 Fine-tune IndoBERT

Split the training data into training and validation sets, and keep the test data for the submission. Then generate the BIO tags for the custom NER model:

python3 create_train_label.py # create train and validation data
python3 create_test_label.py # create test data

Fine-tune and evaluate the IndoBERT model to build the custom NER model:

python3 train.py # fine-tune NER model
python3 eval.py # generate csv for submission
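Once the model predicts one tag per word, the tags have to be turned back into the "POI/street" submission string. The sketch below shows one possible post-processing step. It is not taken from eval.py: the function names and tag names are hypothetical, it assumes word-level tags (subword alignment is omitted), and it keeps only the first span of each entity type.

```python
def extract_entity(tokens, tags, label):
    """Return the words of the first contiguous B-/I- span for this label."""
    words = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            if words:  # a second span starts: keep only the first one
                break
            words.append(token)
        elif tag == f"I-{label}" and words:
            words.append(token)
        elif words:    # the current span has ended
            break
    return " ".join(words)

def to_submission(tokens, tags):
    """Join the POI and street predictions with "/" and no surrounding spaces."""
    poi = extract_entity(tokens, tags, "POI")
    street = extract_entity(tokens, tags, "STREET")
    return f"{poi}/{street}"

tokens = ["smk", "karya", "pembangunan", "pon"]
tags = ["B-POI", "I-POI", "I-POI", "B-STREET"]
print(to_submission(tokens, tags))
# smk karya pembangunan/pon
```

An empty POI or street simply yields an empty string on that side of the slash, which matches the "primkob pabri/" and "/jalan mh thamrin" cases in the problem description.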

Preparing Environment

Before replicating the results, prepare the experiment environment. We ran our experiments in Docker, starting from the huggingface/transformers-pytorch-gpu:3.4.0 image. You can pull the image with this command:

docker pull huggingface/transformers-pytorch-gpu:3.4.0

After starting a container from the image, install the required libraries:

bash install.sh

Model Performance

(Epoch 16) TRAIN LOSS:0.0020 ACC:1.00 F1:1.00 REC:1.00 PRE:1.00 LR:0.00000500
(Epoch 16) VALID LOSS:0.1394 ACC:0.98 F1:0.94 REC:0.94 PRE:0.94
save model checkpoint at models/bert-large/32_128_3e-05/
(Epoch 17) TRAIN LOSS:0.0019 ACC:1.00 F1:1.00 REC:1.00 PRE:1.00 LR:0.00000500
(Epoch 17) VALID LOSS:0.1440 ACC:0.98 F1:0.94 REC:0.94 PRE:0.94
save model checkpoint at models/bert-large/32_128_3e-05/

Thanks for reading :) Don't hesitate to contact me at mhilmiasyrofi(at)gmail(dot)com if you need further assistance replicating the results!
