Skip to content

Latest commit

 

History

History
157 lines (102 loc) · 6.65 KB

README.md

File metadata and controls

157 lines (102 loc) · 6.65 KB

MerRec Recommendation Dataset

This repository contains the experiment code for running example model benchmarks and data processing that accompanies the paper MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems. This repository is only an example demonstration of how the MerRec dataset can be used in terms of recommendation tasks, and does not depict or reflect production implementation at Mercari.

Benchmark Execution

Session-based Recommendation (SBR)

In the SBR tasks, the raw data is converted to a processed sequences in the memory itself. We don't need to run pre-processing separately. Below are the commands to run various SBR models on the benchmark data.

NextItNet:

python main.py --task_name=sequence --seed=100 --model_name=nextitnet --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=8 --embedding_size=128 --kernel_size=3 --is_pretrain=1

Bert4Rec:

python main.py --task_name=sequence --seed=100 --model_name=bert4rec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=16 --embedding_size=128 --num_heads=4 --mask_prob=0.3 --is_pretrain=1

GRU4Rec:

python main.py --task_name=sequence --seed=100 --model_name=gru4rec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0005 --hidden_size=64 --block_num=8 --embedding_size=64 --is_pretrain=1

SASRec:

python main.py --task_name=sequence --seed=100 --model_name=sasrec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=64 --block_num=8 --embedding_size=64 --num_heads=4 --is_pretrain=1

Data Preprocessing for CTR and MTL Tasks

In both CTR task and MTL task below, the raw dataset first needs to be transformed.

Based on product_id:

python preprocess_mtl.py --out_path='data/mtl_product.csv' --local_dir_path='data/20230501'

Click-through Rate (CTR) Prediction

Attention FM (AFM):

python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=afm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005

DeepFM:

python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=deepfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005

xDeepFM:

python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=xdeepfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005

DCN:

python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=dcn --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005

DCNv2 (DCNMIX):

python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=dcnmix --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005

NeuralFM (NFM):

python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=nfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005

Wide & Deep:

python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=wdl --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005

Multi-task Learning (MTL) for Recommendation

Only item_view with MMOE:

python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=1

Only item_like with MMOE:

python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=0

2-task ESMM:

python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=esmm --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=2

2-task MMOE:

python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=2

Model Inference Acceleration

Skip-SASRec

python main.py --task_name=inference_acc --seed=5 --model_name=sas4infacc --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=1 --epochs=20 --lr=0.0001 --hidden_size=64 --block_num=8 --embedding_size=64 --num_heads=4 --is_pretrain=1

Skip-NextItNet

python main.py --task_name=inference_acc --seed=5 --model_name=skiprec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=1 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=8 --embedding_size=128 --dilation=1,4 --kernel_size=3 --is_pretrain=1

BibTeX

@misc{li2024merrec,
      title={MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems}, 
      author={Lichi Li and Zainul Abi Din and Zhen Tan and Sam London and Tianlong Chen and Ajay Daptardar},
      year={2024},
      eprint={2402.14230},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

License

How to Contribute

Contributions are welcomed. Please read the CLA carefully before submitting your contribution to Mercari. Under any circumstances, by submitting your contribution, you are deemed to accept and agree to be bound by the terms and conditions of the CLA.

https://www.mercari.com/cla/

Acknowledgement

We would like to thank Guanghu Yuan et al. for their work Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems and making the code publicly available and for the extensive documentation. Many of our experiment implementation centered on product_id in CTR, MTL and SBR tasks derived from this work.