This repository contains the experiment code for the example model benchmarks and data processing that accompany the paper MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems. It is only an example of how the MerRec dataset can be used for recommendation tasks, and does not depict or reflect production implementation at Mercari.
In the SBR tasks, the raw data is converted to processed sequences in memory, so pre-processing does not need to be run separately. Below are the commands to run the SBR models on the benchmark data.
NextItNet:
python main.py --task_name=sequence --seed=100 --model_name=nextitnet --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=8 --embedding_size=128 --kernel_size=3 --is_pretrain=1
Bert4Rec:
python main.py --task_name=sequence --seed=100 --model_name=bert4rec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=16 --embedding_size=128 --num_heads=4 --mask_prob=0.3 --is_pretrain=1
GRU4Rec:
python main.py --task_name=sequence --seed=100 --model_name=gru4rec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0005 --hidden_size=64 --block_num=8 --embedding_size=64 --is_pretrain=1
SASRec:
python main.py --task_name=sequence --seed=100 --model_name=sasrec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=32 --epochs=20 --lr=0.0001 --hidden_size=64 --block_num=8 --embedding_size=64 --num_heads=4 --is_pretrain=1
For both the CTR and MTL tasks below, the raw dataset must first be transformed.
Based on product_id:
python preprocess_mtl.py --out_path='data/mtl_product.csv' --local_dir_path='data/20230501'
Attention FM (AFM):
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=afm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
DeepFM:
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=deepfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
xDeepFM:
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=xdeepfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
DCN:
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=dcn --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
DCNv2 (DCNMIX):
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=dcnmix --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
NeuralFM (NFM):
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=nfm --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
Wide & Deep:
python main_ctr_mtl.py --task_name=ctr --seed=100 --model_name=wdl --data_path='data/mtl_product.csv' --train_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.00005
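The seven CTR commands above differ only in `--model_name`, so they can be generated in a loop. The following is a minimal sketch under that assumption: `build_ctr_command` and the constant names are hypothetical helpers and not part of this repository, and the script only prints the commands (a dry run) rather than launching them.

```python
import shlex

# Model names exactly as passed to --model_name in the commands above.
CTR_MODELS = ["afm", "deepfm", "xdeepfm", "dcn", "dcnmix", "nfm", "wdl"]

# Flags shared by all seven commands. Values are kept as strings so they
# are emitted verbatim (e.g. "0.00005" rather than "5e-05").
SHARED_FLAGS = {
    "data_path": "data/mtl_product.csv",
    "train_batch_size": "4096",
    "test_batch_size": "4096",
    "epochs": "20",
    "lr": "0.00005",
}


def build_ctr_command(model_name):
    """Hypothetical helper: return the argv list for one CTR run."""
    argv = [
        "python",
        "main_ctr_mtl.py",
        "--task_name=ctr",
        "--seed=100",
        f"--model_name={model_name}",
    ]
    argv += [f"--{flag}={value}" for flag, value in SHARED_FLAGS.items()]
    return argv


commands = [build_ctr_command(name) for name in CTR_MODELS]

# Dry run: print each command instead of executing it.
for argv in commands:
    print(shlex.join(argv))
```

To actually launch the runs, each `argv` list could be passed to `subprocess.run(argv, check=True)` from the repository root, after `preprocess_mtl.py` has produced `data/mtl_product.csv`.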
Only item_view with MMOE:
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=1
Only item_like with MMOE:
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=0
2-task ESMM:
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=esmm --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=2
2-task MMOE:
python main_ctr_mtl.py --task_name=mtl --seed=100 --model_name=mmoe --data_path='data/mtl_product.csv' --train_batch_size=4096 --val_batch_size=4096 --test_batch_size=4096 --epochs=20 --lr=0.0001 --embedding_size=32 --mtl_task_num=2
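Judging only from the four MTL commands above, `--mtl_task_num` appears to select the training target: `0` for item_like only, `1` for item_view only, and `2` for both tasks. The sketch below records that reading as a lookup table and rebuilds the same four commands. The mapping is inferred from this README rather than from the source code, and `MTL_TARGETS`, `MTL_RUNS` and `build_mtl_command` are hypothetical names used for illustration.

```python
import shlex

# Assumed meaning of --mtl_task_num, inferred from the commands above.
MTL_TARGETS = {
    "item_like_only": 0,
    "item_view_only": 1,
    "both_tasks": 2,
}

# The (model, target) pairs listed in this README; other combinations
# are not shown here and may not be supported.
MTL_RUNS = [
    ("mmoe", "item_view_only"),
    ("mmoe", "item_like_only"),
    ("esmm", "both_tasks"),
    ("mmoe", "both_tasks"),
]


def build_mtl_command(model_name, target):
    """Hypothetical helper: return the argv list for one MTL run."""
    return [
        "python",
        "main_ctr_mtl.py",
        "--task_name=mtl",
        "--seed=100",
        f"--model_name={model_name}",
        "--data_path=data/mtl_product.csv",
        "--train_batch_size=4096",
        "--val_batch_size=4096",
        "--test_batch_size=4096",
        "--epochs=20",
        "--lr=0.0001",
        "--embedding_size=32",
        f"--mtl_task_num={MTL_TARGETS[target]}",
    ]


commands = [build_mtl_command(model, target) for model, target in MTL_RUNS]

# Dry run: print each command instead of executing it.
for argv in commands:
    print(shlex.join(argv))
```

Keeping the mapping in one place makes it harder to mix up the single-task values, which are easy to swap by accident since `1` selects item_view and `0` selects item_like.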
Skip-SASRec:
python main.py --task_name=inference_acc --seed=5 --model_name=sas4infacc --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=1 --epochs=20 --lr=0.0001 --hidden_size=64 --block_num=8 --embedding_size=64 --num_heads=4 --is_pretrain=1
Skip-NextItNet:
python main.py --task_name=inference_acc --seed=5 --model_name=skiprec --data_path='data/20230501' --train_batch_size=32 --val_batch_size=32 --test_batch_size=1 --epochs=20 --lr=0.0001 --hidden_size=128 --block_num=8 --embedding_size=128 --dilation=1,4 --kernel_size=3 --is_pretrain=1
@misc{li2024merrec,
title={MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems},
author={Lichi Li and Zainul Abi Din and Zhen Tan and Sam London and Tianlong Chen and Ajay Daptardar},
year={2024},
eprint={2402.14230},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
- Codebase: This codebase is licensed under the MIT license.
- Dataset: The MerRec dataset is licensed under CC BY-NC 4.0 International.
Contributions are welcomed. Please read the CLA carefully before submitting your contribution to Mercari. Under any circumstances, by submitting your contribution, you are deemed to accept and agree to be bound by the terms and conditions of the CLA.
We would like to thank Guanghu Yuan et al. for their work Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems, for making the code publicly available, and for the extensive documentation. Much of our experiment implementation centered on product_id in the CTR, MTL and SBR tasks is derived from this work.