Please download the processed datasets from Google Drive or Baidu Netdisk (password: 3cml), and move them here.
dataset/
pretrain/
FHCKM/
downstream/
Scientific/
Pantry/
Instruments/
Arts/
Office/
OR/ # Online Retail
If you have downloaded the processed datasets, you can directly use them for reproduction and further experiments.
If you want to know the details of data preprocessing, please see the instructions below.
[Important!!!] Note that due to randomness, the datasets you process yourself may not be exactly the same as those we released. Because some items are reviewed at the same timestamp, these items can appear in an arbitrary order in the item sequences after chronological sorting.
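The tie-ordering issue can be illustrated with a small sketch. This is not the logic of the released preprocessing scripts; `build_sequences` is a hypothetical helper that breaks timestamp ties by item ID so that the output no longer depends on input order. Applying such a tie-break would make your own runs repeatable, but it would not necessarily reproduce the released files.

```python
from collections import defaultdict


def build_sequences(interactions):
    """Group (user, item, timestamp) rows into per-user item sequences.

    Sorting by (timestamp, item) instead of timestamp alone breaks ties
    deterministically, so the result does not depend on the input order.
    """
    per_user = defaultdict(list)
    for user, item, ts in interactions:
        per_user[user].append((ts, item))
    return {
        user: [item for _, item in sorted(pairs)]
        for user, pairs in per_user.items()
    }


# Items "A" and "B" share timestamp 100 (toy data, not from the datasets).
rows = [("u1", "B", 100), ("u1", "A", 100), ("u1", "C", 50)]
print(build_sequences(rows))        # {'u1': ['C', 'A', 'B']}
print(build_sequences(rows[::-1]))  # {'u1': ['C', 'A', 'B']}
```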
Please download the raw datasets from the original website.
For the metadata, please click the metadata link of each category in the table "Complete review data" from https://nijianmo.github.io/amazon/index.html.
For the rating data, please click the ratings only link of each category in the table "Small subsets for experimentation" from https://nijianmo.github.io/amazon/index.html#subsets.
Here we take Pantry as an example.
dataset/
raw/
Metadata/
meta_Prime_Pantry.json.gz
Ratings/
Prime_Pantry.csv
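Before running the preprocessing script, it may help to confirm that the raw files sit where the layout above expects them. The following is a minimal sketch; `expected_raw_files` and `missing_raw_files` are hypothetical helpers, and the `meta_<category>.json.gz` / `<category>.csv` naming pattern is generalized from the Pantry example, so it is an assumption for other categories.

```python
from pathlib import Path


def expected_raw_files(raw_dir, category):
    """Return the two raw file paths implied by the layout above.

    Assumption: other categories follow the same naming pattern as
    Prime_Pantry (meta_<category>.json.gz and <category>.csv).
    """
    raw_dir = Path(raw_dir)
    return [
        raw_dir / "Metadata" / f"meta_{category}.json.gz",
        raw_dir / "Ratings" / f"{category}.csv",
    ]


def missing_raw_files(raw_dir, category):
    """List the expected raw files that are not present on disk."""
    return [p for p in expected_raw_files(raw_dir, category) if not p.is_file()]
```

For example, `missing_raw_files("dataset/raw", "Prime_Pantry")` returns an empty list once both files are in place.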
cd dataset/preprocessing/
python process_amazon.py --dataset Pantry
# cd dataset/preprocessing/
for ds in Food Home CDs Kindle Movies
do
python process_amazon.py --dataset ${ds} --output_path ../pretrain/ --word_drop_ratio 0.15
done
python to_pretrain_atomic_files.py
path=`pwd`
for ds in Food Home CDs Kindle Movies
do
ln -s ${path}/../pretrain/${ds}/${ds}.feat1CLS ../pretrain/FHCKM/
ln -s ${path}/../pretrain/${ds}/${ds}.feat2CLS ../pretrain/FHCKM/
done
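If a shell is not available, the linking loop above can be approximated in Python. This is a sketch rather than part of the repository: `link_features` is a hypothetical helper, and unlike `ln -s` it replaces an existing link instead of failing, which is a design choice made here so that the step can be re-run.

```python
from pathlib import Path

DATASETS = ["Food", "Home", "CDs", "Kindle", "Movies"]
SUFFIXES = ["feat1CLS", "feat2CLS"]


def link_features(pretrain_dir, datasets=DATASETS, target="FHCKM"):
    """Symlink <ds>/<ds>.feat1CLS and <ds>/<ds>.feat2CLS into <target>/."""
    pretrain_dir = Path(pretrain_dir).resolve()
    out_dir = pretrain_dir / target
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for ds in datasets:
        for suffix in SUFFIXES:
            src = pretrain_dir / ds / f"{ds}.{suffix}"
            if not src.is_file():
                raise FileNotFoundError(src)
            dst = out_dir / src.name
            # Replace a stale link or file so the step is repeatable.
            if dst.is_symlink() or dst.exists():
                dst.unlink()
            dst.symlink_to(src)
            created.append(dst)
    return created
```

Run from `dataset/preprocessing/`, the call `link_features("../pretrain")` would mirror the shell loop.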
Please download the raw datasets from Kaggle [link] and save archive.zip into dataset/raw/.
Unzip and convert it to UTF-8.
mv archive.zip dataset/raw/
cd dataset/raw/
unzip archive.zip
iconv -f latin1 -t utf-8 data.csv > data-utf8.csv
cd dataset/preprocessing/
python process_or.py
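On systems without `iconv`, the Latin-1 to UTF-8 conversion step above can be sketched in Python. `convert_latin1_to_utf8` is a hypothetical helper, not part of the repository; it streams the file in chunks and leaves line endings untouched.

```python
def convert_latin1_to_utf8(src, dst, chunk_size=1 << 20):
    """Re-encode a Latin-1 text file as UTF-8, chunk by chunk.

    newline="" disables newline translation on both sides, so the
    original line endings are preserved.
    """
    with open(src, "r", encoding="latin-1", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
```

Called from `dataset/raw/`, `convert_latin1_to_utf8("data.csv", "data-utf8.csv")` would produce the same file name as the `iconv` command.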
You may find some files useful for your research, including:
- Clean item text (*.text);
- Index mapping between raw IDs and remapped IDs (*.user2index, *.item2index).
For downstream datasets, the corresponding files are included in downstream-datasets.zip; you will find them once you unzip it.
For pre-trained datasets, the corresponding files are stored in raw-datasets-for-pretrain.zip.
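To use the index mapping files in your own code, a loader along the following lines may be enough. The file format is an assumption: this sketch expects one `raw_id<TAB>index` pair per line, which should be checked against the actual `*.user2index` / `*.item2index` files before relying on it. `load_index_map` is a hypothetical helper.

```python
def load_index_map(path, sep="\t"):
    """Read a raw-ID -> remapped-index mapping.

    Assumption: each non-empty line is "<raw_id><sep><index>" with an
    integer index; adjust `sep` if the real files differ.
    """
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line:
                continue
            raw_id, _, idx = line.rpartition(sep)
            if not raw_id:
                raise ValueError(f"line {line_no}: expected 2 fields, got {line!r}")
            mapping[raw_id] = int(idx)
    return mapping
```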