Notice: This repository is far from complete. All code provided here is shared as-is and will change often; I do not promise any particular performance or an exact reproduction.
This project collects the code I have reproduced and the paper links for clustering methods from the past years. It also contains many publicly published methods whose authors have shared their code.
In recent years, clustering, as a representative unsupervised task, has received great attention from researchers. Many related works have appeared and achieved significant success. For most works the authors released code for others to use; however, some of it is incomplete, and the running environments and code frameworks differ widely or have fallen out of fashion. So I decided to collect the papers as well as the code and bring them together.
If you want to use the code provided by this repository, the first thing to do is pick a suitable location and clone the repository from GitHub. The commands you need are:
git clone git@github.com:Mr-SGXXX/Clustering.git
cd Clustering
After downloading the repository, you need to set up a proper Python environment. I advise using conda, which can easily build a clean environment without affecting your other projects' settings. You can download Anaconda here, but I suggest installing Miniconda following this site. To create a new conda environment, use the following commands:
conda create -n clustering python=3.9
conda activate clustering
To run the code, you need to install the packages this repository depends on:
pip install -r requirements.txt
After preparing the running environment, you can choose which dataset to cluster and which method to use in the config files. For the chosen dataset or method, you can also adjust the hyper-parameter settings in the config file, along the lines of the sketch below.
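As an illustration only, a config might look like the following sketch; the section and option names below are hypothetical, so check the example config files shipped with this repository for the real ones:

```ini
; hypothetical example -- check the config files in this repository
; for the real section and option names
[global]
method = DEC        ; which method to run
dataset = MNIST     ; which dataset to cluster on

[DEC]
n_clusters = 10     ; hyper-parameters of the chosen method
batch_size = 256
learning_rate = 0.001
```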
Run the experiment you set up in the config file with the following bash command:
python main.py --config_path /path/to/config.cfg --device 0
or just:
python main.py -cp /path/to/config.cfg -d 0
Then just wait, or do anything else you want.
After the experiment, there will be a log file containing the messages collected during the run, as well as a set of figures generated from the features and scores of the experiment. If you abort too many experiments, you can use the following bash script to remove the useless logs.
bash ./clean_log.sh /path/to/log /path/to/figures
Or when you use the default log and figure path, you can use:
bash ./clean_log.sh
Notice: Do Not Use This Script While Any Experiment Is Running.
If you want to add a new method or dataset to this repository, first look at the `__init__.py` file of the method package (divided into classical methods and deep methods) or of the dataset package. Then you can design your method there and benefit from the existing pipeline; a sketch of what this might look like follows.
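As a hedged sketch only (the class and registration step below are hypothetical; the repository's actual pipeline interface may differ), adding a new deep method could look like:

```python
# my_method.py -- a hypothetical new deep method. The registration note at the
# bottom is only illustrative; check the deep-method package's __init__.py for
# how methods are actually exposed to the pipeline.
import torch.nn as nn

class MyMethod(nn.Module):
    """A toy encoder standing in for a real clustering method."""
    def __init__(self, input_dim: int, n_clusters: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_clusters),
        )

    def forward(self, x):
        return self.encoder(x)

# In the deep-method package's __init__.py, expose the class so the pipeline
# can look it up by the name written in the config file, e.g.:
# from .my_method import MyMethod
```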
Methods that do not use deep learning are included in this part.
- SSC-OMP (CVPR 2016) | Reference Code
- EDSC (WACV 2014) | Reference Code
- KSSC (ICIP 2014) | Reference Code
- LRSC (Pattern Recognition Letters 2014) | Reference Code
- LRR (ICML 2010) | Reference Code
- SSC (CVPR 2009) | Reference Code
- DBSCAN (KDD 1996) | Reference Code
- KMeans
- Spectral Clustering
Methods using deep learning are included in this part. Notice that multi-view clustering methods and GNN-based clustering methods are not included here.
- DMICC (AAAI 2023) | Reference Code
- DivClust (CVPR 2023) | Reference Code
- SPICE (TIP 2022) | Reference Code
- ProPos (TPAMI 2022) | Reference Code
- DECCS (ICDM 2022) | Reference Code
- DeepDPM (CVPR 2022) | Reference Code
- EDESC (CVPR 2022) | Reference Code | My Implementation
In the code provided by the authors, a pretrained weight for Reuters10K is included; with it, one can sometimes obtain a result on Reuters10K no lower than the paper's. However, pretraining from scratch following the settings in the paper, instead of using that pretrained weight, rarely yields a score as good as it should be; the score is instead similar to what this repository reports. Besides, the result is not stable.
- VaDeSC (ICLR 2022) | Reference Code
- C3-GAN (ICLR 2022) | Reference Code
- HC-MGAN (AAAI 2022) | Reference Code
- MFCVAE (NIPS 2021) | Reference Code
- CLD (CVPR 2021) | Reference Code
- NNM (CVPR 2021) | Reference Code
- DLRRPD (CVPR 2021) | Reference Code
- RUC (CVPR 2021) | Reference Code
- SENet (CVPR 2021) | Reference Code
- IDFD (ICLR 2021) | Reference Code
- MiCE (ICLR 2021) | Reference Code
- ACe/Dec (IJCAI 2021) | Reference Code
- DipDECK (KDD 2021) | Reference Code
- CC (AAAI 2021) | Reference Code
- DFCN (AAAI 2021) | Reference Code
- SCCL (NAACL 2021) | Reference Code
- PSSC (TIP 2021) | Reference Code
- SCAN (ECCV 2020) | Reference Code
- EMRC (AAAI 2020) | Reference Code
- PICA (CVPR 2020) | Reference Code
- IIC (ICCV 2019) | Reference Code
- DEC-DA (ACML 2018) | Reference Code
- DeepCluster (ECCV 2018) | Reference Code | My Implementation
This method is designed for clustering on large datasets like ImageNet and does not work well on small datasets. In the official implementation, the authors provide detailed scripts for the experiments in the paper, which include using conv features of different levels to train a logistic regression and using these features for object detection. This repository does not offer those parts; it only gives the clustering result obtained by running classical clustering on the fc features, also called the last-epoch cluster assignments, as sketched below.
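For illustration, the "classical clustering on the fc features" step amounts to something like the sketch below; `extract_fc_features` is a hypothetical helper, not this repository's actual interface:

```python
# Minimal sketch: k-means on the trained network's fc-layer features gives the
# final (last-epoch) cluster assignments. `extract_fc_features` is hypothetical.
import numpy as np
from sklearn.cluster import KMeans

def last_epoch_assignments(extract_fc_features, dataloader, n_clusters):
    feats = np.concatenate([extract_fc_features(batch) for batch in dataloader])
    return KMeans(n_clusters=n_clusters, n_init=20).fit_predict(feats)
```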
- SpectralNet (ICLR 2018) | Reference Code
- DSC-Nets (NIPS 2017) | Reference Code
- DEPICT (ICCV 2017) | Reference Code
- IDEC (IJCAI 2017) | Reference Code | My Implementation
In this method, most of the code is the same as DEC, except for the clustering process. Instead of using only the KL loss, IDEC adds a reconstruction loss during clustering; a minimal sketch of the combined objective follows. Because IDEC uses the same pretraining process as DEC, to save time it directly reuses the DEC pretrained weights.
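A minimal sketch of the combined IDEC objective (reconstruction loss plus the gamma-weighted KL clustering loss from DEC; the exact tensors and weighting in this repository may differ):

```python
import torch.nn.functional as F

def idec_loss(x, x_recon, q, p, gamma=0.1):
    """IDEC objective: autoencoder reconstruction plus DEC's KL clustering loss.

    q: soft cluster assignments from the Student's-t kernel (as in DEC);
    p: the sharpened target distribution derived from q.
    """
    recon = F.mse_loss(x_recon, x)                    # reconstruction term
    kl = F.kl_div(q.log(), p, reduction="batchmean")  # clustering term
    return recon + gamma * kl
```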
- VaDE (IJCAI 2017) | Reference Code
- DCN (ICML 2017) | Reference Code
- DEC (ICML 2016) | Reference Code | My Implementation
In this method, the pretraining process is the most important part: whether the features are learned well during pretraining directly determines whether the result is good. With a greedy layer-wise pretraining reproduced following the DEC paper (sketched below), the pretrained weights are more likely to be good, and with them DEC is more likely to gain a good score. Though the best score over many experiments is no lower than the score in the paper, the method is still not stable: scores of multiple runs differ greatly.
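A hedged sketch of the greedy layer-wise pretraining idea: each encoder/decoder pair is trained as a denoising autoencoder on the output of the already-trained layers below it (dimensions, corruption, and optimizer settings here are illustrative, not the exact DEC-paper values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_layerwise(dims, data, epochs_per_layer=50, lr=0.1, corruption=0.2):
    """Greedily pretrain a stack of denoising autoencoder layers.

    dims: layer sizes, e.g. [784, 500, 500, 2000, 10] as in the DEC paper.
    data: a (n_samples, dims[0]) tensor; trained full-batch for simplicity.
    """
    encoders, inputs = [], data
    for in_dim, out_dim in zip(dims[:-1], dims[1:]):
        enc, dec = nn.Linear(in_dim, out_dim), nn.Linear(out_dim, in_dim)
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs_per_layer):
            noisy = F.dropout(inputs, p=corruption)   # denoising corruption
            loss = F.mse_loss(dec(torch.relu(enc(noisy))), inputs)
            opt.zero_grad(); loss.backward(); opt.step()
        encoders.append(enc)
        with torch.no_grad():
            inputs = torch.relu(enc(inputs))          # features for the next layer
    return encoders                                   # stack these as the encoder
```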
- MNIST
- Fashion MNIST
- CIFAR-10
- CIFAR-100
- STL-10
- Reuters-10K:
Notice: The Reuters-10K used here is most likely the same as the one used in DEC, generated by randomly selecting 10000 samples from the original Reuters corpus of 685071 samples (see the sketch below). Because the download URL for the original Reuters dataset in the DEC repository is no longer available, an experiment on the full dataset is not possible for now.
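For clarity, the subset was presumably produced by something like the following sketch; `x_full` and `y_full` are hypothetical arrays holding the full corpus, and DEC's actual seed is unknown:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; DEC's actual seed is unknown
idx = rng.choice(685071, size=10000, replace=False)  # 10000 of the 685071 samples
x_10k, y_10k = x_full[idx], y_full[idx]  # x_full / y_full: hypothetical full arrays
```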
All experimental results you see in this repository were obtained with the code provided here. Due to factors such as the experimental environment and parameter settings, these results may differ slightly or greatly from those in the original papers. I strive to ensure the accuracy of the results, but I cannot guarantee exact correspondence with the original papers.
Possible reasons for the differences, from my personal view:
- Clustering is usually not stable; differences in initialization can cause significant differences in results.
- Not all methods were originally implemented in PyTorch, and different PyTorch versions may also cause differences. This repository may implement a method in a different way.
- Different hardware devices may produce slightly different results because of slightly different calculation processes.
- The results of some methods strictly depend on weights from an excellent but rare pretraining run, which does not occur every time, so the scores easily end up lower than what the authors claimed.
- Some methods do not offer the hyper-parameter settings they used for every dataset; for these methods we use the default hyper-parameters offered in their code or paper.
- The public code of some methods cannot run correctly because of bugs or outdated APIs. Though we try to fix these errors, the fixes may cause some differences in the results.
- Some methods unfairly pick the best epoch according to clustering evaluation metrics (ACC, ARI, etc.) during the clustering process, which requires ground-truth information. (Early stopping does not mean you may use such an unfair setting.)
- There may be bugs in this repository that influence the scores of some methods. If you find any bug, feel free to raise an issue or contact me by email.
The hardware environment accessible to me is as follows:
- CPU: Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz
- GPU: NVIDIA RTX 2080 Ti
- Memory Size: 64GB
Each method is run several times on each dataset for fairness. For deep methods, only the result of the last epoch, or a result chosen in a way that needs no ground truth, is used. The highest score as well as the mean and std are shown in the table in the format "max, mean(std)". The running time of the deep methods includes both pretraining time and clustering time.
Method | Test Times | ACC | NMI | ARI |
---|---|---|---|---|
EDESC | 16 | 0.7632, 0.6978(0.0575) | 0.5849, 0.4686(0.0591) | 0.5927, 0.4826(0.0730) |
DEC | 16 | 0.7366, 0.6440(0.0456) | 0.4879, 0.4228(0.0417) | 0.4591, 0.3936(0.0452) |
Spectral Clustering | 8 | 0.4441, 0.4441(0.0000) | 0.0905, 0.0905(0.0000) | 0.0175, 0.0175(0.0000) |
KMeans | 16 | 0.5622, 0.5301(0.0162) | 0.3549, 0.3243(0.0195) | 0.2655, 0.2211(0.0190) |
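For reference, NMI and ARI come directly from scikit-learn, while clustering accuracy (ACC) requires matching predicted clusters to ground-truth labels with the Hungarian algorithm. A minimal sketch, not necessarily identical to this repository's evaluation code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    """Best one-to-one match between cluster ids and labels (Hungarian algorithm)."""
    n = int(max(y_pred.max(), y_true.max())) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # count co-occurrences
    row, col = linear_sum_assignment(cost.max() - cost)  # maximize matches
    return cost[row, col].sum() / y_pred.size

# nmi = normalized_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
```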
In the end, I would like to express my gratitude to all researchers in the field of clustering and the entire AI community for their contributions, and to thank them for their willingness to open-source their code.
In addition, thanks to GitHub for the Copilot assistant, which greatly improved my efficiency.
My email is [email protected]. If you have any questions or advice, please contact me.