# Revealing the Dark Secrets of Masked Image Modeling (Depth Estimation) [Paper]
**NYU Depth V2**

| Backbone | d1 | d2 | d3 | abs_rel | rmse | rmse_log |
|---|---|---|---|---|---|---|
| Swin-v2-Base | 0.935 | 0.991 | 0.998 | 0.044 | 0.304 | 0.109 |
| Swin-v2-Large | 0.949 | 0.994 | 0.999 | 0.036 | 0.287 | 0.102 |

**KITTI**

| Backbone | d1 | d2 | d3 | abs_rel | rmse | rmse_log |
|---|---|---|---|---|---|---|
| Swin-v2-Base | 0.976 | 0.998 | 0.999 | 0.052 | 2.050 | 0.078 |
| Swin-v2-Large | 0.977 | 0.998 | 1.000 | 0.050 | 1.966 | 0.075 |
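For reference, d1/d2/d3 are the standard threshold accuracies (δ < 1.25, 1.25², 1.25³; higher is better), and abs_rel, rmse, and rmse_log are the usual error measures (lower is better), with the conventional definitions from the depth-estimation literature:

```math
\begin{aligned}
\delta_i &= \frac{1}{N}\,\bigl|\{\, p : \max(\hat d_p / d_p,\ d_p / \hat d_p) < 1.25^i \,\}\bigr| \\
\mathrm{AbsRel} &= \frac{1}{N} \sum_p \frac{|\hat d_p - d_p|}{d_p} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{N} \sum_p (\hat d_p - d_p)^2} \\
\mathrm{RMSE}_{\log} &= \sqrt{\frac{1}{N} \sum_p (\log \hat d_p - \log d_p)^2}
\end{aligned}
```

where \(\hat d_p\) and \(d_p\) are the predicted and ground-truth depths at pixel \(p\), over the \(N\) valid pixels.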
Please refer to [GLPDepth] for instructions on configuring the environment and preparing the NYU Depth V2 and KITTI datasets. Pretrained backbones and our trained depth models can be downloaded from the model zoo (OneDrive).
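All commands below pass `--data_path ../data/`. As a rough orientation, a layout following GLPDepth's conventions might look like the sketch below; the directory names are illustrative assumptions, so check the GLPDepth instructions for the exact structure.

```
../data/
├── nyu_depth_v2/
│   ├── sync/              # NYU training pairs (name assumed from GLPDepth)
│   └── official_splits/
│       └── test/          # NYU test split
└── kitti/                 # KITTI raw sequences + annotated depth maps
```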
- Train on NYU Depth V2 with Swin-v2-Base
$ python3 train.py --dataset nyudepthv2 --data_path ../data/ --max_depth 10.0 --max_depth_eval 10.0 --backbone swin_base_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 30 30 30 15 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 2 --pretrained weights/swin_v2_base_simmim.pth --save_model --crop_h 480 --crop_w 480 --layer_decay 0.9 --drop_path_rate 0.3 --log_dir logs/
- Train on NYU Depth V2 with Swin-v2-Large
$ python3 train.py --dataset nyudepthv2 --data_path ../data/ --max_depth 10.0 --max_depth_eval 10.0 --backbone swin_large_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 30 30 30 15 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 2 --pretrained weights/swin_v2_large_simmim.pth --save_model --crop_h 480 --crop_w 480 --layer_decay 0.85 --drop_path_rate 0.5 --log_dir logs/
- Train on KITTI with Swin-v2-Base
$ python3 train.py --dataset kitti --kitti_crop garg_crop --data_path ../data/ --max_depth 80.0 --max_depth_eval 80.0 --backbone swin_base_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 22 22 22 11 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 16 --pretrained weights/swin_v2_base_simmim.pth --save_model --crop_h 352 --crop_w 352 --layer_decay 0.9 --drop_path_rate 0.3 --log_dir logs/
- Train on KITTI with Swin-v2-Large
$ python3 train.py --dataset kitti --kitti_crop garg_crop --data_path ../data/ --max_depth 80.0 --max_depth_eval 80.0 --backbone swin_large_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 22 22 22 11 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 16 --pretrained weights/swin_v2_large_simmim.pth --save_model --crop_h 352 --crop_w 352 --layer_decay 0.85 --drop_path_rate 0.5 --log_dir logs/
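Note that within each dataset the Base and Large recipes differ only in `--backbone`, the SimMIM pretrained weights, `--layer_decay` (0.9 vs. 0.85), and `--drop_path_rate` (0.3 vs. 0.5). Between datasets, `--max_depth`, the crop size, `--window_size`, and `--shift_size` change as well, so the evaluation commands below must reuse the matching per-dataset flags.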
- Evaluate on NYU Depth V2 with Swin-v2-Base
$ python3 test.py --dataset nyudepthv2 --data_path ../data/ --max_depth 10.0 --max_depth_eval 10.0 --backbone swin_base_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 30 30 30 15 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 2 --do_evaluate --ckpt_dir ckpt/nyudepthv2_swin_base.ckpt
- Evaluate on NYU Depth V2 with Swin-v2-Large
$ python3 test.py --dataset nyudepthv2 --data_path ../data/ --max_depth 10.0 --max_depth_eval 10.0 --backbone swin_large_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 30 30 30 15 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 2 --do_evaluate --ckpt_dir ckpt/nyudepthv2_swin_large.ckpt
- Evaluate on KITTI with Swin-v2-Base
$ python3 test.py --dataset kitti --kitti_crop garg_crop --data_path ../data/ --max_depth 80.0 --max_depth_eval 80.0 --backbone swin_base_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 22 22 22 11 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 16 --do_evaluate --ckpt_dir ckpt/kitti_swin_base.ckpt
- Evaluate on KITTI with Swin-v2-Large
$ python3 test.py --dataset kitti --kitti_crop garg_crop --data_path ../data/ --max_depth 80.0 --max_depth_eval 80.0 --backbone swin_large_v2 --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 --window_size 22 22 22 11 --pretrain_window_size 12 12 12 6 --use_shift True True False False --flip_test --shift_window_test --shift_size 16 --do_evaluate --ckpt_dir ckpt/kitti_swin_large.ckpt
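To evaluate all four checkpoints in one go, a minimal convenience sketch is shown below. It simply replays the four `test.py` invocations above; the checkpoint filenames under `ckpt/` are assumed to match the model zoo naming, so adjust paths if yours differ.

```bash
#!/usr/bin/env bash
# Sketch: evaluate both NYU checkpoints, then both KITTI checkpoints.
set -e

# Shared per-dataset flags, copied verbatim from the commands above.
NYU_FLAGS="--dataset nyudepthv2 --data_path ../data/ --max_depth 10.0 --max_depth_eval 10.0 \
  --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 \
  --window_size 30 30 30 15 --pretrain_window_size 12 12 12 6 \
  --use_shift True True False False --flip_test --shift_window_test --shift_size 2 --do_evaluate"

KITTI_FLAGS="--dataset kitti --kitti_crop garg_crop --data_path ../data/ --max_depth 80.0 --max_depth_eval 80.0 \
  --depths 2 2 18 2 --num_filters 32 32 32 --deconv_kernels 2 2 2 \
  --window_size 22 22 22 11 --pretrain_window_size 12 12 12 6 \
  --use_shift True True False False --flip_test --shift_window_test --shift_size 16 --do_evaluate"

for size in base large; do
  python3 test.py $NYU_FLAGS --backbone swin_${size}_v2 --ckpt_dir ckpt/nyudepthv2_swin_${size}.ckpt
  python3 test.py $KITTI_FLAGS --backbone swin_${size}_v2 --ckpt_dir ckpt/kitti_swin_${size}.ckpt
done
```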
@article{xie2023darkmim,
title={Revealing the Dark Secrets of Masked Image Modeling},
author={Zhenda Xie and Zigang Geng and Jingcheng Hu and Zheng Zhang and Han Hu and Yue Cao},
journal={arXiv preprint arXiv:2205.13543},
year={2022}
}
Our code is mainly based on GLPDepth [1]. The model code comes from Swin Transformer [2] and Simple Baselines [3].
[1] Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth. [code]
[2] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. [code]
[3] Simple Baselines for Human Pose Estimation and Tracking. [code]