
program frozen #56

Open
dlnewbie-h opened this issue Feb 5, 2021 · 11 comments

Comments

@dlnewbie-h

Dear authors

Your work is very exciting and I want to try out your code. I followed the instructions in the README and am trying to run this example:

CUDA_VISIBLE_DEVICES=0 python fixmatch.py --filters=32 --dataset=cifar10.3@40-1 --train_dir ./experiments/fixmatch

However, my program gets stuck at self.train_step forever...

I did install the required environment as you pointed out in the README.

Do you have any idea what's going on?


@dlnewbie-h
Author

After I send a keyboard interrupt, it gives the following message:

KeyboardInterrupt
Traceback (most recent call last):
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/queues.py", line 351, in get
with self._rlock:
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
KeyboardInterrupt
^CProcess ForkPoolWorker-868:
Traceback (most recent call last):
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/queues.py", line 351, in get
with self._rlock:
File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
KeyboardInterrupt
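For what it's worth, these tracebacks only show the multiprocessing pool workers sitting idle in queue.get, so they don't reveal where the main process itself is stuck. One way to find out (just a sketch, assuming a couple of lines can be added near the top of the launched script, e.g. fixmatch.py) is to register a faulthandler signal handler and send the hung process SIGUSR1:

    # Dump the stack of every thread in the process on demand, without
    # killing it (Unix only). The placement near the top of the launched
    # script is an assumption.
    import faulthandler
    import signal

    # After this, `kill -USR1 <pid>` prints all thread tracebacks to stderr.
    faulthandler.register(signal.SIGUSR1)

An external tool such as py-spy (py-spy dump --pid <pid>) gives the same kind of snapshot without editing any code.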

@carlini
Collaborator

carlini commented Feb 7, 2021

Is the GPU active when you're running the training? Or is that stalling too?
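One way to check, assuming nvidia-smi from the NVIDIA driver is on the PATH, is to poll it from a small script while the training command runs in another terminal (just a sketch):

    # Poll GPU utilization and memory every few seconds; a stalled run
    # usually shows ~0% utilization and no memory growth.
    import subprocess
    import time

    while True:
        out = subprocess.check_output(
            ['nvidia-smi',
             '--query-gpu=index,utilization.gpu,memory.used',
             '--format=csv,noheader'])
        print(out.decode().strip())
        time.sleep(5)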

@dlnewbie-h
Author

Is the GPU active when you're running the training? Or is that stalling too?

The GPU is stalling too... It's a Tesla V100, the CUDA version is 11.0, and the driver version is 450.80.02.
(By the way, I installed the specified requirements using conda; could this be causing the issue?)
Thank you
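One thing that might help narrow it down (a sketch, assuming the GPU build of TensorFlow 1.14 from the conda environment): check whether that TensorFlow build can see the GPU and run a single op on it at all, independently of the fixmatch code.

    # Smoke test for the TF 1.14 GPU build: list the visible devices and
    # run one matmul pinned to the GPU. If this also hangs or errors out,
    # the problem is in the environment rather than in fixmatch.
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    print([d.name for d in device_lib.list_local_devices()])

    with tf.device('/GPU:0'):
        a = tf.random.normal([1024, 1024])
        b = tf.random.normal([1024, 1024])
        c = tf.reduce_sum(tf.matmul(a, b))

    with tf.Session() as sess:
        print(sess.run(c))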

@carlini
Collaborator

carlini commented Mar 11, 2021

I just wanted to follow up to see if you worked anything out. I don't know if I have any ideas for what could be causing this with our code, but maybe you found a problem?

@dlnewbie-h
Author

Hi Carlini

Thank you for following up on the issue. I'm really having trouble figuring out what's wrong, which prevents me from using the code.

I first suspected it was an environment issue; here is the environment I'm using
(I think it matches your specified requirements? I tried the code on both a Tesla A100 and a P100, and it still doesn't work). It would be really helpful if you could help me with this.

This file may be used to create an environment using:

$ conda create --name <env> --file <this file>

platform: linux-64

_libgcc_mutex=0.1=main
_tflow_select=2.1.0=gpu
absl-py=0.11.0=pyhd3eb1b0_1
astor=0.8.1=py37h06a4308_0
blas=1.0=mkl
c-ares=1.17.1=h27cfd23_0
ca-certificates=2021.1.19=h06a4308_0
certifi=2020.12.5=py37h06a4308_0
cudatoolkit=10.1.243=h6bb024c_0
cudnn=7.6.5=cuda10.1_0
cupti=10.1.168=0
cycler=0.10.0=pypi_0
cython=0.29.21=py37h2531618_0
easydict=1.9=pypi_0
gast=0.4.0=py_0
google-pasta=0.2.0=py_0
grpcio=1.31.0=py37hf8bcb03_0
h5py=2.10.0=py37hd6299e0_1
hdf5=1.10.6=hb1b8bf9_0
importlib-metadata=2.0.0=py_1
intel-openmp=2020.2=254
keras-applications=1.0.8=py_1
keras-preprocessing=1.1.2=pyhd3eb1b0_0
kiwisolver=1.3.1=pypi_0
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20191231=h14c3975_1
libffi=3.3=he6710b0_2
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libprotobuf=3.14.0=h8c45485_0
libstdcxx-ng=9.1.0=hdf63c60_0
markdown=3.3.3=py37h06a4308_0
matplotlib=3.3.4=pypi_0
mkl=2020.2=256
mkl-service=2.3.0=py37he8ac12f_0
mkl_fft=1.2.0=py37h23d657b_0
mkl_random=1.1.1=py37h0573a6f_0
ncurses=6.2=he6710b0_1
numpy=1.19.2=py37h54aff64_0
numpy-base=1.19.2=py37hfa32c7d_0
opencv-python=4.5.1.48=pypi_0
openssl=1.1.1i=h27cfd23_0
pandas=1.2.1=py37ha9443f7_0
pillow=8.1.0=pypi_0
pip=20.3.3=py37h06a4308_0
protobuf=3.14.0=py37h2531618_1
pyparsing=2.4.7=pypi_0
python=3.7.9=h7579374_0
python-dateutil=2.8.1=pyhd3eb1b0_0
pytz=2021.1=pyhd3eb1b0_0
readline=8.1=h27cfd23_0
scipy=1.6.0=py37h91f5cce_0
setuptools=52.0.0=py37h06a4308_0
six=1.15.0=py37h06a4308_0
sqlite=3.33.0=h62c20be_0
tensorboard=1.14.0=py37hf484d3e_0
tensorflow=1.14.0=gpu_py37h74c33d7_0
tensorflow-base=1.14.0=gpu_py37he45bfe2_0
tensorflow-estimator=1.14.0=py_0
tensorflow-gpu=1.14.0=h0d30ee6_0
termcolor=1.1.0=py37_1
tk=8.6.10=hbc83047_0
tqdm=4.56.2=pypi_0
werkzeug=1.0.1=pyhd3eb1b0_0
wheel=0.36.2=pyhd3eb1b0_0
wrapt=1.12.1=py37h7b6447c_1
xz=5.2.5=h7b6447c_0
zipp=3.4.0=pyhd3eb1b0_0
zlib=1.2.11=h7b6447c_3

@carlini
Collaborator

carlini commented Mar 11, 2021

Huh. Two ideas maybe:

  1. Does CPU-only training work? It would be really slow, but it should at least not stall. (See the sketch after this list for one way to force a CPU-only run.)

  2. If you try to train a MixMatch model with the MixMatch codebase (https://github.com/google-research/mixmatch), does that work? It's very similar code, and this might help isolate whether the problem is with fixmatch or with the general environment.
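On the first suggestion, a minimal way to force a CPU-only run without building a second environment (my sketch; it assumes that hiding the GPUs from the GPU-enabled TensorFlow build is enough for this purpose):

    # Hide all GPUs from TensorFlow before it is imported, so the same
    # GPU-enabled conda environment runs on CPU only.
    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # must be set before importing TF

    import tensorflow as tf

    # Sanity check: no GPU should be visible and this tiny graph runs on CPU.
    print(tf.test.is_gpu_available())                    # expect False
    with tf.Session() as sess:
        print(sess.run(tf.reduce_sum(tf.ones([2, 2]))))  # expect 4.0

Setting CUDA_VISIBLE_DEVICES to an empty string before launching the original command should have the same effect.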

@dlnewbie-h
Author

I followed your suggestions, which gave me some useful insights:

  1. The fixmatch code does work with CPU-only training (using another environment that has the CPU version of TensorFlow).
  2. I also tried the MixMatch codebase you pointed to. MixMatch works on both the Tesla P100 and V100, but not on the A100 (using the GPU TensorFlow environment I showed you above).

So I guess:

  1. There might be something unique to fixmatch that makes it not work on my P100 and V100 (while MixMatch works).
  2. There must be something wrong with my A100s.

For now, I would be very happy if I could get fixmatch to work on my P100 or V100. I really don't understand what could make mixmatch able to run but not fixmatch. Do you have any thoughts?

@carlini
Collaborator

carlini commented Mar 12, 2021

That's very interesting. This fixmatch codebase has an implementation of mixmatch. Does that also run properly?

If that works, then maybe try running fixmatch with something like --uratio=1 and see if that helps. Maybe it's the batch size that's the problem?
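For a rough sense of why --uratio matters (my arithmetic, using batch=64 and uratio=7, the defaults from the FixMatch paper and code, which are not stated in this thread):

    # Per-step image count in FixMatch: each unlabeled example is forwarded
    # twice (a weakly and a strongly augmented view) on top of the labeled batch.
    batch = 64    # labeled images per step (assumed default)
    uratio = 7    # unlabeled-to-labeled ratio (assumed default)

    images_per_step = batch + 2 * batch * uratio
    print(images_per_step)           # 960 with the defaults
    print(batch + 2 * batch * 1)     # 192 with --uratio=1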

@dlnewbie-h
Author

Hi Carlini

That's very weird: I cannot run the implementation of mixmatch from the fixmatch codebase (while I am able to run the original mixmatch codebase). May I ask what GPU you run the fixmatch codebase on? Could it be that the newer code is not compatible with some GPUs? (I have very limited knowledge about this; I hope that's not a ridiculous guess...)

@carlini
Collaborator

carlini commented Mar 27, 2021

That is very strange. We've never seen any issues with different GPUs in the past, and the two codebases are very similar.

Maybe @david-berthelot has some insight that I'm missing.

@dlnewbie-h
Author

In your requirements.txt, there's no specific requirement for the Python version or for the cudatoolkit, cuDNN, or CUDA versions. Do I need to install any specific version of cudatoolkit, cuDNN, or CUDA?
