Problem with Retrain DL in Quickannotator #13

Open
anuradhakar49 opened this issue Oct 17, 2021 · 22 comments

Comments

@anuradhakar49

Hi,
The tool installs smoothly as described in the GitHub repository, but I am encountering problems when retraining the deep learning model. For example, after adding 2 pairs of images to a new project, making patches and annotations, and uploading them as training and test images, clicking "Retrain model" on the Project page gives: ERROR: train_autoencoder (job N) failed. On the Annotations page, clicking the "Retrain DL" button displays an HTML error.

Please provide suggestions on how to resolve these errors.
Anuradha Kar

@choosehappy
Owner

choosehappy commented Oct 19, 2021 via email

@mariokreutzfeldt

Hi @anuradhakar49 and @choosehappy

Were you able to solve the issue? I'm having the same problem:

2021-11-25 13:54:10,872 [INFO] (THREAD 18304) About to train a new transfer model for try2
2021-11-25 13:54:10,887 INFO sqlalchemy.engine.base.Engine ROLLBACK
2021-11-25 13:54:10,887 [INFO] (THREAD 18304) ROLLBACK
2021-11-25 13:54:10,888 [INFO] (THREAD 18304) 127.0.0.1 - - [25/Nov/2021 13:54:10] "GET /api/try2/retrain_dl?frommodelid=0 HTTP/1.1" 404 -

System:
Win10, python 3.8, cuda 10.2

Best regards,
Mario

@choosehappy
Owner

Sorry to hear this Mario!

Is this information you're putting here from the command line itself, or is it coming from the log file?

If you can send over the entire associated log file that would be appreciated

In the end, we were able to fix anuradhakar49's problem; it was environmental. If I remember correctly, it was an incompatible CUDA driver + CUDA version? @tasvora may have additional info.

@tasvora
Collaborator

tasvora commented Nov 25, 2021 via email

@anuradhakar49
Author

Yes, this issue is solved; it was linked to the CUDA + torch versions. @mariokreutzfeldt, please check that you have a CUDA-compatible GPU and that your code can access it (i.e. the GPU is not busy with another task). Also make sure the PyTorch version is compatible with CUDA 10.2 (https://pytorch.org/get-started/previous-versions/). Otherwise, try reinstalling with the CPU-only version of torch as a test.
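
The checks suggested above can be gathered in one place. The following is a minimal sketch, not part of QuickAnnotator; the function name torch_cuda_report is made up for illustration, and it only relies on torch.__version__, torch.version.cuda and torch.cuda.is_available().

```python
import importlib.util


def torch_cuda_report():
    """Summarize whether PyTorch is installed and can reach a CUDA device.

    Hypothetical helper for illustration; not part of QuickAnnotator.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_for_cuda": None,
        "cuda_available": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment
    import torch

    report["torch_installed"] = True
    report["torch_version"] = str(torch.__version__)
    report["built_for_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If built_for_cuda is None, the installed wheel is a CPU-only build. If it shows a CUDA version but cuda_available is False, the driver or GPU is the more likely place to look.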

@mariokreutzfeldt

Dear all,
thank you for your fast replies!!

I have verified the CUDA installation via nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89

and the PyTorch installation via torch.cuda.is_available():
True

During installation of QA I ran into many unresolvable version issues.
So I ended up installing the following.

numpy==1.17.3
Flask_SQLAlchemy==2.4.0
scikit_image==0.16.2
scikit_learn==0.24.0
opencv_python_headless==4.1.2.30
scipy==1.4.1
requests==2.22.0
SQLAlchemy==1.3.5
tensorboard==2.4.1
ttach==0.0.2
albumentations==0.4.3
config==0.4.2
Flask==1.0.3
Pillow==8.1.2
llvmlite==0.34.0
numba
umap-learn
Flask_Restless==0.17.0
python-openslide==1.1.2

For PyTorch, the automatic installation had already failed for another project, so I downloaded the packages manually:
torch 1.8.1+cu102
torchaudio 0.10.0+cu102
torchvision 0.9.1+cu102

I installed torch first. When I then installed torchaudio and torchvision, the installer uninstalled torch and replaced it with a non-CUDA version.
So I installed torch+cu102 again after installing torchaudio and torchvision.

@choosehappy, the complete log is here

Best regards,
Mario
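
Since the comment above describes torch being silently replaced by a non-CUDA build, one way to catch that is to compare the build tag after the "+" in each package version. This is a minimal sketch; build_tag and mismatched_builds are hypothetical helper names, and the versions are hard-coded from the comment rather than read from a live environment (there, importlib.metadata.version("torch") could supply them).

```python
def build_tag(version):
    """Return the local build tag of a version string, e.g. 'cu102' or 'cpu'.

    Returns None when the version has no '+tag' suffix.
    """
    _, sep, tag = version.partition("+")
    return tag if sep else None


def mismatched_builds(versions, expected_tag):
    """List the packages whose build tag differs from expected_tag."""
    return sorted(
        name for name, version in versions.items()
        if build_tag(version) != expected_tag
    )


# Versions as reported in the comment above.
reported = {
    "torch": "1.8.1+cu102",
    "torchaudio": "0.10.0+cu102",
    "torchvision": "0.9.1+cu102",
}
print(mismatched_builds(reported, "cu102"))

# Hypothetical state after the installer swapped in a CPU-only torch wheel.
print(mismatched_builds({**reported, "torch": "1.8.1+cpu"}, "cu102"))
```

Note that this only checks the CUDA tag. Each torchvision and torchaudio release is also built against one specific torch release, so the pairings listed on the PyTorch previous-versions page are worth checking as well; a mismatched pairing would be one plausible reason for the installer replacing torch.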

@mariokreutzfeldt

Quick additional info:
Replacing the CUDA build of PyTorch with the CPU build did not solve it.
I am still getting ERROR 404.

@choosehappy
Owner

choosehappy commented Nov 26, 2021 via email

@mariokreutzfeldt

@choosehappy here you go.

It doesn't contain data.db-wal because that file was 0 KB.

@choosehappy
Owner

choosehappy commented Nov 26, 2021 via email

@tasvora
Collaborator

tasvora commented Nov 26, 2021 via email

@mariokreutzfeldt

Here are the log files and the data.db after changing the config.
I am now using the CPU version of PyTorch and have seen that one project gives me "not enough training/test images", which makes sense. The second project is still giving error 400.

@choosehappy
Owner

Hmm... I think we'll have to jump on a call; these log files and the database seem to indicate that things are working as expected : )

@mariokreutzfeldt

Thank you @choosehappy and @tasvora for helping solve this issue!
In case someone else has this problem: it turned out that I had a broken svml_dispmd.dll (730 KB instead of 18 MB).
Also, make sure scikit-image==0.18.1 is installed.

Best regards,
Mario
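
A truncated DLL like the one described above can be spotted by its size. This is a minimal sketch under assumptions: looks_truncated is a hypothetical helper, the path is a placeholder (the real location depends on the Python or conda installation), and the 10 MB threshold is only a guess derived from the 18 MB figure reported above.

```python
from pathlib import Path


def looks_truncated(path, min_bytes):
    """Return True if the file is missing or smaller than min_bytes."""
    file = Path(path)
    return (not file.is_file()) or file.stat().st_size < min_bytes


# Hypothetical location; the real path depends on the Python/conda install.
dll_path = Path("Library/bin/svml_dispmd.dll")

# Threshold is an assumption based on the ~18 MB size reported above.
if looks_truncated(dll_path, min_bytes=10 * 1024 * 1024):
    print(f"{dll_path} is missing or suspiciously small")
```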

@stellaqu123

Hi @choosehappy and @mariokreutzfeldt,
I have the same problem with Retrain DL in QuickAnnotator. After annotating a patch, when I ran Retrain DL - From base, I got an error message like "ERROR 404: (Unknown error)". The screenshot is below:
[image]
The console log is:
"2022-06-09 08:49:11,130 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2022-06-09 08:49:11,130 [INFO] (THREAD 139621868058368) BEGIN (implicit)
2022-06-09 08:49:11,131 INFO sqlalchemy.engine.base.Engine SELECT project.id AS project_id, project.name AS project_name, project.description AS project_description, project.date AS project_date, project.train_ae_time AS project_train_ae_time, project.make_patches_time AS project_make_patches_time, project.iteration AS project_iteration, project.embed_iteration AS project_embed_iteration
FROM project
WHERE project.name = ?
LIMIT ? OFFSET ?
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) SELECT project.id AS project_id, project.name AS project_name, project.description AS project_description, project.date AS project_date, project.train_ae_time AS project_train_ae_time, project.make_patches_time AS project_make_patches_time, project.iteration AS project_iteration, project.embed_iteration AS project_embed_iteration
FROM project
WHERE project.name = ?
LIMIT ? OFFSET ?
2022-06-09 08:49:11,131 INFO sqlalchemy.engine.base.Engine ('test1', 1, 0)
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) ('test1', 1, 0)
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) About to train a new transfer model for test1
2022-06-09 08:49:11,131 [INFO] (THREAD 139621868058368) About to train a new transfer model for test1
2022-06-09 08:49:11,132 INFO sqlalchemy.engine.base.Engine ROLLBACK
2022-06-09 08:49:11,132 [INFO] (THREAD 139621868058368) ROLLBACK
2022-06-09 08:49:11,132 [INFO] (THREAD 139621868058368) 124.126.17.86 - - [09/Jun/2022 08:49:11] "GET /api/test1/retrain_dl?frommodelid=0 HTTP/1.1" 404 -"
Following the earlier discussion in this thread, I checked my CUDA and PyTorch versions, and they are compatible. torch.cuda.is_available() returns True.
I hope you can help with this issue.
Best regards,
Xiaoping

@choosehappy
Owner

We can start by collecting more information:

  1. operating system + version
  2. python version
  3. pip freeze output
  4. cuda version
  5. Nvidia GPU version
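
The items in the list above can be collected in one go. The following is a minimal sketch, not an official QuickAnnotator script; run and collect_diagnostics are hypothetical names, and the sketch assumes nvcc and nvidia-smi are on PATH when installed (entries come back as None otherwise).

```python
import platform
import shutil
import subprocess
import sys


def run(cmd):
    """Run a command and return its output, or None if it cannot be run."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.SubprocessError):
        return None
    return (result.stdout or result.stderr).strip()


def collect_diagnostics(include_pip_freeze=True):
    """Gather OS, Python, CUDA toolkit, GPU/driver and package information."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "cuda_toolkit": run(["nvcc", "--version"]),
        "gpu_and_driver": run(["nvidia-smi"]),
    }
    if include_pip_freeze:
        info["pip_freeze"] = run([sys.executable, "-m", "pip", "freeze"])
    return info


if __name__ == "__main__":
    for key, value in collect_diagnostics().items():
        print(f"== {key} ==\n{value}\n")
```

Note that nvcc reports the installed CUDA toolkit, while nvidia-smi reports the driver and the highest CUDA version that driver supports, so the two numbers can legitimately differ.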

@stellaqu123

Sure.

  1. operating system + version
    I use an Amazon EC2 Linux system. Using the command "cat /proc/version", the version is "Linux version 4.14.238-125.422.amzn1.x86_64 (mockbuild@koji-pdx-corp-builder-64004) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #1 SMP Tue Jul 20 20:51:46 UTC 2021".
  2. python version
    python 3.8.13
  3. pip freeze output
    the output is here,
    pip_freeze_output.txt
  4. cuda version
    with command "nvcc --version", the information is as below
    "
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2018 NVIDIA Corporation
    Built on Sat_Aug_25_21:08:01_CDT_2018
    Cuda compilation tools, release 10.0, V10.0.130
    "
  5. Nvidia GPU version
    NVIDIA-SMI 450.142.00 Driver Version: 450.142.00 CUDA Version: 11.0
    T4
  6. torch version and cuda version
    torch version: 1.8.1+cu111
    torch.cuda.is_available() returns True

@choosehappy
Owner

hmmm!! this all looks very reasonable!

is there any additional information in the console window at the top of the screen on the right?

In looking at the API itself and the console information you provided, the only 404 message that seems reasonable is here:

error=f"Deep learning model {frommodelid} doesn't exist"), 404

This would seem to suggest that you don't have a base model already trained? is that the case?

if you look here:

https://github.com/choosehappy/QuickAnnotator/wiki/Image-List-Page

did you use the "3. (re)train model 0" button?

this step is needed to give good default weights
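
When scanning a long console log for failures like the 404 discussed above, it can help to pull out the endpoint, query parameters and status code from each access line. This is a minimal sketch; parse_access_line and the regular expression are illustrative assumptions about the log format, based only on the lines quoted in this thread.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request and status code in an access log line.
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def parse_access_line(line):
    """Extract method, endpoint, query and status, or None if no match."""
    match = ACCESS_RE.search(line)
    if match is None:
        return None
    parts = urlsplit(match.group("target"))
    return {
        "method": match.group("method"),
        "endpoint": parts.path,
        "query": parse_qs(parts.query),
        "status": int(match.group("status")),
    }


# Log line quoted from the report above.
line = (
    "2022-06-09 08:49:11,132 [INFO] (THREAD 139621868058368) 124.126.17.86 - - "
    '[09/Jun/2022 08:49:11] "GET /api/test1/retrain_dl?frommodelid=0 HTTP/1.1" 404 -'
)
print(parse_access_line(line))
```

For the quoted line this shows a 404 on the retrain_dl endpoint with frommodelid set to 0, which matches the explanation that the request asks to start from model 0 before that model exists.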

@stellaqu123

Thanks @choosehappy .
I hadn't used the "3. (re)train model 0" button before.
When I used the "3. (re)train model 0" button, I got this error message in the console:
"
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
"
After downgrading the protobuf package to 3.19.1, both "3. (re)train model 0" and the Retrain DL function work. The problem is solved.
Thanks for your help! 👍
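
The error message above says protobuf must be 3.20.x or lower for the outdated generated code to load. The following is a minimal sketch of checking the installed version against that limit; parse_version and protobuf_needs_downgrade are hypothetical helpers, and the cutoff is taken from the error text rather than from any QuickAnnotator requirement.

```python
from importlib import metadata


def parse_version(version):
    """Turn '3.19.1' into (3, 19, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def protobuf_needs_downgrade(version):
    """True if version is newer than the 3.20.x line named in the error."""
    return parse_version(version)[:2] > (3, 20)


try:
    installed = metadata.version("protobuf")
except metadata.PackageNotFoundError:
    installed = None  # protobuf is not installed in this environment

if installed is not None and protobuf_needs_downgrade(installed):
    print(f"protobuf {installed} is newer than 3.20.x; consider pinning it")
```

The fix reported above corresponds to pinning the package, for example with pip install "protobuf==3.19.1".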

@choosehappy
Owner

choosehappy commented Jun 14, 2022 via email

@stellaqu123

Yes, I can now use the QuickAnnotator Retrain DL function.
I didn't use Docker; I installed the package directly on my operating system.

@choosehappy
Owner

choosehappy commented Oct 11, 2022 via email
