feat: docker improvements #264

Open: vladlearns wants to merge 3 commits into main

@vladlearns vladlearns commented Jun 7, 2024

Features:

  1. Includes CUDA 12.1 and cuDNN 8.9.7, plus the CUDA/cuDNN installation steps. Solves: Kernel died #215 and libcudnn error #225.
  2. Optimized Docker layer caching for faster builds.
  3. Added the ability to download only the necessary checkpoints, saving time and disk space.
  4. Switched base image to python:3.10-slim.

This setup has been thoroughly tested to ensure stability and performance.

Prerequisites:

Join the NVIDIA Developer Program:

  1. Go to the NVIDIA Developer Program.
  2. Sign up for an account if you don't already have one.
  3. Once you have an account, log in to the NVIDIA Developer website.

Download cuDNN:

  1. Navigate to the cuDNN Archive.
  2. Select the version you need (cuDNN 8.9.7 for CUDA 12.1).
  3. Download the appropriate file for Linux (should look like cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz).
  4. Place the file in the root of the project directory.
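
Before building, it can help to confirm that the archive actually sits in the directory you build from. The following is a minimal sketch, not part of this PR: the function name `check_cudnn_archive` is hypothetical, and the file name is the example from step 3, so adjust it if your download is named differently.

```shell
#!/bin/sh
# Sketch only: check_cudnn_archive is a hypothetical helper, not part of the PR.
# It verifies that the cuDNN archive from the download step is in the given directory.
check_cudnn_archive() {
    dir="${1:-.}"
    # File name as given in the download step; yours may differ.
    archive="cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz"
    if [ -f "$dir/$archive" ]; then
        echo "found: $dir/$archive"
        return 0
    fi
    echo "missing: $archive (expected in $dir)" >&2
    return 1
}
```

A typical use would be `check_cudnn_archive . && docker build -t openvoice .`, so the build only starts when the archive is present.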

Run:

docker build -t openvoice .
then
docker run --gpus all -p 8888:8888 openvoice v2

tl;dr

Hey everyone,

I've been working on improving the Docker setup for OpenVoice, and I think these changes will make it much easier to run in a containerized environment.

The main issue I've seen is with CUDA and cuDNN versions not matching up, causing errors. In this Dockerfile, I've included CUDA 12.1 and cuDNN 8.9.7, which work well with the latest PyTorch that supports CUDA 12. This should help eliminate those errors.

Another improvement is the entrypoint shell script: you can now download only the checkpoints you need. It fetches the checkpoints for the specified version only, saving time and bandwidth.
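
To illustrate the idea of version-specific downloads, here is a rough sketch of how an entrypoint could map the version argument (the `v2` in `docker run ... openvoice v2`) to a single archive. This is an assumption about the approach, not the PR's actual script: `select_checkpoint_archive` is a hypothetical name, the v2 archive name is the one that shows up in the startup log in this conversation, and the v1 name is a made-up placeholder.

```shell
#!/bin/sh
# Sketch only: select_checkpoint_archive is hypothetical, not the PR's entrypoint.
# It prints the archive to fetch for the requested version and fails on anything else.
select_checkpoint_archive() {
    case "$1" in
        v2)
            # Archive name as seen in the startup log in this thread.
            echo "checkpoints_v2_0417.zip"
            ;;
        v1)
            # Placeholder: the real v1 archive name is not given in this thread.
            echo "${V1_ARCHIVE:-checkpoints_v1.zip}"
            ;;
        *)
            echo "unknown version: $1" >&2
            return 1
            ;;
    esac
}
```

An entrypoint built this way would download only the archive returned for the requested version and skip the others.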

I've also optimized the Docker layer cache. I rearranged some commands so that if only the local files change, Docker can reuse the base layers that have all the lengthy installations. This should speed up your builds when you're making changes to your local setup.

In summary, the setup is smoother, faster, and less prone to errors. It's now easier to spin up different versions and notebooks without CUDA issues or long installations.

Give it a try and let me know how it goes! I'm always happy to hear feedback and suggestions. I think this will be a big improvement for the OpenVoice experience.

Happy Dockerizing! 🐳
Vlad

Screenshots: [image: running], [image: results]

This was referenced Jun 7, 2024
@vladlearns
Author

If you want to implement a similar setup on Windows, follow:

  1. Kernel died #215 (comment)
  2. Kernel died #215 (comment)

@vladlearns
Author

@wl-zhao, @yuxumin, @Zengyi-Qin, could you take a look, please? I wasn't able to assign a reviewer; this option seems to be disabled in this repo. Thank you!

@oldgithubman oldgithubman left a comment

I approved this too soon and I don't know how to (or if I even can) retract it. This does not fix "kernel died" for me and has numerous problems (more than I care to enumerate right now). Needs to go back in the oven. I'll probably just move on. I've wasted far too much time trying to get this project to work. Good luck.

2024-06-11 08:25:45 (17.9 MB/s) - ‘checkpoints_v2_0417.zip’ saved [122086901/122086901]

Archive:  checkpoints_v2_0417.zip
   creating: /tmp/extract_temp/checkpoints_v2/
   creating: /tmp/extract_temp/checkpoints_v2/base_speakers/
   creating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/fr.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/en-us.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/en-india.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/en-br.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/es.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/en-newest.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/jp.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/en-default.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/kr.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/zh.pth  
  inflating: /tmp/extract_temp/checkpoints_v2/base_speakers/ses/en-au.pth  
   creating: /tmp/extract_temp/checkpoints_v2/converter/
  inflating: /tmp/extract_temp/checkpoints_v2/converter/config.json  
  inflating: /tmp/extract_temp/checkpoints_v2/converter/checkpoint.pth  
mv: cannot move '/tmp/extract_temp/checkpoints_v2/base_speakers' to '/workspace/checkpoints_v2/base_speakers': Directory not empty
mv: cannot move '/tmp/extract_temp/checkpoints_v2/converter' to '/workspace/checkpoints_v2/converter': Directory not empty
Starting Jupyter Notebook...
[I 2024-06-11 08:25:46.049 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-06-11 08:25:46.051 ServerApp] jupyter_server_terminals | extension was successfully linked.
[W 2024-06-11 08:25:46.052 LabApp] 'notebook_dir' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2024-06-11 08:25:46.053 ServerApp] notebook_dir is deprecated, use root_dir
[I 2024-06-11 08:25:46.053 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-06-11 08:25:46.055 ServerApp] notebook | extension was successfully linked.
[I 2024-06-11 08:25:46.055 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2024-06-11 08:25:46.175 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-06-11 08:25:46.181 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-06-11 08:25:46.182 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-06-11 08:25:46.182 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-06-11 08:25:46.183 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.10/site-packages/jupyterlab
[I 2024-06-11 08:25:46.183 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 2024-06-11 08:25:46.183 LabApp] Extension Manager is 'pypi'.
[I 2024-06-11 08:25:46.197 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-06-11 08:25:46.198 ServerApp] notebook | extension was successfully loaded.
[I 2024-06-11 08:25:46.198 ServerApp] Serving notebooks from local directory: /workspace
[I 2024-06-11 08:25:46.198 ServerApp] Jupyter Server 2.14.1 is running at:
[I 2024-06-11 08:25:46.198 ServerApp] http://3b1a70be49a4:8888/tree?token=24d0165bed2a4d5aefc4b79f960fc5f53557b15e02dd5b69
[I 2024-06-11 08:25:46.198 ServerApp]     http://127.0.0.1:8888/tree?token=24d0165bed2a4d5aefc4b79f960fc5f53557b15e02dd5b69
[I 2024-06-11 08:25:46.198 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-06-11 08:25:46.199 ServerApp] 
    
    To access the server, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/jpserver-15-open.html
    Or copy and paste one of these URLs:
        http://3b1a70be49a4:8888/tree?token=24d0165bed2a4d5aefc4b79f960fc5f53557b15e02dd5b69
        http://127.0.0.1:8888/tree?token=24d0165bed2a4d5aefc4b79f960fc5f53557b15e02dd5b69
[W 2024-06-11 08:25:46.207 ServerApp] Failed to fetch commands from language server spec finder `pyright`:
    The 'nodejs' trait of a LanguageServerManager instance expected a unicode string, not the NoneType None.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 632, in get
    value = obj._trait_values[self.name]
KeyError: 'nodejs'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/jupyter_lsp/manager.py", line 279, in _autodetect_language_servers
    specs = spec_finder(self) or {}
  File "/usr/local/lib/python3.10/site-packages/jupyter_lsp/specs/utils.py", line 148, in __call__
    "argv": ([mgr.nodejs, node_module, *self.args] if is_installed else []),
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 687, in __get__
    return t.cast(G, self.get(obj, cls))  # the G should encode the Optional
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 649, in get
    value = self._validate(obj, default)
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 722, in _validate
    value = self.validate(obj, value)
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 2945, in validate
    self.error(obj, value)
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 831, in error
    raise TraitError(e)
traitlets.traitlets.TraitError: The 'nodejs' trait of a LanguageServerManager instance expected a unicode string, not the NoneType None.
[I 2024-06-11 08:25:46.208 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-server-nodejs, javascript-typescript-langserver, jedi-language-server, julia-language-server, python-language-server, python-lsp-server, r-languageserver, sql-language-server, texlab, typescript-language-server, unified-language-server, vscode-css-languageserver-bin, vscode-html-languageserver-bin, vscode-json-languageserver-bin, yaml-language-server
[W 2024-06-11 08:25:46.214 ServerApp] Failed to fetch commands from language server spec finder `pyright`:
    The 'nodejs' trait of a LanguageServerManager instance expected a unicode string, not the NoneType None.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 632, in get
    value = obj._trait_values[self.name]
KeyError: 'nodejs'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/jupyter_lsp/manager.py", line 279, in _autodetect_language_servers
    specs = spec_finder(self) or {}
  File "/usr/local/lib/python3.10/site-packages/jupyter_lsp/specs/utils.py", line 148, in __call__
    "argv": ([mgr.nodejs, node_module, *self.args] if is_installed else []),
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 687, in __get__
    return t.cast(G, self.get(obj, cls))  # the G should encode the Optional
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 649, in get
    value = self._validate(obj, default)
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 722, in _validate
    value = self.validate(obj, value)
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 2945, in validate
    self.error(obj, value)
  File "/usr/local/lib/python3.10/site-packages/traitlets/traitlets.py", line 831, in error
    raise TraitError(e)
traitlets.traitlets.TraitError: The 'nodejs' trait of a LanguageServerManager instance expected a unicode string, not the NoneType None.

@vladlearns
Author

vladlearns commented Jun 11, 2024

@oldmanjk Thank you for reviewing the pull request and bringing this up, but "kernel died" can have various causes: this error message can appear for anything from lack of memory to missing libraries.

First issue you are having

Based on your logs, the first issue is related to moving the extracted checkpoint files: the directories /workspace/checkpoints_v2/base_speakers and /workspace/checkpoints_v2/converter are not empty, which prevents the extracted files from being moved. Since this is a Docker container, those folders should not exist or be populated before the checkpoint extraction.

In my testing environment, I built the image from scratch, and it works without any issues.

Idk what your build context is, but these are some things that I can think of:

  • Are you using any persistent volumes or bind mounts when running the container? If so, could you try running the container without those mounts to see if the issue persists? If you previously ran the container with persistent volumes or bind mounts attached to those specific directories, the folders might persist even after the container is removed, causing conflicts when building the image again. Likewise, if previous container instances or volumes were not cleaned up, the files might still be there. Could you please make sure that any previous container instances and associated volumes are cleaned up before building the image again?
  • Are there any additional files or directories in your build context that might be causing the folders to be populated?
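
For context on the error itself: `mv` refuses to move a directory onto an existing, non-empty directory of the same name, which is what the `Directory not empty` lines in the log report. One possible way to make an extraction step tolerate leftovers is to merge the contents instead of moving the directory. The sketch below is an illustration under that assumption, not what the PR's entrypoint does; `merge_dir` is a hypothetical helper, and it overwrites files that share a name.

```shell
#!/bin/sh
# Sketch only: merge_dir is a hypothetical helper, not part of the PR.
# It copies the contents of src into dest (creating dest if needed),
# then removes src, so an existing non-empty dest does not cause a failure.
merge_dir() {
    src="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    # "src/." means the contents of src, including dotfiles.
    # Files with the same name in dest are overwritten.
    cp -a "$src/." "$dest/" && rm -rf "$src"
}
```

With this, something like `merge_dir /tmp/extract_temp/checkpoints_v2 /workspace/checkpoints_v2` would succeed whether or not the target directories already hold files. Whether silently overwriting is acceptable depends on the setup; cleaning up old volumes, as suggested above, avoids the question entirely.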

Second issue

For the second one: it looks like LSP tries to autodetect and start language servers, and it looks for the nodejs executable path. I have the container running right now, here: [screenshot]
It looks like the only mention of jupyterlab-lsp requiring node is your comment in this pull request, so I assume this is related to your particular setup. [screenshot]

@oldmanjk, could you please check if you have any user-specific Jupyter configurations, additional Jupyter extensions, or dev environment settings that might be enabling or interacting with the LSP extension? If so, try disabling or removing them and rebuilding the container to see if the errors persist.

If the issue still persists after considering these, I'd be happy to work with you to investigate further and find a solution. We can explore additional steps.

@vladlearns vladlearns requested a review from oldgithubman June 11, 2024 20:43
@oldgithubman

Thanks for the fast and thorough response. Unfortunately, I have deleted everything and moved on. Good luck though!

@npjonath

@vladlearns I have been working in parallel on a fix for the Dockerfile that suits a CPU setup, particularly for M-series Macs and similar systems. I have finally found a solution. Considering your work on this matter, perhaps we can combine our efforts. We could develop specialized Dockerfiles, one for CUDA and another for CPU. Correspondingly, we could add docker-compose files (docker-compose.cuda.yml and docker-compose.cpu.yml). What do you think?

My work : npjonath#1

Note: this PR also includes the fix from @Afnanksalal, as it is required to run this project on CPU-based architectures. (#262)

OpenVoice V1 works correctly on my setup. V2 is still not working because of this issue from MeloTTS:

Issue: myshell-ai/MeloTTS#167
And a possible solution for this by running a specific version of MeloTTS : https://github.com/Meiye-lj/Dockerfiles/blob/76c88309a4bb7b7070441bed3b4b72231f5349b8/MeloTTS/Dockerfile

@oldgithubman

I don't use this project anymore, so I probably shouldn't be a requested reviewer

@vladlearns
Author

@oldgithubman You added yourself by approving the PR and then dismissing the review because of your environment. Later, you decided to leave without providing any details. Now, when I ask for a review, you are automatically added, and there is no way to remove you.

@vladlearns
Author

@npjonath Hey!
So, you just want me to rename the file?

@npjonath

@vladlearns No, it was just to discuss this with you. You can leave the naming as is. I guess GPU usage is the default. I will add a docker-compose file and a Dockerfile.cpu separately to extend your implementation.

@oldgithubman

@oldgithubman You added yourself by approving the PR and then dismissing the review because of your environment. Later, you decided to leave without providing any details. Now, when I ask for a review, you are automatically added, and there is no way to remove you.

Ok. I don't really know what I'm doing. I'll just approve it so you can move on

@vladlearns
Author

@npjonath Sure. So far, I've tested my setup in multiple environments, and it works for several other people as well. However, it seems the maintainers don't merge pull requests into the main branch. Instead, they ask contributors to fork the repository and point to the fork in the documentation.
