feat: Install Nvidia DOCA on the servers post provisioning #2219

Draft
wants to merge 1 commit into
base: devel

@glimchb glimchb commented Jan 16, 2024

Issues Resolved by this Pull Request

Fixes #

Description of the Solution

  • If nvidia_doca_path is provided in input/provision_config.yml and Nvidia DPUs are available on the target nodes, the DOCA packages are deployed after provisioning without user intervention.
  • DOCA can also be installed using network.yml after provisioning the servers (assuming the provisioning tool did not already install the DOCA packages).
  • The DOCA package can be downloaded from https://developer.nvidia.com/networking/doca

From Nvidia documentation:

$ wget https://www.mellanox.com/downloads/DOCA/DOCA_v2.5.0/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm
$ rpm -Uvh doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm
$ yum makecache
$ sudo yum install doca-runtime
$ sudo yum install doca-tools
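
As a rough sketch, the manual steps above might map to Ansible tasks like the ones below. The task names are hypothetical, ansible.builtin.dnf is used here in place of yum, and disable_gpg_check is an assumption for a package installed straight from a URL. The role in this PR may handle these steps differently.

```yaml
# Sketch only: mirrors the documented shell commands, not the role in this PR.
- name: Install the DOCA host repo package
  ansible.builtin.dnf:
    name: "https://www.mellanox.com/downloads/DOCA/DOCA_v2.5.0/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"
    state: present
    disable_gpg_check: true  # assumption: the rpm's signing key is not imported
  become: true

- name: Install DOCA runtime and tools
  ansible.builtin.dnf:
    name:
      - doca-runtime
      - doca-tools
    state: present
    update_cache: true  # plays the role of "yum makecache"
  become: true
```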

Suggested Reviewers

@sujit-jadhav

@glimchb force-pushed the doca branch 11 times, most recently from d4ab28d to 508e321, on January 16, 2024 19:01

# Absolute path to local copy of .tgz file containing DOCA package.
# The package can be downloaded from https://developer.nvidia.com/networking/doca/
# Optional variable.
nvidia_doca_offline_path: ""

Question: during testing I saw nvidia_doca_path and nvidia_doca_offline_path being mixed up. This needs to be reviewed in more detail.

# Usage: configure_doca.yml
doca_tmp_path: /tmp/doca
doca_core_path: /install/doca/x86_64/doca-core
doca_deps_path: /install/doca/x86_64/doca-deps

This section needs review; there are too many parameters.

# limitations under the License.
---

- name: Delete doca repo folders

This entire file needs review. It looks like a copy-paste from the CUDA code and needs more attention.

ansible.builtin.include_role:
  name: nvidia_doca
  tasks_from: validations.yml

- name: Check nodes having Infiniband Support
  hosts: all
  tasks:

@glimchb glimchb Jan 16, 2024

Q: This file is missing the code that actually starts the DOCA installation from the nvidia_doca role.
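
One possible shape for the missing step is sketched below. It assumes the nvidia_doca role's default task file performs the installation and that nvidia_doca_path is already loaded from the input config; the play and task names are hypothetical. The when condition is the one used elsewhere in this PR.

```yaml
# Sketch only: assumes the role's default tasks (main.yml) do the install.
- name: Install DOCA on the target nodes
  hosts: all
  tasks:
    - name: Run the nvidia_doca role
      ansible.builtin.include_role:
        name: nvidia_doca
      # Skip the role entirely when no DOCA package path was provided.
      when: nvidia_doca_path | default("", true) | length > 0
```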

block:
  - name: Install packages from doca rpm file
    ansible.builtin.yum:
      name: "{{ doca_filepath }}"

Q: I need to understand how NFS fits in here, and how nvidia_doca_path differs from doca_filepath.

- name: Include vars file of inventory role
  ansible.builtin.include_vars: "{{ role_path }}/../../../input/network_config.yml"

# - name: Check status of doca installation

Do we need this, or can it be removed?
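
If the status check is kept, a minimal sketch could rely on package facts rather than a custom command. The fact name doca_installed is hypothetical, and using doca-runtime as the marker package is an assumption based on the install commands in the PR description.

```yaml
# Sketch only: detects an existing DOCA install via the package manager.
- name: Gather installed package facts
  ansible.builtin.package_facts:
    manager: auto

- name: Record whether doca-runtime is already installed
  ansible.builtin.set_fact:
    # doca_installed is a hypothetical fact name used for illustration.
    doca_installed: "{{ 'doca-runtime' in ansible_facts.packages }}"
```

Later install tasks could then be guarded with a condition such as when: not doca_installed.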

os_supported_rocky: "rocky"
os_supported_rhel: "redhat"

doca_repo_url: "https://linux.mellanox.com/public/repo/doca/{{ nvidia_doca_version }}/rhel/{{ compute_os_version }}/x86_64"

A correct URL example is https://linux.mellanox.com/public/repo/doca/2.5.0/rhel8.0/x86_64/
Please replace rhel with a variable so that the URL can be used with other distros.
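
A sketch of what that could look like in the vars file follows. The variable doca_repo_distro is hypothetical, and the sketch assumes compute_os_version holds a value such as 8.0, so that the distro name and version join without a slash as in the example URL.

```yaml
# Sketch only: doca_repo_distro is a hypothetical variable name.
# "rhel" is the only distro value that appears in this PR.
doca_repo_distro: "rhel"

# With nvidia_doca_version "2.5.0" and compute_os_version "8.0" this renders as
# https://linux.mellanox.com/public/repo/doca/2.5.0/rhel8.0/x86_64
doca_repo_url: "https://linux.mellanox.com/public/repo/doca/{{ nvidia_doca_version }}/{{ doca_repo_distro }}{{ compute_os_version }}/x86_64"
```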

when: nvidia_doca_path | default("", true) | length > 0

# - name: Validate nvidia_doca_version
#   ansible.builtin.assert:

@glimchb glimchb Jan 16, 2024

Do we need this code, or can it be removed?
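
If the validation is kept, it might look roughly like the sketch below. The three-part version pattern is an assumption based on the 2.5.0 example in this PR, and the failure message is illustrative; the commented-out original may have checked something different.

```yaml
# Sketch only: accepts versions shaped like 2.5.0 (assumed format).
- name: Validate nvidia_doca_version
  ansible.builtin.assert:
    that:
      - nvidia_doca_version is defined
      - (nvidia_doca_version | string) is match('^[0-9]+[.][0-9]+[.][0-9]+$')
    fail_msg: "nvidia_doca_version must be of the form X.Y.Z, for example 2.5.0"
  # Only validate when a DOCA package path was actually provided.
  when: nvidia_doca_path | default("", true) | length > 0
```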
