Experiment with Kueue via YAMLs #2

Closed
3 of 6 tasks
leonpawelzik opened this issue Mar 28, 2024 · 3 comments

leonpawelzik commented Mar 28, 2024

Explore Kueue functionality through YAML configuration

As part of our infrastructure product development, we need to investigate and experiment with Kueue. The focus of this ticket is the exploration of Kueue's features and capabilities using YAML configuration files.

Please store the .yaml files that were used for experimentation in this repository.

  • 1. Use team development environment [LINK TO ISSUE]

  • 2. Install Kueue on the Kubernetes cluster following the official documentation.

  • 3. Create YAML configuration files for the following Kueue resources: https://gist.github.com/AdrianoKF/52a74a7c701a0aa19f63ffe298eb51a4#file-single-clusterqueue-setup-yaml

    • Queues
    • Resource Flavors
    • Resource Quotas
  • 4. Submit jobs using YAML configuration files:

  • 5. Monitor and observe the behavior of Kueue:

    • Check the status of submitted jobs using Kubernetes and Kueue CLI tools.
    • Verify correct job scheduling.
    • Analyze the resource utilization of the Kubernetes cluster while jobs are running.
  • 6. Document the findings (e.g.):

    • Limitations, challenges, or areas for improvement encountered during the experimentation.
    • Observations on how Kueue schedules and manages jobs.
    • YAML configuration examples (See 4.).
    • Recommendations / Potential issues in terms of scalability, reliability, and user experience.
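
For (3) and (4), a minimal single-ClusterQueue setup plus a sample Job, in the spirit of the linked gist and the Kueue documentation. The resource names (default-flavor, cluster-queue, user-queue) match the experiments in the comments below; the namespace and the quota values are illustrative assumptions, not measured requirements:

```yaml
# ResourceFlavor: a label-less default flavor for homogeneous clusters.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# ClusterQueue: holds the quota that admitted workloads are counted against.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}  # admit Workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: "default-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 4
            - name: "memory"
              nominalQuota: 6Gi
---
# LocalQueue: the namespaced entry point users submit against.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default
spec:
  clusterQueue: cluster-queue
---
# Sample Job for (4): the queue-name label hands the Job to Kueue,
# which unsuspends it once quota is available.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true
  template:
    spec:
      containers:
        - name: main
          image: busybox
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: "1"
              memory: 200Mi
      restartPolicy: Never
```

After `kubectl apply`, the Job should show up via `kubectl get workloads` and stay suspended until the ClusterQueue admits it.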
@leonpawelzik leonpawelzik transferred this issue from another repository Apr 2, 2024
@leonpawelzik leonpawelzik added this to the Proof of Concept (Demo) milestone Apr 2, 2024
AdrianoKF commented Apr 9, 2024

(5) Monitoring:

  • minikube addons enable metrics-server followed by kubectl top node is a good starting point.
    • Be aware of the default metrics scraping interval of 60s, which is not granular enough for our purposes. Solution: increase the resolution to 10s: kubectl patch -n kube-system deployment/metrics-server --type=json --patch '[{ "op": "replace", "path": "/spec/template/spec/containers/0/args/4", "value": "--metric-resolution=10s" }]'
  • Monitoring load on the system level is also feasible: minikube ssh top (most interesting: the load average header, plus %CPU and %MEM for individual processes). Advantage: less delay than the Kubernetes metrics pipeline.

AdrianoKF commented Apr 9, 2024

(4) - RayCluster workloads

See https://kueue.sigs.k8s.io/docs/tasks/run/rayclusters/

tl;dr: When you put the kueue.x-k8s.io/queue-name label on a RayCluster CR, Kueue will govern the creation of the Ray cluster (i.e., will apply the resource quota against the node resources requested for the Ray cluster)

Installation steps for Ray (see docs):

  1. Set up Helm repo: helm repo add kuberay https://ray-project.github.io/kuberay-helm/
  2. Install KubeRay operator: helm install kuberay-operator kuberay/kuberay-operator
  3. Render a RayCluster CR and add the Kueue label kueue.x-k8s.io/queue-name: user-queue there: helm template raycluster kuberay/ray-cluster -f raycluster-values.yaml > raycluster-manifest.yaml

Note

This might be possible directly with helm install in the future, I've raised an issue: https://github.com/ray-project/kuberay-helm/issues/34

  4. Apply the manifest: kubectl apply -f raycluster-manifest.yaml
  5. kubectl get workload shows the Ray cluster:
    NAME                                  QUEUE        ADMITTED BY     AGE
    raycluster-raycluster-kuberay-fd8e5   user-queue   cluster-queue   14m
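
For step 3, a sketch of what the edited label looks like in the rendered manifest. The apiVersion and metadata.name depend on the chart version and release name (the name here is inferred from the workload output above), so treat both as assumptions:

```yaml
apiVersion: ray.io/v1  # may be ray.io/v1alpha1 on older KubeRay versions
kind: RayCluster
metadata:
  name: raycluster-kuberay
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # the added label; Kueue now governs admission
spec:
  # ... rest of the rendered chart output stays unchanged ...
```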
    

In order to limit the GPU usage, modify the ClusterQueue to also cover the nvidia.com/gpu resource type (and recreate the resources):

  resourceGroups:
-    - coveredResources: ["cpu", "memory"]
+    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "default-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 4
            - name: "memory"
              nominalQuota: 6Gi
+            - name: "nvidia.com/gpu"
+              nominalQuota: 1
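
For the GPU quota to take effect, the Ray pods must actually request nvidia.com/gpu. A hypothetical fragment of a RayCluster worker group requesting one GPU (field names per the RayCluster CRD; group name and CPU/memory values are illustrative):

```yaml
workerGroupSpecs:
  - groupName: gpu-group
    replicas: 1
    template:
      spec:
        containers:
          - name: ray-worker
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1  # GPU requests and limits must match in Kubernetes
```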

@leonpawelzik

This ticket can be closed, correct, @AdrianoKF?
Items (4) through (6) are already covered, aren't they?
