Experiment with Kueue via YAMLs #2

Closed
3 of 6 tasks
leonpawelzik opened this issue Mar 28, 2024 · 3 comments

leonpawelzik commented Mar 28, 2024

Explore Kueue functionality through YAML configuration

As part of our infrastructure product development, we need to investigate and experiment with Kueue. The focus of this ticket is the exploration of Kueue's features and capabilities using YAML configuration files.

Please store the .yaml files that were used for experimentation in this repository.

  • 1. Use team development environment [LINK TO ISSUE]

  • 2. Install Kueue on the Kubernetes cluster following the official documentation.

  • 3. Create YAML configuration files for the following Kueue resources: https://gist.github.com/AdrianoKF/52a74a7c701a0aa19f63ffe298eb51a4#file-single-clusterqueue-setup-yaml

    • Queues
    • Resource Flavors
    • Resource Quotas
  • 4. Submit jobs using YAML configuration files:

  • 5. Monitor and observe the behavior of Kueue:

    • Check the status of submitted jobs using Kubernetes and Kueue CLI tools.
    • Verify correct job scheduling.
    • Analyze the resource utilization of the Kubernetes cluster while jobs are running.
  • 6. Document the findings (e.g.):

    • Limitations, challenges, or areas for improvement encountered during the experimentation.
    • Observations on how Kueue schedules and manages jobs.
    • YAML configuration examples (See 4.).
    • Recommendations / Potential issues in terms of scalability, reliability, and user experience.
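
For (3) and (4), a minimal single-ClusterQueue setup plus a sample Job, in the spirit of the linked gist and the Kueue documentation. The resource names (default-flavor, cluster-queue, user-queue) match the experiments in the comments below; the namespace and the quota values are illustrative assumptions, not measured requirements:

```yaml
# ResourceFlavor: a label-less default flavor for homogeneous clusters.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# ClusterQueue: holds the quota that admitted workloads are counted against.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}  # admit Workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: "default-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 4
            - name: "memory"
              nominalQuota: 6Gi
---
# LocalQueue: the namespaced entry point users submit against.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default
spec:
  clusterQueue: cluster-queue
---
# Sample Job for (4): the queue-name label hands the Job to Kueue,
# which unsuspends it once quota is available.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true
  template:
    spec:
      containers:
        - name: main
          image: busybox
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: "1"
              memory: 200Mi
      restartPolicy: Never
```

After `kubectl apply`, the Job should show up via `kubectl get workloads` and stay suspended until the ClusterQueue admits it.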
@leonpawelzik leonpawelzik transferred this issue from another repository Apr 2, 2024
@leonpawelzik leonpawelzik added this to the Proof of Concept (Demo) milestone Apr 2, 2024
AdrianoKF commented Apr 9, 2024

(5) Monitoring:

  • minikube addons enable metrics-server followed by kubectl top node is a good starting point.
    • Be aware of the default metrics scraping interval of 60s, which is not granular enough for our purposes. Solution: increase the resolution to 10s: kubectl patch -n kube-system deployment/metrics-server --type=json --patch '[{ "op": "replace", "path": "/spec/template/spec/containers/0/args/4", "value": "--metric-resolution=10s" }]'
  • Monitoring load on the system level is also feasible: minikube ssh top (most interesting: the load average header, plus %CPU and %MEM for individual processes). Advantage: less delay than the Kubernetes metrics pipeline.

AdrianoKF commented Apr 9, 2024

(4) - RayCluster workloads

See https://kueue.sigs.k8s.io/docs/tasks/run/rayclusters/

tl;dr: When you put the kueue.x-k8s.io/queue-name label on a RayCluster CR, Kueue will govern the creation of the Ray cluster (i.e., will apply the resource quota against the node resources requested for the Ray cluster)

Installation steps for Ray (see docs):

  1. Set up Helm repo: helm repo add kuberay https://ray-project.github.io/kuberay-helm/
  2. Install KubeRay operator: helm install kuberay-operator kuberay/kuberay-operator
  3. Render a RayCluster CR and add the Kueue label kueue.x-k8s.io/queue-name: user-queue there: helm template raycluster kuberay/ray-cluster -f raycluster-values.yaml > raycluster-manifest.yaml

Note

This might be possible directly with helm install in the future, I've raised an issue: https://github.com/ray-project/kuberay-helm/issues/34

  4. Apply the manifest: kubectl apply -f raycluster-manifest.yaml
  5. kubectl get workload shows the Ray cluster:
    NAME                                  QUEUE        ADMITTED BY     AGE
    raycluster-raycluster-kuberay-fd8e5   user-queue   cluster-queue   14m
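
For step 3, a sketch of what the edited label looks like in the rendered manifest. The apiVersion and metadata.name depend on the chart version and release name (the name here is inferred from the workload output above), so treat both as assumptions:

```yaml
apiVersion: ray.io/v1  # may be ray.io/v1alpha1 on older KubeRay versions
kind: RayCluster
metadata:
  name: raycluster-kuberay
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # the added label; Kueue now governs admission
spec:
  # ... rest of the rendered chart output stays unchanged ...
```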
    

In order to limit the GPU usage, modify the ClusterQueue to also cover the nvidia.com/gpu resource type (and recreate the resources):

  resourceGroups:
-    - coveredResources: ["cpu", "memory"]
+    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "default-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 4
            - name: "memory"
              nominalQuota: 6Gi
+            - name: "nvidia.com/gpu"
+              nominalQuota: 1
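
For the GPU quota to take effect, the Ray pods must actually request nvidia.com/gpu. A hypothetical fragment of a RayCluster worker group requesting one GPU (field names per the RayCluster CRD; group name and CPU/memory values are illustrative):

```yaml
workerGroupSpecs:
  - groupName: gpu-group
    replicas: 1
    template:
      spec:
        containers:
          - name: ray-worker
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1  # GPU requests and limits must match in Kubernetes
```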

@leonpawelzik

This ticket can be closed, correct, @AdrianoKF?
Items (4) through (6) are already covered, aren't they?
