[k8s] Resource limit is not set when using --cpus 2 #4482

Open
gaocegege opened this issue Dec 18, 2024 · 3 comments

@gaocegege

When I run sky launch --cpus 2 ./hello-sky/task.yaml, I expect the pod's CPU limit to be set to 2, but only the CPU request is set to 2.

$ sky launch --cpus 2 ./hello-sky/task.yaml 
Task from YAML spec: ./hello-sky/task.yaml
Considered resources (1 node):
---------------------------------------------------------------------------------------------
 CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------
 Kubernetes   2CPU--2GB   2       2         -              kind-kind     0.00          ✔     
---------------------------------------------------------------------------------------------
Launching a new cluster 'sky-1eee-gaocegege'. Proceed? [Y/n]: 
$ kubectl get pods -o yaml
...
    - containerPort: 8266
      protocol: TCP
    resources:
      requests:
        cpu: "2"
        memory: 2G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
...

I'm not sure whether this is a bug, a feature, or if I simply overlooked something.

Version & Commit info:

  • sky -v: skypilot, version 1.0.0-dev0
  • sky -c: skypilot, commit f0ebf13
@romilbhardwaj
Collaborator

romilbhardwaj commented Dec 18, 2024

Hey @gaocegege - this is by design. We do not set CPU limits so that pods can use idle resources on nodes. See https://home.robusta.dev/blog/stop-using-cpu-limits for an explanation. Curious to learn why you'd like to enforce limits.

For memory, there is an argument to be made for using limits since it is incompressible, but we found that strict memory enforcement with limits leads to a worse user experience (most users don't know exactly how much memory they need and end up with wasted memory or premature OOMKills).
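To make the request/limit distinction above concrete, here is a minimal sketch of how Kubernetes CPU values are commonly translated into Linux cgroup settings. It assumes cgroup v1 names (cpu.shares, cpu.cfs_quota_us) and the default 100 ms CFS period; the function names are hypothetical and this is a simplified model, not SkyPilot or kubelet code.

```python
from typing import Optional

# Assumption: default CFS scheduling period of 100 ms (100,000 microseconds).
CFS_PERIOD_US = 100_000


def cpu_shares(request_cpus: float) -> int:
    """Relative CPU weight derived from a request (1 CPU = 1024 shares).

    Shares only matter when CPUs are contended; an idle node lets the
    container use more than its request.
    """
    return max(2, int(request_cpus * 1024))


def cfs_quota_us(limit_cpus: Optional[float]) -> int:
    """Hard CPU cap per period derived from a limit; -1 means no cap."""
    if limit_cpus is None:
        return -1
    return int(limit_cpus * CFS_PERIOD_US)


# The pod in this issue: request of 2 CPUs, no limit.
print(cpu_shares(2), cfs_quota_us(None))
# The same pod if a limit of 2 CPUs were also set.
print(cpu_shares(2), cfs_quota_us(2))
```

With only a request, the container gets a scheduling weight but no cap, so it can burst into idle CPU. Adding a limit introduces a quota, which is what causes throttling once the container exceeds it.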

@gaocegege
Author

Curious to learn why you'd like to enforce limits.

I understand that CPU throttling is not ideal for jobs. In my situation, I have a cluster with one node and 32 CPUs. I want to run a local inference service alongside some jobs. My goal is to allocate more CPUs to inference while setting limits on the jobs.

I found this in the documentation https://docs.skypilot.co/en/latest/reference/cli.html#sky-launch

Number of vCPUs each instance must have (e.g., --cpus=4 (exactly 4) or --cpus=4+ (at least 4)). This is used to automatically select the instance type.

This makes me think that --cpus=4 is different from --cpus=4+. Without limits in a Kubernetes cloud environment, there wouldn’t be any difference.

@gaocegege
Author

gaocegege commented Dec 19, 2024

For memory, there is an argument to be made for using limits since it is incompressible, but we found that strict memory enforcement with limits leads to a worse user experience (most users don't know exactly how much memory they need and end up with wasted memory or premature OOMKills).

I agree that most users do not know their memory usage, especially for jobs. But would it be possible to support a richer syntax for the --cpus and --memory arguments? For example:

  • --cpus=2-4: means requests=2, limit=4
  • --cpus=2: means limit=2
  • --cpus=2+: means requests=2

This would allow for more flexible resource management. I understand that user experience matters, but from an admin's perspective it is also important to keep an option for users who need to enforce limits for advanced scheduling. It is also necessary for improving utilization.
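The proposed syntax could be parsed roughly as follows. This is a sketch of the proposal in this comment only: parse_cpus is a hypothetical helper, not SkyPilot's actual parser, and the syntax is not implemented in SkyPilot as far as this thread shows. For the bare --cpus=2 form the sketch also sets requests=2, on the assumption that Kubernetes defaults the request to the limit when only a limit is given.

```python
from typing import Dict, Optional


def parse_cpus(spec: str) -> Dict[str, Optional[float]]:
    """Map a proposed --cpus value to Kubernetes-style requests/limits.

    "2+"  -> request only (no limit)
    "2-4" -> request 2, limit 4
    "2"   -> limit 2 (request assumed equal to the limit)
    """
    spec = spec.strip()
    if spec.endswith("+"):
        return {"requests": float(spec[:-1]), "limits": None}
    if "-" in spec:
        low_text, high_text = spec.split("-", 1)
        low, high = float(low_text), float(high_text)
        if low > high:
            raise ValueError(f"request {low} exceeds limit {high}")
        return {"requests": low, "limits": high}
    value = float(spec)
    return {"requests": value, "limits": value}


for example in ("2-4", "2", "2+"):
    print(example, parse_cpus(example))
```

Under this reading, --cpus=2+ keeps today's behavior (request only), so existing users would see no change unless they opt into the exact or range forms.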

@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024