
hasura metadata apply uses massive amount of memory, ignores kubernetes resource limits #10601

Open
raweber42 opened this issue Nov 15, 2024 · 5 comments
Labels
k/bug Something isn't working

Comments

@raweber42

Version Information

Server Version:
CLI Version (for CLI related issue): hasura/graphql-engine:v2.44.0

Environment

Self-hosted

What is the current behaviour?

When running hasura metadata apply in a Kubernetes deployment, it gets killed because it hits my specified memory limit. It runs in a separate container next to the actual hasura container.

I even raised the limit to as much as 3Gi, but the container seems to ignore it and tries to grab all the memory it can (I guess it sees the memory of the Kubernetes node, not of the container inside the pod). When I don't specify a limit at all, it works, and I can see that memory usage spikes to at most 1.5Gi while hasura metadata apply runs, so a limit of 2Gi should be sufficient. What I think happens is that our cluster spins up a new node so that even more memory becomes available, and the apply command just tries to grab it all, ignoring the limit specified for the Kubernetes container.

So this does not work:

          resources:
            requests:
              memory: 512Mi
              cpu: 300m
            limits:
              memory: 3Gi
              cpu: 300m

But this does:

          resources:
            requests:
              memory: 512Mi
              cpu: 300m
            limits:
              # Omitting memory limit
              cpu: 300m

For the record: We don't have a crazy amount of metadata. Couple of remote schemas and ~12 DBs with permissions.

What is the expected behaviour?

metadata apply should stay within the memory limit specified in the deployment manifest. It probably should also not use such a large amount of memory in the first place.

How to reproduce the issue?

  1. Spin up 2 containers on kubernetes (one with hasura, one with access to the metadata DB and metadata files and hasura-cli installed)
  2. Define a low memory limit for the second container (like 256Mi)
  3. Run the hasura metadata apply command with a reasonable amount of metadata to apply.
  4. See that it gets killed (because it uses too much memory)
  5. Raise the memory limit to e.g. 1Gi
  6. Run the hasura metadata apply command again
  7. See that it still gets killed
  8. (Optional: Inspect resource usage with a tool of your choice to see the memory usage spike; a kubectl sketch follows below)
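For step 8, a minimal sketch of how the memory usage and the kill reason can be inspected with kubectl (pod name, container name and namespace are placeholders, and kubectl top requires metrics-server to be installed):

    # Watch per-container memory while the apply runs
    watch -n 5 kubectl top pod hasura-cli -n hasura --containers

    # After the container dies, confirm it was OOM-killed
    kubectl get pod hasura-cli -n hasura \
      -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'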

Screenshots or Screencast

Please provide any traces or logs that could help here.

Any possible solutions/workarounds you're aware of?

Don't set a memory limit at all.
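One untested alternative, not mentioned in this thread: the hasura CLI is a Go binary, and binaries built with Go 1.19 or newer honor the GOMEMLIMIT environment variable, a soft limit that makes the Go garbage collector try to stay under the given value. Whether the v2.44.0 CLI build picks this up is an assumption to verify:

    # Soft-limit the Go runtime below the container's hard limit (value is illustrative)
    GOMEMLIMIT=1GiB hasura metadata apply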

Keywords

memory, apply, metadata

raweber42 added the k/bug label on Nov 15, 2024
@robertjdominguez
Contributor

Hi, @raweber42 👋

I've reached out to our SRE team to see what guidance they can give on this behavior.

@nizar-m
Contributor

nizar-m commented Nov 20, 2024

The CPU and memory limits are enforced for a Kubernetes container in different ways.

When it comes to CPU, the kernel can actually limit the clock cycles allocated to a process, ensuring that it never gets more than its CPU limit allows.

If a process cannot get the memory it needs, it would simply hang. So instead, the kernel ensures that a process won't use more memory than specified by raising an Out Of Memory (OOM) event when its memory usage exceeds the limit, which results in the process being OOM-killed.

So the CPU limit throttles CPU usage, while the memory limit OOM-kills the process if it exceeds the specified value.

This is the same behavior whether the process runs as a Kubernetes container with memory limits or as a systemd process with memory limits.
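As a rough illustration of that last point, the same OOM-kill behavior can be reproduced outside Kubernetes by running the CLI in a transient systemd scope with a hard memory cap (a sketch; the project path and the 2G value are placeholders, and it assumes a host running systemd with cgroup support):

    # If the process exceeds MemoryMax, the kernel OOM-kills it,
    # just like a container exceeding its memory limit
    systemd-run --scope -p MemoryMax=2G -- hasura metadata apply --project /path/to/project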

@nizar-m
Contributor

nizar-m commented Nov 20, 2024

Regarding the high memory usage of metadata apply, it would help with debugging if you could share the metadata.

@raweber42
Author

raweber42 commented Nov 20, 2024

Thanks @nizar-m for getting back!

I get why the container is being killed by Kubernetes, thank you. The questions are: Why does hasura use that much memory? And, besides that: How do I determine an appropriate memory limit for hasura in Kubernetes?

Sadly, I cannot share our metadata just like this, but I can give you some more insight into what it contains.

Our metadata includes:

  • 12 databases of which most contain ~3 tables, but one even has 15 tables
  • one database also exposes SQL functions
  • all tables have some permissions for overall 4 different roles
  • all databases live in separate microservices
  • we set connection limits for those services because we had problems with the host running out of connections in the past.

Here is an example of a database.yaml config that looks similar to ours:

- name: db1
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL1
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 10
      use_prepared_statements: false
  tables: "!include db1/tables/tables.yaml"
- name: db2
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL2
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 2
      use_prepared_statements: false
  tables: "!include db2/tables/tables.yaml"
- name: db3
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL3
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 2
      use_prepared_statements: false
  tables: "!include db3/tables/tables.yaml"
  functions: "!include db3/functions/functions.yaml"
- name: db4
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL4
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 8
      use_prepared_statements: false
  tables: "!include db4/tables/tables.yaml"
- name: db5
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL5
      isolation_level: read-committed
      pool_settings:
        max_connections: 2
      use_prepared_statements: false
  tables: "!include db5/tables/tables.yaml"
- name: db6
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL6
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 2
      use_prepared_statements: false
  tables: "!include db6/tables/tables.yaml"
- name: db7
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL7
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 2
      use_prepared_statements: false
  tables: "!include db7/tables/tables.yaml"
- name: db8
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL8
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 2
      use_prepared_statements: false
  tables: "!include db8/tables/tables.yaml"
- name: db9
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL9
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 8
      use_prepared_statements: false
  tables: "!include db9/tables/tables.yaml"
- name: db10
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL10
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 2
      use_prepared_statements: false
  tables: "!include db10/tables/tables.yaml"
- name: db11
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: MICROSERVICE_DB_URL11
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 600
        max_connections: 15
      use_prepared_statements: false
  tables: "!include db11/tables/tables.yaml"

Additionally we use:

  • a couple of actions
  • remote schemas for custom mutations, exposed by every microservice
    ---> these total to around 50 custom mutations and around 15 queries, evenly distributed between all the databases/microservices

I would be very glad to get some insight into how hasura handles such an - admittedly big - amount of metadata. And if you could give me a recommendation on how to find the right memory limit, I would be very grateful! (I tried trial and error, but as described above, the memory usage does not seem to be very predictable: when not setting a limit, I can see that usage is around 1.5Gi, yet even with a limit of 2Gi the container gets OOM-killed 😅)

@nizar-m
Contributor

nizar-m commented Nov 21, 2024

The memory usage could be 1.5Gi according to the collected metrics, but during metadata apply it might exceed the 2Gi limit for a brief period of time, resulting in an OOMKill.

Hasura builds in-memory structures for serving queries quickly, with a size roughly proportional to tables × roles. The right memory limit needs to be found via trial and error, and memory usage can fluctuate depending on usage patterns.

I get that metadata apply is taking a lot of memory. During the design of graphql-engine, our main focus was on keeping the latency of GraphQL queries as low as possible, and doing that does seem to result in high memory usage for some users.
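One way to catch a spike that is too short-lived for the metrics pipeline to record (a sketch, not from this thread; pod and container names are placeholders, and it assumes cgroup v2 mounted at /sys/fs/cgroup inside the container) is to poll the cgroup's current usage every second while the apply runs:

    kubectl exec hasura-cli -c cli -n hasura -- \
      sh -c 'while true; do cat /sys/fs/cgroup/memory.current; sleep 1; done'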
