
All pods restart with latest chart version #30

Open
didlawowo opened this issue Oct 23, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@didlawowo

I have deployed Langfuse with Helm. The previous version was working like a charm, but with the latest chart version I'm getting restarts and errors:

Name:             langfuse-556d667545-z2gsx
Namespace:        mlops
Priority:         0
Service Account:  langfuse
Node:             rtx/192.168.1.29
Start Time:       Wed, 23 Oct 2024 12:19:19 +0200
Labels:           app.kubernetes.io/instance=langfuse
                  app.kubernetes.io/name=langfuse
                  kuik.enix.io/managed=true
                  pod-template-hash=556d667545
Annotations:      kuik.enix.io/rewrite-images: true
Status:           Running
IP:               10.0.3.146
IPs:
  IP:           10.0.3.146
Controlled By:  ReplicaSet/langfuse-556d667545
Containers:
  langfuse:
    Container ID:   containerd://4ce8023b1b941f97bfd814dbb7a2ef85a312bdb4161a03b410faddb757a33ae3
    Image:          ghcr.io/langfuse/langfuse:2
    Image ID:       ghcr.io/langfuse/langfuse@sha256:bd6b98db2706a16529ef0b59b618463bef1cc0be2b60cfa776273a06595977a0
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 23 Oct 2024 19:02:19 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Wed, 23 Oct 2024 18:59:04 +0200
      Finished:     Wed, 23 Oct 2024 18:59:31 +0200
    Ready:          True
    Restart Count:  8
    Liveness:       http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/api/public/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      NODE_ENV:                      production
      HOSTNAME:                      0.0.0.0
      PORT:                          3000
      DATABASE_USERNAME:             postgres
      DATABASE_PASSWORD:             <set to the key 'postgres-password' in secret 'langfuse-postgresql'>  Optional: false
      DATABASE_HOST:                 langfuse-postgresql
      DATABASE_NAME:                 postgres_langfuse
      NEXTAUTH_URL:                  https://langfuse.dc-tech.work
      NEXTAUTH_SECRET:               <set to the key 'nextauth-secret' in secret 'langfuse-nextauth'>  Optional: false
      SALT:                          changeme
      TELEMETRY_ENABLED:             true
      NEXT_PUBLIC_SIGN_UP_DISABLED:  false
      ENABLE_EXPERIMENTAL_FEATURES:  false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bsqvg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  kube-api-access-bsqvg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Normal   Pulling    20m (x4 over 6h44m)     kubelet  Pulling image "ghcr.io/langfuse/langfuse:2"
  Normal   Pulled     20m                     kubelet  Successfully pulled image "ghcr.io/langfuse/langfuse:2" in 3.234s (3.247s including waiting). Image size: 218575921 bytes.
  Normal   Created    20m (x4 over 6h43m)     kubelet  Created container langfuse
  Normal   Started    20m (x4 over 6h43m)     kubelet  Started container langfuse
  Warning  Unhealthy  20m (x2 over 6h43m)     kubelet  Liveness probe failed: Get "http://10.0.3.146:3000/": dial tcp 10.0.3.146:3000: connect: connection refused
  Normal   Killing    20m (x2 over 21m)       kubelet  Container langfuse failed liveness probe, will be restarted
  Warning  Unhealthy  11m (x6 over 21m)       kubelet  Liveness probe failed: Get "http://10.0.3.146:3000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  11m (x9 over 21m)       kubelet  Readiness probe failed: Get "http://10.0.3.146:3000/api/public/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  6m52s (x40 over 6h43m)  kubelet  Readiness probe failed: Get "http://10.0.3.146:3000/api/public/health": dial tcp 10.0.3.146:3000: connect: connection refused
  Warning  BackOff    108s (x14 over 6h43m)   kubelet  Back-off restarting failed container langfuse in pod langfuse-556d667545-z2gsx_mlops(50343193-a7c6-4f42-8738-85177df5abef)
@dosubot added the bug label Oct 23, 2024

dosubot bot commented Oct 23, 2024

Hey there, @didlawowo! I'm here to help with any bugs, questions, or contributions you have. Let's tackle this issue together!

The pod restarts and errors you're experiencing with the latest version of the Langfuse Helm chart seem to be related to the liveness and readiness probes failing. The logs indicate that the probes are unable to connect to the specified endpoints, resulting in the pod being marked as unhealthy and subsequently restarted.

Here are a few steps you can take to troubleshoot and resolve the issue:

  1. Check Service Availability: Ensure that the Langfuse service is actually running and accessible on the expected port (3000). You can try accessing the service directly from within the cluster using a tool like curl to verify connectivity (see the example commands after this list).

  2. Increase Probe Timeouts: The current timeout for both liveness and readiness probes is set to 1 second. This might be too short if the service takes longer to respond. Consider increasing the timeout to give the service more time to respond (a values sketch follows at the end of this comment).

  3. Review Recent Changes: If the previous version was working fine, review any changes made in the latest version of the Helm chart that might affect the service's startup time or resource usage. There might be changes that require additional configuration or resources.

  4. Check Resource Limits: Ensure that the pod has sufficient CPU and memory resources allocated. Insufficient resources can lead to slow startup times and probe failures.

  5. Database Connectivity: Since Langfuse relies on a Postgres database, ensure that the database is accessible and properly configured. Any issues with the database connectivity can affect the readiness of the application.

  6. Logs and Metrics: Check the application logs and any available metrics for more detailed error messages or performance issues that might be causing the probes to fail.
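
For steps 1 and 5, a quick way to test this from inside the cluster is a throwaway pod in the same namespace. This is only a sketch: the service name (langfuse), port (3000), and database host are taken from the describe output above and may differ in your release.

# Hit the readiness endpoint through the Service
kubectl run curl-test -n mlops --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sv http://langfuse:3000/api/public/health

# Check that Postgres is reachable (host from DATABASE_HOST above, default port 5432 assumed)
kubectl run pg-test -n mlops --rm -it --restart=Never --image=postgres:16 --command -- \
  pg_isready -h langfuse-postgresql -p 5432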

By addressing these areas, you should be able to identify and resolve the cause of the pod restarts.
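
As a concrete starting point for steps 2 and 4, probe timings and resources can usually be overridden through the chart values. The keys below are an assumption about the values layout, not the chart's documented API, so compare them against the chart's values.yaml before applying with helm upgrade -f values.yaml:

# values.yaml sketch -- key names are assumptions, verify against the chart you deploy
langfuse:
  livenessProbe:
    initialDelaySeconds: 30   # give the app time to boot before the first check
    timeoutSeconds: 5         # the current probes time out after 1s
    failureThreshold: 5
  readinessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 5
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi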

To continue talking to Dosu, mention @dosu.

@marcklingen
Member

can you share logs from the container instance?
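
For example, something like this would capture both the current and the previously crashed container (pod name taken from the describe output above):

kubectl logs -n mlops langfuse-556d667545-z2gsx --previous
kubectl logs -n mlops deploy/langfuse --tail=200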

@Steffen911
Contributor

@didlawowo Could you also share the resources that you've allocated to the pod? How much traffic does it usually receive over a minute?
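
For example (note that QoS Class: BestEffort in the describe output above suggests no requests or limits are set at the moment):

kubectl get pod -n mlops langfuse-556d667545-z2gsx -o jsonpath='{.spec.containers[0].resources}'
kubectl top pod -n mlops -l app.kubernetes.io/name=langfuse   # requires metrics-server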
