[Bug]: Nodes randomly booting into maintenance-mode and stuck in NotReady state #1499

Open
treylade opened this issue Oct 11, 2024 · 1 comment
Labels
bug Something isn't working

Comments

Description

Hello,

My kube.tf config generally works fine and I am very pleased with the solution.
Recently I observed that nodes randomly go into the state "NotReady,SchedulingDisabled"; it is usually just a single node at a time.
I no longer have SSH access to the affected node because its network is unreachable. A manual reboot resolves the issue, but that cannot be the permanent solution: I have to rely on my nodes staying in the "Ready" state, and the issue is not healed automatically.

I reached out to Hetzner support to investigate whether this is an issue on their side. The answer I got was that the node had booted into "maintenance mode", which usually happens due to kernel- or filesystem-related issues.

Kubernetes version: v1.30.5+k3s1
Kernel version: 6.11.0-1-default

I am using Longhorn as storage provider.

Do you have any experience with that issue?

Best regards,
Lars

Kube.tf file

locals {
  num_control_planes = 3
  control_plane_type = "cax21"
  num_workers_sm = 12
  worker_type_sm = "cax21"
  num_workers_md = 4
  worker_type_md = "cax31"
}

variable "OP_SERVICE_ACCOUNT_TOKEN" {
  description = "The 1Password service account token"
  type        = string
  sensitive   = true
}

variable "OP_VAULT_UUID" {
  description = "The 1Password vault uuid"
  type        = string
  sensitive   = true
}

provider "onepassword" {
  op_cli_path           = "/opt/homebrew/bin/op"
  service_account_token = var.OP_SERVICE_ACCOUNT_TOKEN
}

data "onepassword_vault" "secrets_vault" {
  uuid = var.OP_VAULT_UUID
}

data "onepassword_item" "hcloud_token" {
  vault    = data.onepassword_vault.secrets_vault.uuid
  title    = "Hetzner Cloud API Token"
}

data "onepassword_item" "internal_s3_credential" {
  vault = data.onepassword_vault.secrets_vault.uuid
  title = "S3 Credential"
}

provider "hcloud" {
  token = data.onepassword_item.hcloud_token.credential
}

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.43.0"
    }
    onepassword = {
      source = "1Password/onepassword"
      version = ">= 2.1.2"
    }
  }
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }

  hcloud_token = data.onepassword_item.hcloud_token.credential
  source = "kube-hetzner/kube-hetzner/hcloud"
  
  ssh_public_key  = file("~/.ssh/production_deploy_key.pub")
  ssh_private_key = file("~/.ssh/production_deploy_key")

  network_region = "eu-central"

  control_plane_nodepools = [
    {
      name        = "control-plane",
      server_type = local.control_plane_type,
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = local.num_control_planes
    }
  ]

  agent_nodepools = [
    {
      name                 = "worker-sm-1",
      server_type          = local.worker_type_sm,
      location             = "fsn1",
      labels               = [],
      taints               = [],
      count                = local.num_workers_sm / 2,
      longhorn_volume_size = 0
      placement_group      = "worker-sm-1"
    },
    {
      name                 = "worker-sm-2",
      server_type          = local.worker_type_sm,
      location             = "fsn1",
      labels               = [],
      taints               = [],
      count                = local.num_workers_sm / 2,
      longhorn_volume_size = 0
      placement_group      = "worker-sm-2"
    },
    {
      name                 = "worker-md",
      server_type          = local.worker_type_md,
      location             = "fsn1",
      labels               = [],
      taints               = [],
      count                = local.num_workers_md,
      longhorn_volume_size = 0
      placement_group      = "worker-md"
    }
  ]

  control_planes_custom_config = {
    etcd-expose-metrics = true
  }

  load_balancer_type = false

  etcd_s3_backup = {
    etcd-s3-endpoint        = "s3.eu-central-3.ionoscloud.com"
    etcd-s3-access-key      = data.onepassword_item.internal_s3_credential.username
    etcd-s3-secret-key      = data.onepassword_item.internal_s3_credential.credential
    etcd-s3-bucket          = "etcd-backup"
    etcd-s3-folder          = "production"
  }

  ingress_controller = "none"

  enable_metrics_server = false
  enable_local_storage = true

  automatically_upgrade_k3s = true
  automatically_upgrade_os  = true
  initial_k3s_channel       = "stable"

  cluster_name = "production-k3s"
  use_cluster_name_in_node_name = false

  extra_firewall_rules = [
    {
      description     = "Allow Outbound UDP Cloudflare Tunnel Requests"
      direction       = "out"
      protocol        = "udp"
      port            = "7844"
      destination_ips = ["0.0.0.0/0", "::/0"]
    },
    {
      description     = "Allow Outbound SMTP Requests"
      direction       = "out"
      protocol        = "tcp"
      port            = "587"
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]
  
  enable_cert_manager = false

  dns_servers = [
    "1.1.1.1",
    "8.8.8.8",
    "2606:4700:4700::1111",
  ]

  export_values = true
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

output "k3s_token" {
  value     = module.kube-hetzner.k3s_token
  sensitive = true
}

Screenshots

No response

Platform

Mac


treylade commented Oct 11, 2024

I suspect that this issue occurs during the automatic kernel upgrade. When I reboot a stuck node, I can observe that the node continues to automatically upgrade its kernel version.
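
If that suspicion is correct, a possible temporary mitigation might be to pause the automatic OS upgrades in kube.tf until the root cause is found. This is only a sketch based on the automatically_upgrade_os setting already used in the config above, not a verified fix:

module "kube-hetzner" {
  # ... rest of the configuration stays as above ...

  # Temporarily disable automatic OS upgrades (and the reboots they trigger)
  # while the maintenance-mode boots are being investigated.
  automatically_upgrade_os  = false

  # k3s upgrades are handled separately and can stay enabled.
  automatically_upgrade_k3s = true
}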
