[Bug]: Nodes randomly booting into maintenance-mode and stuck in NotReady state #1499

Open
treylade opened this issue Oct 11, 2024 · 1 comment
Labels
bug Something isn't working

Comments

Description

Hello,

My kube.tf config generally works fine and I am very pleased with the solution.
Recently I observed that nodes randomly go into the state "NotReady,SchedulingDisabled"; it is usually just a single node at a time.
I no longer have SSH access to the affected node because its network is unreachable. A manual reboot resolves the issue, but that cannot be the permanent solution: I have to rely on my nodes staying in the "Ready" state, and the issue is not healed automatically.

I reached out to Hetzner support to investigate whether this is an issue on their side. The answer I got was that the node had booted into "maintenance mode", which usually happens due to kernel- or filesystem-related issues.

Kubernetes version: v1.30.5+k3s1
Kernel version: 6.11.0-1-default

I am using Longhorn as storage provider.

Do you have any experience with that issue?

Best regards,
Lars

Kube.tf file

locals {
  num_control_planes = 3
  control_plane_type = "cax21"
  num_workers_sm = 12
  worker_type_sm = "cax21"
  num_workers_md = 4
  worker_type_md = "cax31"
}

variable "OP_SERVICE_ACCOUNT_TOKEN" {
  description = "The 1Password service account token"
  type        = string
  sensitive   = true
}

variable "OP_VAULT_UUID" {
  description = "The 1Password vault uuid"
  type        = string
  sensitive   = true
}

provider "onepassword" {
  op_cli_path           = "/opt/homebrew/bin/op"
  service_account_token = var.OP_SERVICE_ACCOUNT_TOKEN
}

data "onepassword_vault" "secrets_vault" {
  uuid = var.OP_VAULT_UUID
}

data "onepassword_item" "hcloud_token" {
  vault    = data.onepassword_vault.secrets_vault.uuid
  title    = "Hetzner Cloud API Token"
}

data "onepassword_item" "internal_s3_credential" {
  vault = data.onepassword_vault.secrets_vault.uuid
  title = "S3 Credential"
}

provider "hcloud" {
  token = data.onepassword_item.hcloud_token.credential
}

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.43.0"
    }
    onepassword = {
      source = "1Password/onepassword"
      version = ">= 2.1.2"
    }
  }
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }

  hcloud_token = data.onepassword_item.hcloud_token.credential
  source = "kube-hetzner/kube-hetzner/hcloud"
  
  ssh_public_key  = file("~/.ssh/production_deploy_key.pub")
  ssh_private_key = file("~/.ssh/production_deploy_key")

  network_region = "eu-central"

  control_plane_nodepools = [
    {
      name        = "control-plane",
      server_type = local.control_plane_type,
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = local.num_control_planes
    }
  ]

  agent_nodepools = [
    {
      name                 = "worker-sm-1",
      server_type          = local.worker_type_sm,
      location             = "fsn1",
      labels               = [],
      taints               = [],
      count                = local.num_workers_sm / 2,
      longhorn_volume_size = 0
      placement_group      = "worker-sm-1"
    },
    {
      name                 = "worker-sm-2",
      server_type          = local.worker_type_sm,
      location             = "fsn1",
      labels               = [],
      taints               = [],
      count                = local.num_workers_sm / 2,
      longhorn_volume_size = 0
      placement_group      = "worker-sm-2"
    },
    {
      name                 = "worker-md",
      server_type          = local.worker_type_md,
      location             = "fsn1",
      labels               = [],
      taints               = [],
      count                = local.num_workers_md,
      longhorn_volume_size = 0
      placement_group      = "worker-md"
    }
  ]

  control_planes_custom_config = {
    etcd-expose-metrics = true
  }

  load_balancer_type = false

  etcd_s3_backup = {
    etcd-s3-endpoint        = "s3.eu-central-3.ionoscloud.com"
    etcd-s3-access-key      = data.onepassword_item.internal_s3_credential.username
    etcd-s3-secret-key      = data.onepassword_item.internal_s3_credential.credential
    etcd-s3-bucket          = "etcd-backup"
    etcd-s3-folder          = "production"
  }

  ingress_controller = "none"

  enable_metrics_server = false
  enable_local_storage = true

  automatically_upgrade_k3s = true
  automatically_upgrade_os  = true
  initial_k3s_channel       = "stable"

  cluster_name = "production-k3s"
  use_cluster_name_in_node_name = false

  extra_firewall_rules = [
    {
      description     = "Allow Outbound UDP Cloudflare Tunnel Requests"
      direction       = "out"
      protocol        = "udp"
      port            = "7844"
      destination_ips = ["0.0.0.0/0", "::/0"]
    },
    {
      description     = "Allow Outbound SMTP Requests"
      direction       = "out"
      protocol        = "tcp"
      port            = "587"
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]
  
  enable_cert_manager = false

  dns_servers = [
    "1.1.1.1",
    "8.8.8.8",
    "2606:4700:4700::1111",
  ]

  export_values = true
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

output "k3s_token" {
  value     = module.kube-hetzner.k3s_token
  sensitive = true
}

Screenshots

No response

Platform

Mac


treylade commented Oct 11, 2024

I suspect that this issue occurs during the automatic kernel upgrade. When I reboot a stuck node, I can observe that the node continues to automatically upgrade its kernel version.
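
If that suspicion is correct, a possible temporary mitigation might be to pause the automatic OS upgrades in kube.tf until the root cause is found. This is only a sketch based on the automatically_upgrade_os setting already used in the config above, not a verified fix:

module "kube-hetzner" {
  # ... rest of the configuration stays as above ...

  # Temporarily disable automatic OS upgrades (and the reboots they trigger)
  # while the maintenance-mode boots are being investigated.
  automatically_upgrade_os  = false

  # k3s upgrades are handled separately and can stay enabled.
  automatically_upgrade_k3s = true
}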
