Deploying Runner Scale Sets with dind mode for ARC stuck at pending stage without spawned pod #3451

Closed
4 tasks done
ajisetyoko opened this issue Apr 18, 2024 · 6 comments
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode question Further information is requested

Comments

@ajisetyoko

ajisetyoko commented Apr 18, 2024

Controller Version

0.8.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install the actions-runner-controller with Helm


helm install arc --namespace="arc-systems" --create-namespace  actions-runner-controller/
# In Chart.yaml under actions-runner-controller folder
apiVersion: v2
type: application
name: actions-runner-controller
version: 1.0.0
dependencies:
  - name: gha-runner-scale-set-controller
    version: "0.8.3"
    repository: "oci://ghcr.io/actions/actions-runner-controller-charts"
2. Install a runner scale set (without dind), referred to here as runner1
# Install with Terraform
# main.tf
resource "helm_release" "scale-set-runner-ubuntu" {
  name        = "runner-tf-test1"
  namespace   = "runner-tf"
  repository  = "oci://ghcr.io/actions/actions-runner-controller-charts"
  version     = "0.8.3"
  chart       = "gha-runner-scale-set"
  
  values = [
    "${file("values.yaml")}"
  ]

  set {
    name  = "githubConfigSecret"
    value = var.githubConfigSecret
  }

  set {
    name = "githubConfigUrl"
    value = "https://github.com/org"
  }

}

# values.yaml
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/quipper/actions-runner:2.314.1-ubuntu20
        command: ["/home/runner/run.sh"]
3. Test runner1 in a GitHub Actions workflow (works fine)
4. Create a new runner scale set with the dind container mode
5. The runner is successfully registered, but the job is never assigned

Describe the bug

GitHub successfully connects to the cluster and the runner gets registered, but the job stays stuck in the pending state and no runner pod is spawned.

kubectl api-resources --verbs=list --namespaced -o name   | xargs -n 1 kubectl get --show-kind --ignore-not-found -n runner1-dind
NAME                         DATA   AGE
configmap/kube-root-ca.crt   1      17d
NAME                                              TYPE                 DATA   AGE
secret/pre-defined-secret                         Opaque               1      10h
secret/runner-ubuntu-dind-9xnzc-runner-bjnc9      Opaque               1      70m
secret/sh.helm.release.v1.runner-ubuntu-dind.v1   helm.sh/release.v1   1      40h
secret/sh.helm.release.v1.runner-ubuntu-dind.v2   helm.sh/release.v1   1      10h
NAME                                                     SECRETS   AGE
serviceaccount/default                                   0         17d
serviceaccount/runner-ubuntu-dind-gha-rs-no-permission   0         40h
NAME                                                         MINIMUM RUNNERS   MAXIMUM RUNNERS   CURRENT RUNNERS   STATE   PENDING RUNNERS   RUNNING RUNNERS   FINISHED RUNNERS   DELETING RUNNERS
autoscalingrunnerset.actions.github.com/runner-ubuntu-dind                                       1                         1                                                      
NAME                                                                       GITHUB CONFIG URL             RUNNERID   STATUS   JOBREPOSITORY   JOBWORKFLOWREF   WORKFLOWRUNID   JOBDISPLAYNAME   MESSAGE   AGE
ephemeralrunner.actions.github.com/runner-ubuntu-dind-9xnzc-runner-bjnc9   https://github.com/org   72                                                                                              70m
NAME                                                             DESIREDREPLICAS   CURRENTREPLICAS   PENDING RUNNERS   RUNNING RUNNERS   FINISHED RUNNERS   DELETING RUNNERS
ephemeralrunnerset.actions.github.com/runner-ubuntu-dind-9xnzc   1                 1                 1                                                      
NAME                                                                         ROLE                                        AGE
rolebinding.rbac.authorization.k8s.io/runner-ubuntu-dind-899d759f-listener   Role/runner-ubuntu-dind-899d759f-listener   10h
rolebinding.rbac.authorization.k8s.io/runner-ubuntu-dind-gha-rs-manager      Role/runner-ubuntu-dind-gha-rs-manager      10h
NAME                                                                  CREATED AT
role.rbac.authorization.k8s.io/runner-ubuntu-dind-899d759f-listener   2024-04-17T13:40:32Z
role.rbac.authorization.k8s.io/runner-ubuntu-dind-gha-rs-manager      2024-04-17T13:39:03Z
kubectl  -n arc-systems get po
NAME                                              READY   STATUS    RESTARTS   AGE
arc-gha-rs-controller-5894784bf6-m6qtp            1/1     Running   0          41d
runner-ubuntu-dind-899d759f-listener              1/1     Running   0          10h

Describe the expected behavior

I want to run a job as a container in GitHub Actions, and from my understanding, this approach should meet my needs. I have followed the official GitHub ARC setup documentation, but I've encountered an issue that I cannot debug. Any assistance you can provide would be greatly appreciated.

Additional Context

# values.yaml

# Same as step 2, except that it is installed in a different namespace and values.yaml contains:
containerMode:
  type: "dind"
  • I use GKE.
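
For comparison, a rough Helm CLI equivalent of the Terraform release from step 2 with dind mode switched on might look like the sketch below. The function name and its arguments are made up for illustration; the chart URL, chart version, and the `containerMode.type` value are the ones quoted in this issue.

```shell
# Hypothetical helper: installs a gha-runner-scale-set release in dind mode.
# Arguments: release name, namespace, GitHub config URL, name of the config secret.
install_dind_scale_set() {
  local release="$1" namespace="$2" config_url="$3" config_secret="$4"
  helm install "$release" \
    --namespace "$namespace" --create-namespace \
    --version 0.8.3 \
    --set githubConfigUrl="$config_url" \
    --set githubConfigSecret="$config_secret" \
    --set containerMode.type="dind" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
}
```

Using the names visible in the output above, and assuming `pre-defined-secret` is the GitHub config secret, a call could be `install_dind_scale_set runner-ubuntu-dind runner1-dind https://github.com/org pre-defined-secret`.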

Controller Logs

024-04-17T23:29:56Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 5}
2024-04-17T23:30:46Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 5}
2024-04-17T23:30:47Z    INFO    listener-app.listener   Message queue token is expired during GetNextMessage, refreshing...
2024-04-17T23:30:47Z    INFO    listener-app    refreshing token        {"githubConfigUrl": "https://github.com/org"}
2024-04-17T23:30:47Z    INFO    listener-app    getting runner registration token       {"registrationTokenURL": "https://api.github.com/orgs/org/actions/runners/registration-token"}
2024-04-17T23:30:47Z    INFO    listener-app    getting Actions tenant URL and JWT      {"registrationURL": "https://api.github.com/actions/runner-registration"}
2024-04-17T23:30:48Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 5}
2024-04-17T23:31:38Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 5}
2024-04-17T23:32:28Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 5}

Runner Pod Logs

I can't extract any logs for this problem right now (or maybe I don't know how to extract them). Please let me know how I can extract them, and I will share them.
@ajisetyoko ajisetyoko added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Apr 18, 2024
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Member

Hey @ajisetyoko,

Please provide the controller log. The log you provided is from the listener. Please also inspect whether the spec of the ephemeral runner is correct, and check whether kubectl describe TARGET_POD shows any indication of an error for the pod that is failing to start.
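
A minimal sketch of how those three checks could be run with kubectl is below. The namespaces come from the output in this issue; the deployment name `arc-gha-rs-controller` is an assumption derived from the controller pod name, and the function names are only illustrative.

```shell
# Assumed defaults, taken from the output earlier in this issue; adjust to your cluster.
CONTROLLER_NS="${CONTROLLER_NS:-arc-systems}"
CONTROLLER_DEPLOY="${CONTROLLER_DEPLOY:-arc-gha-rs-controller}"  # assumption: derived from the pod name
RUNNER_NS="${RUNNER_NS:-runner1-dind}"

# Print recent logs from the controller (not the listener).
controller_logs() {
  kubectl -n "$CONTROLLER_NS" logs "deployment/$CONTROLLER_DEPLOY" --tail=200
}

# Dump the full spec and status of one EphemeralRunner resource.
runner_spec() {
  kubectl -n "$RUNNER_NS" get ephemeralrunner "$1" -o yaml
}

# Show events and container states for a runner pod, if one was created.
describe_runner_pod() {
  kubectl -n "$RUNNER_NS" describe pod "$1"
}
```

For example, `runner_spec runner-ubuntu-dind-9xnzc-runner-bjnc9` would dump the ephemeral runner listed in the output above.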

@nikola-jokic nikola-jokic added question Further information is requested and removed bug Something isn't working needs triage Requires review from the maintainers labels Apr 19, 2024
@xpuska513

Hi @nikola-jokic, I have exactly the same issue, using our in-house Docker images for runners. After some digging, I found that the dind container fails with this message in its logs:

time="2024-04-19T10:38:35.936668364Z" level=warning msg="Running modprobe bridge br_netfilter failed with message: ip: can't find device 'bridge'\nbridge                307200  0 \nstp                    16384  1 bridge\nllc                    16384  2 bridge,stp\nip: can't find device 'br_netfilter'\nbr_netfilter           36864  0 \nbridge                307200  1 br_netfilter\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n, error: exit status 1"
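
To look for the same message on another cluster, something like the following sketch could pull the sidecar's logs. It assumes the Docker sidecar container is named `dind`, which should be confirmed with `kubectl describe pod`; the function name and the default namespace are placeholders.

```shell
# Hypothetical helper: print logs from the Docker-in-Docker sidecar of a runner pod.
# Arguments: pod name, optional namespace (defaults to the one used in this issue).
dind_logs() {
  local pod="$1" ns="${2:-runner1-dind}"
  # Assumption: the sidecar container is called "dind".
  kubectl -n "$ns" logs "$pod" -c dind
}
```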

@xpuska513

I think it might be related to #3257, since after applying the fix suggested by its author, everything started to work as expected.

@ajisetyoko
Author

Thanks for your help; I found the culprit. The issue arises from using GKE Autopilot, as evidenced by the log in the controller. Although I want to continue using Autopilot, changing the privileged setting to privileged=false hasn't resolved the issue. Nonetheless, I believe we can close this issue now.
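
For anyone unsure whether their cluster runs in Autopilot mode, one way to check might be the sketch below. The function name, cluster name, and region are placeholders; it relies on the `autopilot.enabled` field of the GKE cluster description, which is expected to be set only on Autopilot clusters.

```shell
# Hypothetical helper: print the Autopilot flag of a GKE cluster.
# Arguments: cluster name, region (Autopilot clusters are regional).
autopilot_enabled() {
  local cluster="$1" region="$2"
  gcloud container clusters describe "$cluster" \
    --region "$region" \
    --format="value(autopilot.enabled)"
}
```

An empty result would suggest a Standard cluster, in which case the pending runner probably has a different cause.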

@nikola-jokic
Member

Closing this one since it is not related to ARC itself.
