Runners created with actions-runner-controller: we have a lot of pods with the error "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?" #3257

Open
viniciusesteter opened this issue Feb 1, 2024 · 17 comments
Labels: bug (Something isn't working), community (Community contribution), needs triage (Requires review from the maintainers)

Comments

viniciusesteter commented Feb 1, 2024

Controller Version

latest

Helm Chart Version

0.27.6

CertManager Version

1.13.1

Deployment Method

Helm

cert-manager installation

Installed OK via Chart.yaml

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1 
kind: RunnerDeployment
metadata:
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  name: {{ .Values.runnerDeploymentDev.name }}
  namespace: {{ .Release.Namespace }}
  {{- end  }}
  {{- if hasSuffix "-prd" .Release.Namespace }}
  name: {{ .Values.runnerDeploymentPrd.name }}
  namespace: {{ .Release.Namespace }}
  {{- end }}
spec:
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  replicas: {{ .Values.runnerDeploymentDev.replicas }}
  {{- end }}
  {{- if hasSuffix "-prd" .Release.Namespace }}
  replicas: {{ .Values.runnerDeploymentPrd.replicas }}
  {{- end }}
  template:
    spec:
      {{- if hasSuffix "-dev" .Release.Namespace  }}
      image: {{ .Values.runnerDeploymentDev.image }} ## Change this to the DEV repository
      {{- end }}
      {{- if hasSuffix "-prd" .Release.Namespace }}
      image: {{ .Values.runnerDeploymentPrd.image }} ## Change this to the PRD repository
      {{- end }}
      organization: company-a
      {{- if hasSuffix "-dev" .Release.Namespace  }}
      labels:
        {{- range .Values.runnerDeploymentDev.labels }}
        {{ "-" }} {{ . }}
        {{- end }}
      {{- end }}
      {{- if hasSuffix "-prd" .Release.Namespace }}
      labels:
        {{- range .Values.runnerDeploymentPrd.labels }}
        {{ "-" }} {{ . }}
        {{- end }}
      {{- end }}
      env:
        - name: teste
          {{- if hasSuffix "-dev" .Release.Namespace  }}
          value: a
          {{- end }}
          {{- if hasSuffix "-prd" .Release.Namespace  }}
          value: b
          {{- end }}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  name: {{ .Values.HpaDev.name }}
  namespace: {{ .Release.Namespace }}
  {{- end }}
  {{- if hasSuffix "-prd" .Release.Namespace }} 
  name: {{ .Values.HpaPrd.name }}
  namespace: {{ .Release.Namespace }}
  {{- end }}
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    {{- if hasSuffix "-dev" .Release.Namespace  }}
    name: {{ .Values.HpaDev.nameRunner }}
    {{- end }}
    {{- if hasSuffix "-prd" .Release.Namespace }} 
    name: {{ .Values.HpaPrd.nameRunner }}
    {{- end }}
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  minReplicas: {{ .Values.HpaDev.minReplicas }}
  maxReplicas: {{ .Values.HpaDev.maxReplicas }}
  scaleDownDelaySecondsAfterScaleOut: {{ .Values.HpaDev.scaleDownDelaySecondsAfterScaleOut }}
  metrics:
  - type: {{ .Values.HpaDev.type }}
    scaleUpThreshold: '{{ .Values.HpaDev.scaleUpThreshold }}'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '{{ .Values.HpaDev.scaleDownThreshold }}'  # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpAdjustment: {{ .Values.HpaDev.scaleUpAdjustment }}        # The scale up runner count added to desired count
    scaleDownAdjustment: {{ .Values.HpaDev.scaleDownAdjustment }}     # The scale down runner count subtracted from the desired count
    # We can use either the Factor or the Adjustment parameters above, but not both together.
    # scaleUpFactor: {{ .Values.HpaDev.scaleUpFactor }}        # The scale up runner count added to desired count
    # scaleDownFactor: {{ .Values.HpaDev.scaleDownFactor }}     # The scale down runner count subtracted from the desired count
    
  {{- end }}
  {{- if hasSuffix "-prd" .Release.Namespace }}
  minReplicas: {{ .Values.HpaPrd.minReplicas }}
  maxReplicas: {{ .Values.HpaPrd.maxReplicas }}
  scaleDownDelaySecondsAfterScaleOut: {{ .Values.HpaPrd.scaleDownDelaySecondsAfterScaleOut }}
  metrics:
  - type: {{ .Values.HpaPrd.type }}
    scaleUpThreshold: '{{ .Values.HpaPrd.scaleUpThreshold }}'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '{{ .Values.HpaPrd.scaleDownThreshold }}'  # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpAdjustment: {{ .Values.HpaPrd.scaleUpAdjustment }}       # The scale up runner count added to desired count
    scaleDownAdjustment: {{ .Values.HpaPrd.scaleDownAdjustment }}     # The scale down runner count subtracted from the desired count
    # We can use either the Factor or the Adjustment parameters above, but not both together.
    # scaleUpFactor: {{ .Values.HpaPrd.scaleUpFactor }}       # The scale up runner count added to desired count
    # scaleDownFactor: {{ .Values.HpaPrd.scaleDownFactor }}     # The scale down runner count subtracted from the desired count
  {{- end }}

To Reproduce

1. Create a runner
2. Watch runner container logs

Describe the bug

I'm using GKE version 1.26.10-gke.1101000. In my Dockerfile, I'm using FROM summerwind/actions-runner:latest.

In values.yaml, I'm using:

image:
  repository: "summerwind/actions-runner-controller"
  actionsRunnerRepositoryAndTag: "summerwind/actions-runner:latest"
  dindSidecarRepositoryAndTag: "docker:dind"
  pullPolicy: IfNotPresent
  # The default image-pull secrets name for self-hosted runner container.
  # It's added to spec.ImagePullSecrets of self-hosted runner pods. 
  actionsRunnerImagePullSecrets: []

But once the deployment is done in GKE, I get a lot of pods with the error: "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?"

The pods keep restarting with an error in the "docker" container with this message: "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?". The container dies and a new one starts with the same problem.

I've already followed issue 2490, but it doesn't work.

Could you help me, please?

Describe the expected behavior

The runners shouldn't end up in this error state; they should run normally.

Whole Controller Logs

No error logs in the controller.

Whole Runner Pod Logs

In the pods I get the same error: "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?"

I've tried to change $DOCKER_HOST to DOCKER_HOST="tcp://localhost:2375", but when I open a runner that I can reach and do echo $DOCKER_HOST, the response is unix:///run/docker.sock. I don't think this is the cause of the error.
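
For context, in recent summerwind chart versions the runner and the docker sidecar share the daemon socket through a common volume, roughly as in the sketch below (an illustrative fragment of the generated pod spec, not the exact manifest; container, volume, and mount names may differ by chart version). Overriding DOCKER_HOST on the runner container therefore doesn't help while dockerd in the sidecar fails to start.

containers:
  - name: runner
    env:
      - name: DOCKER_HOST
        value: unix:///run/docker.sock   # points at the socket shared via the volume below
    volumeMounts:
      - name: var-run
        mountPath: /run
  - name: docker                         # the dind sidecar that is crash-looping here
    image: docker:dind
    volumeMounts:
      - name: var-run
        mountPath: /run
volumes:
  - name: var-run
    emptyDir: {}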

Additional Context

No response

viniciusesteter added the bug, community, and needs triage labels on Feb 1, 2024

github-actions bot commented Feb 1, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.


asafhm commented Feb 2, 2024

I think I'm facing the same issue as well. It happened to me a month ago and went away on its own, but I couldn't figure out why.
The logs of the docker container in the runner pods all emit this:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
time="2024-02-02T14:36:08.349688382Z" level=info msg="Starting up"
time="2024-02-02T14:36:08.350965386Z" level=info msg="containerd not running, starting managed containerd"
time="2024-02-02T14:36:08.351685472Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=30
time="2024-02-02T14:36:08.373648316Z" level=info msg="starting containerd" revision=7c3aca7a610df76212171d200ca3811ff6096eb8 version=v1.7.13
time="2024-02-02T14:36:08.392594430Z" level=info msg="loading plugin \"io.containerd.event.v1.exchange\"..." type=io.containerd.event.v1
time="2024-02-02T14:36:08.392636213Z" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
time="2024-02-02T14:36:08.392898621Z" level=info msg="loading plugin \"io.containerd.warning.v1.deprecations\"..." type=io.containerd.warning.v1
time="2024-02-02T14:36:08.392917009Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.392969909Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." error="no scratch file generator: skip plugin" type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.392983451Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.392992588Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2024-02-02T14:36:08.393000482Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.393063886Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.393264992Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.397803212Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"ip: can't find device 'aufs'\\nmodprobe: can't change directory to '/lib/modules': No such file or directory\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.397832591Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.398031842Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.398046545Z" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
time="2024-02-02T14:36:08.398150598Z" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
time="2024-02-02T14:36:08.398212774Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2024-02-02T14:36:08.398230252Z" level=info msg="metadata content store policy set" policy=shared
time="2024-02-02T14:36:08.445396814Z" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
time="2024-02-02T14:36:08.445471715Z" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
time="2024-02-02T14:36:08.445500464Z" level=info msg="loading plugin \"io.containerd.lease.v1.manager\"..." type=io.containerd.lease.v1
time="2024-02-02T14:36:08.445576869Z" level=info msg="loading plugin \"io.containerd.streaming.v1.manager\"..." type=io.containerd.streaming.v1
time="2024-02-02T14:36:08.445618783Z" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
time="2024-02-02T14:36:08.445781283Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
time="2024-02-02T14:36:08.446305234Z" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
time="2024-02-02T14:36:08.446589823Z" level=info msg="loading plugin \"io.containerd.runtime.v2.shim\"..." type=io.containerd.runtime.v2
time="2024-02-02T14:36:08.446619509Z" level=info msg="loading plugin \"io.containerd.sandbox.store.v1.local\"..." type=io.containerd.sandbox.store.v1
time="2024-02-02T14:36:08.446666322Z" level=info msg="loading plugin \"io.containerd.sandbox.controller.v1.local\"..." type=io.containerd.sandbox.controller.v1
time="2024-02-02T14:36:08.446705283Z" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446759587Z" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446780137Z" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446806016Z" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446835246Z" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446858358Z" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446883787Z" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446902822Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446932581Z" level=info msg="loading plugin \"io.containerd.grpc.v1.containers\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.446961217Z" level=info msg="loading plugin \"io.containerd.grpc.v1.content\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.446981273Z" level=info msg="loading plugin \"io.containerd.grpc.v1.diff\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.446997347Z" level=info msg="loading plugin \"io.containerd.grpc.v1.events\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447016883Z" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447036957Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447052236Z" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447070724Z" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447087998Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandbox-controllers\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447107438Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandboxes\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447123714Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447148047Z" level=info msg="loading plugin \"io.containerd.grpc.v1.streaming\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447166979Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447198670Z" level=info msg="loading plugin \"io.containerd.transfer.v1.local\"..." type=io.containerd.transfer.v1
time="2024-02-02T14:36:08.447442412Z" level=info msg="loading plugin \"io.containerd.grpc.v1.transfer\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447530474Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447564804Z" level=info msg="loading plugin \"io.containerd.internal.v1.restart\"..." type=io.containerd.internal.v1
time="2024-02-02T14:36:08.447645525Z" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
time="2024-02-02T14:36:08.447680777Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
time="2024-02-02T14:36:08.447698141Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
time="2024-02-02T14:36:08.447725487Z" level=info msg="skipping tracing processor initialization (no tracing plugin)" error="no OpenTelemetry endpoint: skip plugin"
time="2024-02-02T14:36:08.447855630Z" level=info msg="loading plugin \"io.containerd.grpc.v1.healthcheck\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447879733Z" level=info msg="loading plugin \"io.containerd.nri.v1.nri\"..." type=io.containerd.nri.v1
time="2024-02-02T14:36:08.447899819Z" level=info msg="NRI interface is disabled by configuration."
time="2024-02-02T14:36:08.448219892Z" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
time="2024-02-02T14:36:08.448288542Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
time="2024-02-02T14:36:08.448345253Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
time="2024-02-02T14:36:08.448380520Z" level=info msg="containerd successfully booted in 0.075643s"
time="2024-02-02T14:36:11.271697198Z" level=info msg="Loading containers: start."
time="2024-02-02T14:36:11.366705804Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
time="2024-02-02T14:36:11.367239329Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
time="2024-02-02T14:36:11.367283663Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
failed to start daemon: Error initializing network controller: error creating default "bridge" network: Failed to Setup IP tables: Unable to enable NAT rule:  (iptables failed: iptables --wait -t nat -I POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE: Warning: Extension MASQUERADE revision 0 not supported, missing kernel module?
iptables v1.8.10 (nf_tables):  CHAIN_ADD failed (No such file or directory): chain POSTROUTING
 (exit status 4))

I'm using GKE version 1.28, with the default dind container image in the helm chart.


asafhm commented Feb 4, 2024

Wonder if this has anything to do with the recent fix GKE has released for CVE-2023-6817

@viniciusesteter (Author)

Wonder if this has anything to do with the recent fix GKE has released for CVE-2023-6817

I don't think so, because these errors have been happening in my cluster for the last 4–5 months.


asafhm commented Feb 4, 2024

Ended up following this workaround which made dind work again: #3159 (comment)

Still think dind needs to address this.
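
For anyone on the legacy RunnerDeployment from the summerwind chart (as in the resource definitions above), there is no dind container to edit in the Helm values, but newer summerwind CRDs expose a dockerEnv field on the runner spec that is intended to be forwarded to the docker sidecar. A minimal sketch, assuming your installed CRD version actually has dockerEnv (check the CRD schema, e.g. with kubectl explain runnerdeployments.spec.template.spec):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeployment   # hypothetical name
spec:
  template:
    spec:
      organization: company-a
      # Mirrors the DOCKER_IPTABLES_LEGACY=1 workaround from #3159;
      # assumed to be passed through to the "docker" sidecar container.
      dockerEnv:
        - name: DOCKER_IPTABLES_LEGACY
          value: "1"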

@viniciusesteter (Author)

But where in the Helm manifest can I put these arguments? I don't have a value that targets the docker container. I'm actually using image: summerwind/actions-runner:latest in my Dockerfile and summerwind/actions-runner:latest in the values.yaml consumed by my Helm deployment.yaml.


asafhm commented Feb 5, 2024

I agree, that's tricky.
I actually transitioned to the new runner-scale-set operator, where you can control the pod template, including the dind sidecar container.

@jctrouble

I agree, that's tricky. I actually transitioned to the new runner-scale-set operator, where you can control the pod template, including the dind sidecar container.

@asafhm Would you be willing to share the snippet of your values.yaml (or helm command) where you specified the dind container with the workaround environment variable?


asafhm commented Feb 11, 2024

@jctrouble Here's a portion of the values.yaml I use for the gha-runner-scale-set chart:

template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:latest
        command:
          ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        imagePullPolicy: Always
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            cpu: 400m
            memory: 512Mi
          requests:
            cpu: 200m
            memory: 256Mi
        env:
          - name: DOCKER_HOST
            value: unix:///run/docker/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /run/docker
            readOnly: true
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///run/docker/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
          # TODO: Once this issue is fixed (https://github.com/actions/actions-runner-controller/issues/3159),
          # we can switch to containerMode.type=dind and keep only the "runner" container specs and remove the "dind" container, init containers and volumes parts from the values.
          - name: DOCKER_IPTABLES_LEGACY
            value: "1"
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /run/docker
          - name: dind-externals
            mountPath: /home/runner/externals
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}

The reason I included much more here than just the env var is that the docs specify that if you need to modify anything in the dind container, you have to copy its entire configuration into your values file and edit it there.
Not a clean solution yet, I'm afraid, but at least it works well.


rekha-prakash-maersk commented Feb 11, 2024

Hi @asafhm, I have tried your workaround but I'm still facing the same issue. It started after upgrading the new scale set to its latest version. Any other options to try? Thanks!

The runner starts fine, but the error appears when I run a workflow that has a docker build step, so I am a bit clueless!


asafhm commented Feb 12, 2024

@rekha-prakash-maersk Did you verify that the runner pods that come up have said env var in the dind container spec?
Also, did you check the dind container logs? Cannot connect to the Docker daemon at unix:///run/docker.sock can result from a number of different causes.
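
One quick sanity check (the pod name below is hypothetical) is to dump the rendered pod with kubectl get pod <runner-pod> -o yaml and confirm the dind container carries the variable from the values file above, roughly:

- name: dind
  image: docker:dind
  env:
    - name: DOCKER_GROUP_GID
      value: "123"
    - name: DOCKER_IPTABLES_LEGACY
      value: "1"

If the variable isn't there, the scale set is most likely still rendering its built-in dind template instead of the custom one from the values file.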

@rekha-prakash-maersk

Hi @asafhm, I found that the dind container needed more resources for the docker build that was being executed. Thanks for the help!

@sravula84

We are facing a similar issue:

time="2024-04-11T22:08:59.214409763Z" level=info msg="Loading containers: start."
time="2024-04-11T22:08:59.337082693Z" level=info msg="stopping event stream following graceful shutdown" error="" module=libcontainerd namespace=moby
time="2024-04-11T22:08:59.337523532Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
time="2024-04-11T22:08:59.337584737Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: unable to add return rule in DOCKER-ISOLATION-STAGE-1 chain: (iptables failed: iptables --wait -A DOCKER-ISOLATION-STAGE-1 -j RETURN: iptables v1.8.10 (nf_tables): RULE_APPEND failed (No such file or directory): rule in chain DOCKER-ISOLATION-STAGE-1
 (exit status 4))
Stream closed EOF for arc-runners/arc-runner-set-qwwpf-runner-nffvc (dind)

Any suggestions, @rekha-prakash-maersk @asafhm?

@marc-barry

I'm having the same issue on Google Cloud Platform (GKE) when simply using:

containerMode:
  type: "dind"

I haven't adjusted any of the values.

@rekha-prakash-maersk

Hi @marc-barry, I allocated more CPU and memory to the dind container, as below, which resolved the issue for me:

      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///run/docker/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        resources:
          requests:
            memory: "500Mi"
            cpu: "300m"
          limits:
            memory: "500Mi"
            cpu: "300m"
        securityContext:
          privileged: true

@marc-barry

@rekha-prakash-maersk thanks for that information. We've decided to move away from using runners on Kubernetes as the documentation isn't yet fully complete and we don't want to spend our time fighting infrastructure problems like we are experiencing with this controller. The concepts and ideas are pretty sound but the execution is challenging. For the time being, we have gone to bare VMs running Debian on GCP on both t2a-standard-x for our Arm64 builds and t2d-standard-x for our Amd64 builds. We then have an image template that simply has Docker installed on the machine and the runner started with Systemd. I was able to get this all running in under an hour versus the challenges faced with the Actions Runner Controller.

GitHub Actions is super convenient, and that's why we use it. But if I find the need to bring more and more of our runners in-house, then I'll switch us to Buildkite, as I feel their BYOC model is a bit more developed (and I have a lot of experience with it).

@sravula84

@rekha-prakash-maersk do we need to comment out the section below?

containerMode:
  type: "dind"
