After maybe a few weeks of running, my workers (running in OKD with worker.kind=Deployment) sporadically fail, throwing errors in builds like:
find or create container on worker prd-concourse-worker-7dbdc58b78-tk6vf: create COW volume: Get "/volumes-async/9d8074be-663c-4c85-7cdd-3287b41b5142": worker prd-concourse-worker-7dbdc58b78-tk6vf disappeared while trying to reach it
find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: create COW volume: failed to create volume
run check: find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: failed to create volume
These appear at various stages of the build.
It's probably just one worker handling specific builds that has the problem, since I have affinity set to optimize performance. Still, it breaks all builds that gravitate to that worker, even though at that moment the 4 workers are largely idle.
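For context, the "gravitate to one worker" behavior would come from the web node's container placement strategy, which in the concourse-chart is (assuming the current chart layout, keys hypothetical if your chart version differs) set under `concourse.web`. Something like the following would spread load instead of favoring volume locality:

```yaml
# values.yaml (illustrative excerpt, not my actual config) --
# placement strategy is a web/ATC setting; the default
# "volume-locality" tends to funnel related builds onto one worker,
# while "fewest-build-containers" spreads them out.
concourse:
  web:
    containerPlacementStrategy: fewest-build-containers
```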
Restarting the worker deployment fixes it. The logs for the deployment look roughly as follows; they basically mirror the UI-facing errors we see there and don't give much more detail.
214605 {"timestamp":"2024-01-22T21:06:13.704778754Z","level":"error","source":"atc","message":"atc.tracker-imb.run.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"build":"check","error":"failed to create volume","pipeline":"authorization-service.gitlab.merge-request","pre_build_id":1273122,"resource":"merge-request","session":"23.1273118.11","team":"mycompany-myapp-bin","volume":"9aa5369c-55ea-44d0-4def-d7ccb7f2a4c8"}}
214606 {"timestamp":"2024-01-22T21:06:13.710768899Z","level":"error","source":"atc","message":"atc.tracker-imb.run.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"build":"check","error":"failed to create volume","pipeline":"authcheck-service.gitlab.merge-request","pre_build_id":1273095,"resource":"merge-request","session":"23.1273090.11","team":"mycompany-myapp-bin","volume":"cfac15bf-c861-4016-7262-c68e2ee2e0fa"}}
Errors from the web deployment and the ATC are more interesting:
209808 {"timestamp":"2024-01-22T21:06:43.611410299Z","level":"error","source":"atc","message":"atc.tracker-imb.run.failed-to-create-container-in-garden","data":{"build":"check","error":"starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209810 {"timestamp":"2024-01-22T21:06:43.617141195Z","level":"error","source":"atc","message":"atc.tracker-imb.run.failed-to-create-container-in-garden","data":{"build":"check","container":"a583fb7e-d27b-4849-79eb-8788591ed4a5","error":"starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209813 {"timestamp":"2024-01-22T21:06:43.646893973Z","level":"info","source":"atc","message":"atc.tracker-imb.run.errored","data":{"build":"check","error":"run check: find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209814 {"timestamp":"2024-01-22T21:06:43.678480696Z","level":"info","source":"atc","message":"atc.tracker-imb.run.errored","data":{"build":"check","error":"run check: find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209815 {"timestamp":"2024-01-22T21:06:43.679776910Z","level":"info","source":"atc","message":"atc.tracker-imb.run.errored","data":{"build":"check","error":"run check: start process: backend error: Exit status: 500, message: {\"Type\":\"\",\"Message\":\"proc start: OCI runtime exec failed: fork/exec /usr/local/concourse/bin/runc: resource temporarily unavailable: unknown\",\"Handle\":\"\",\"ProcessID\":\"\",\"Binary\":\"\"}\n","pipeline":"cron.yul01dvlscm01.develop","pre_build_id":1246993,"resource":"docker-src","session":"23.1246986","team":"summit-web-bin-spotlight"}}
So far I've only ever seen fork fail when hitting the process limit or some other resource constraint. Is that a correct assessment? How can I increase the ulimit? Is this a Concourse issue or an OKD issue? I am on chart version 17.2.0.
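For what it's worth, "resource temporarily unavailable" from fork/exec is EAGAIN, which in a container usually means either the cgroup pids limit or the process's RLIMIT_NPROC has been hit. A quick way to check, run from inside the suspect worker container (paths are the standard cgroup v1/v2 locations; adjust if your nodes lay them out differently):

```shell
# Run inside the suspect worker container (e.g. via `oc exec`).

# Task (pid/thread) limit for the container's cgroup:
cat /sys/fs/cgroup/pids/pids.max 2>/dev/null   # cgroup v1
cat /sys/fs/cgroup/pids.max 2>/dev/null        # cgroup v2

# How many threads are currently visible (busybox ps may lack -L):
ps -eLf 2>/dev/null | wc -l

# Per-process limits as the worker sees them
# (substitute the worker's PID for "self"):
grep -E 'Max processes|Max open files' /proc/self/limits
```

If the thread count is anywhere near pids.max or "Max processes" when builds start failing, that would point at a leak or a too-low limit rather than genuine load.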
I've seen similar issues reported, but none related to running in k8s.
Please advise, thanks!