After maybe a few weeks of running, my workers (running in OKD with worker.kind=Deployment) sporadically fail, throwing errors in builds like:
find or create container on worker prd-concourse-worker-7dbdc58b78-tk6vf: create COW volume: Get "/volumes-async/9d8074be-663c-4c85-7cdd-3287b41b5142": worker prd-concourse-worker-7dbdc58b78-tk6vf disappeared while trying to reach it
find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: create COW volume: failed to create volume
run check: find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: failed to create volume
These appear at various stages of the build.
It's probably just one worker handling specific builds that has the problem, since I have affinity set to optimize performance. Still, it breaks all builds that gravitate to that worker, even though at that moment the 4 workers are largely idle.
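For context, the "gravitate to one worker" behavior would come from the web node's container placement strategy, which in the concourse-chart is (assuming the current chart layout, keys hypothetical if your chart version differs) set under `concourse.web`. Something like the following would spread load instead of favoring volume locality:

```yaml
# values.yaml (illustrative excerpt, not my actual config) --
# placement strategy is a web/ATC setting; the default
# "volume-locality" tends to funnel related builds onto one worker,
# while "fewest-build-containers" spreads them out.
concourse:
  web:
    containerPlacementStrategy: fewest-build-containers
```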
Restarting the worker deployment fixes it. The logs for the deployment look roughly as follows; they basically mirror the UI-facing errors we see there and don't give much more detail.
214605 {"timestamp":"2024-01-22T21:06:13.704778754Z","level":"error","source":"atc","message":"atc.tracker-imb.run.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"build":"check","error":"failed to create volume","pipeline":"authorization-service.gitlab.merge-request","pre_build_id":1273122,"resource":"merge-request","session":"23.1273118.11","team":"mycompany-myapp-bin","volume":"9aa5369c-55ea-44d0-4def-d7ccb7f2a4c8"}}
214606 {"timestamp":"2024-01-22T21:06:13.710768899Z","level":"error","source":"atc","message":"atc.tracker-imb.run.find-or-create-cow-volume-for-container.failed-to-create-volume-in-baggageclaim","data":{"build":"check","error":"failed to create volume","pipeline":"authcheck-service.gitlab.merge-request","pre_build_id":1273095,"resource":"merge-request","session":"23.1273090.11","team":"mycompany-myapp-bin","volume":"cfac15bf-c861-4016-7262-c68e2ee2e0fa"}}
Errors from the web deployment and the ATC are more interesting:
209808 {"timestamp":"2024-01-22T21:06:43.611410299Z","level":"error","source":"atc","message":"atc.tracker-imb.run.failed-to-create-container-in-garden","data":{"build":"check","error":"starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209810 {"timestamp":"2024-01-22T21:06:43.617141195Z","level":"error","source":"atc","message":"atc.tracker-imb.run.failed-to-create-container-in-garden","data":{"build":"check","container":"a583fb7e-d27b-4849-79eb-8788591ed4a5","error":"starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209813 {"timestamp":"2024-01-22T21:06:43.646893973Z","level":"info","source":"atc","message":"atc.tracker-imb.run.errored","data":{"build":"check","error":"run check: find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209814 {"timestamp":"2024-01-22T21:06:43.678480696Z","level":"info","source":"atc","message":"atc.tracker-imb.run.errored","data":{"build":"check","error":"run check: find or create container on worker prd-concourse-worker-7dbdc58b78-zgqp6: starting task: new task: failed to start shim: start failed: : fork/exec /usr/local/concourse/bin/containerd-shim-runc-v2: resource temporarily unavailable: unknown","pipeline":"enry-myapp-hello-world.gitlab.project-factory","pre_build_id":1246920,"resource":"git-branches","session":"23.1246915","team":"mycompany-myapp-web"}}
209815 {"timestamp":"2024-01-22T21:06:43.679776910Z","level":"info","source":"atc","message":"atc.tracker-imb.run.errored","data":{"build":"check","error":"run check: start process: backend error: Exit status: 500, message: {\"Type\":\"\",\"Message\":\"proc start: OCI runtime exec failed: fork/exec /usr/local/concourse/bin/runc: resource temporarily unavailable: unknown\",\"Handle\":\"\",\"ProcessID\":\"\",\"Binary\":\"\"}\n","pipeline":"cron.yul01dvlscm01.develop","pre_build_id":1246993,"resource":"docker-src","session":"23.1246986","team":"summit-web-bin-spotlight"}}
So far I've only ever seen fork fail when hitting the process limit or some other resource constraint. Is that a correct assessment? How can I increase the ulimit? Is this a Concourse issue or an OKD issue? I am on chart version 17.2.0.
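For what it's worth, "resource temporarily unavailable" from fork/exec is EAGAIN, which in a container usually means either the cgroup pids limit or the process's RLIMIT_NPROC has been hit. A quick way to check, run from inside the suspect worker container (paths are the standard cgroup v1/v2 locations; adjust if your nodes lay them out differently):

```shell
# Run inside the suspect worker container (e.g. via `oc exec`).

# Task (pid/thread) limit for the container's cgroup:
cat /sys/fs/cgroup/pids/pids.max 2>/dev/null   # cgroup v1
cat /sys/fs/cgroup/pids.max 2>/dev/null        # cgroup v2

# How many threads are currently visible (busybox ps may lack -L):
ps -eLf 2>/dev/null | wc -l

# Per-process limits as the worker sees them
# (substitute the worker's PID for "self"):
grep -E 'Max processes|Max open files' /proc/self/limits
```

If the thread count is anywhere near pids.max or "Max processes" when builds start failing, that would point at a leak or a too-low limit rather than genuine load.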
I've seen similar issues reported, but none related to running in k8s.
Please advise, thanks!