Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip invalid link path during gc #4343

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

microyahoo
Copy link
Contributor

Signed-off-by: Liang Zheng [email protected]

Copy link
Member

@milosgajdos milosgajdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what is it you are trying to achieve with this PR?

Specifically, can you give us an example of when a stored registry link contains an invalid digest? That'd be mildly worrying, tbh

@microyahoo
Copy link
Contributor Author

Can you describe what is it you are trying to achieve with this PR?

Specifically, can you give us an example of when a stored registry link contains an invalid digest? That'd be mildly worrying, tbh

hi @milosgajdos, we use Harbor as our image repository, and Harbor uses Registry to manage images underneath. I've noticed that after running for a while, Harbor tends to leave behind a lot of residual data, as described in the #4330. I tried running garbage collection (GC) to reclaim resources, but encountered errors. It turns out that there are many invalid link files in the Registry. These link files are either empty or have invalid digests, which directly cause the GC to interrupt. Actually, we can skip these invalid link files instead of directly interrupting the GC.

~/go/src/github.com/distribution/distribution (link ✗) bin/registry garbage-collect cmd/registry/config-dev.yml -m
.......
deeproute-all/docker-repo/data-ocean/nginx
deeproute-all/docker-repo/data-ocean/prod/data-ocean-front-end
deeproute-all/docker-repo/data-ocean/qa/data-ocean-front-end
deeproute-all/docker-repo/dataocean/adas-web-term
deeproute-all/docker-repo/dataocean/dataocean-account-management-service
deeproute-all/docker-repo/dataocean/dataocean-adas-monitor
deeproute-all/docker-repo/dataocean/dataocean-adas-tar
deeproute-all/docker-repo/dataocean/dataocean-api-docs
deeproute-all/docker-repo/dataocean/dataocean-api-gateway
deeproute-all/docker-repo/dataocean/dataocean-authentication-service
deeproute-all/docker-repo/dataocean/dataocean-availability-service
deeproute-all/docker-repo/dataocean/dataocean-bag-construction
deeproute-all/docker-repo/dataocean/dataocean-bag-constructor
deeproute-all/docker-repo/dataocean/dataocean-car-transmit
deeproute-all/docker-repo/dataocean/dataocean-cloud-vehicle-communication
deeproute-all/docker-repo/dataocean/dataocean-data-agent
deeproute-all/docker-repo/dataocean/dataocean-data-agent-overseer
deeproute-all/docker-repo/dataocean/dataocean-data-construction
deeproute-all/docker-repo/dataocean/dataocean-data-migration
deeproute-all/docker-repo/dataocean/dataocean-data-provider-service
deeproute-all/docker-repo/dataocean/dataocean-data-transformation-service
deeproute-all/docker-repo/dataocean/dataocean-data-tube-service
deeproute-all/docker-repo/dataocean/dataocean-data-tunnel
deeproute-all/docker-repo/dataocean/dataocean-database-migration
deeproute-all/docker-repo/dataocean/dataocean-database-migration-data-managment-v2
deeproute-all/docker-repo/dataocean/dataocean-download-test
deeproute-all/docker-repo/dataocean/dataocean-dr-pipeline
deeproute-all/docker-repo/dataocean/dataocean-drfile-archive-serivce
deeproute-all/docker-repo/dataocean/dataocean-drtag
deeproute-all/docker-repo/dataocean/dataocean-file-download-service
deeproute-all/docker-repo/dataocean/dataocean-file-galaxy
failed to garbage collect: failed to mark: filesystem: filesystem: invalid checksum digest format
~/go/src/github.com/distribution/distribution (link ✗) ll /mnt/cephfs/docker/registry/v2/repositories/infra-operation/image/build-env-image/_manifests/revisions/sha256/a68799c5a48d4cfe46ea1e122245097441f14720b0488d1d162eea637e3b96f7/link
-rw-r--r-- 1 10000 71 Apr 28  2022 /mnt/cephfs/docker/registry/v2/repositories/infra-operation/image/build-env-image/_manifests/revisions/sha256/a68799c5a48d4cfe46ea1e122245097441f14720b0488d1d162eea637e3b96f7/link
~/go/src/github.com/distribution/distribution (link ✗) cat /mnt/cephfs/docker/registry/v2/repositories/infra-operation/image/build-env-image/_manifests/revisions/sha256/a68799c5a48d4cfe46ea1e122245097441f14720b0488d1d162eea637e3b96f7/link
~/go/src/github.com/distribution/distribution (link ✗) ll /mnt/cephfs/docker/registry/v2/repositories/deeproute-simulation/single-frame-task-images/_manifests/revisions/sha256/20f5baba4c1fdd30c9e13984365ed873b06a945afb877569219daee8870da30b/link
-rw-r--r-- 1 10000 0 Jun 24  2022 /mnt/cephfs/docker/registry/v2/repositories/deeproute-simulation/single-frame-task-images/_manifests/revisions/sha256/20f5baba4c1fdd30c9e13984365ed873b06a945afb877569219daee8870da30b/link

@milosgajdos
Copy link
Member

These link files are either empty or have invalid digests

I'd prefer looking into the underlying issue rather than adding a bandaid on a symptom.

@microyahoo
Copy link
Contributor Author

microyahoo commented May 8, 2024

These link files are either empty or have invalid digests

I'd prefer looking into the underlying issue rather than adding a bandaid on a symptom.

Maybe this should be considered quite common. Taking the example of pushing an image, it involves several HTTP requests. For the push process, the entire operation is non-atomic, so it's very likely that failures or other anomalies midway could lead to data corrupt or incompleteness. This is when running GC would cause errors. I think we can skip the anomalous data.

@milosgajdos
Copy link
Member

What happens to anomalous data if you leave them dangling in the storage and then a push with the same references is attempted?

@microyahoo
Copy link
Contributor Author

What happens to anomalous data if you leave them dangling in the storage and then a push with the same references is attempted?

the link file will be overwritten if push the same references.

if err != nil {
if err == digest.ErrDigestInvalidFormat {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering, how did you come across broken digest in your storage? How did it get there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run GC, it exits abnormally. To locate the issue, I added some logs and found that the error occurs when parsing the digest while reading some link paths. Then I checked the storage and discovered that these link paths are either empty or corrupt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but that's worrisome right? Because we might be addressing a symptom here, not the root cause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but I don't have a good way to reproduce it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the link paths are empty/corrupt then wouldn't the better idea just to be to remove them and warn rather then leave them be? I worry we get into a situation with a bunch of these corrupt/empty link files sitting in someones blob store and the only way to remove them would be via manual intervention.

Copy link
Member

@milosgajdos milosgajdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@squizzi mind having a look at this?

@milosgajdos milosgajdos requested a review from squizzi June 28, 2024 20:43
@squizzi
Copy link
Collaborator

squizzi commented Nov 5, 2024

I'd prefer looking into the underlying issue rather than adding a bandaid on a symptom.

Honestly, both is probably the right way to go here. If GC is leaving a lot of cruft behind that's concerning because it's not actually garbage collecting. If every abnormal run of GC is stopping because it's hitting an error this could result in a lot of failed GC runs. I'm in support of band-aiding it but perhaps filing an issue or something to look into the underlying cause as a follow-up.

@milosgajdos
Copy link
Member

milosgajdos commented Nov 5, 2024

So the question here for me is: is the GC leaving broken links behind or is it failed push. Because if it's the former I'm ok for merging this in, if it's the latter then we have a bigger problem on our hands 😓

Yeah maybe we can proceed with this and opening an issue to make note of it so it doesnt get forgotten. Mind actioning this @microyahoo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants