Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix --hash-unmatched behavior #615

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from
Open

Conversation

intelfx
Copy link

@intelfx intelfx commented Mar 8, 2023

The problem with --hash-unmatched is twofold.

First, the criterion in rm_shred_group_update_status() that is responsible for hashing partially-hashed groups to the end is broken. It can never be fulfilled because group->head_files is always empty at this point, and even if it was not, head->digest is always NULL because that's what rm_shred_group_push_file() does.

The only reason why --hash-unmatched even works is because the --merge-directories condition above is broken as well and it always fires for all single-file groups. (Consequently, --hash-unmatched without --merge-directories is broken in develop.) However, the same bug also means that all unique files will always get hashed to the end if --merge-directories is set, thus making --hash-unmatched behave like --hash-uniques.

This function only works as designed if neither --hash-unmatched nor --hash-uniques is set, because in this case all single-file groups get filtered early by the rm_shred_group_qualifies() check.

Fixes #614.

@intelfx intelfx force-pushed the work/fix-614 branch 5 times, most recently from 77f32ed to 0daf500 Compare March 9, 2023 08:44
@intelfx intelfx marked this pull request as ready for review March 9, 2023 08:45
@intelfx intelfx marked this pull request as draft March 9, 2023 08:48
@intelfx intelfx force-pushed the work/fix-614 branch 4 times, most recently from b0d836b to a989ed2 Compare March 9, 2023 09:45
intelfx added 2 commits March 9, 2023 13:53
The intention behind this criterion seems to handle partially-hashed
(i. e. child) groups that do not fulfill requirements to continue
hashing them (i. e. only contain a single file after sifting). Normally
such groups are left alone (because we have hashed enough of the file to
understand it is unique), however by definition of --hash-unmatched we
have to hash these files to the end.

The problem is that when rm_shred_group_update_status() is called,
group->held_files does not contain the currently pushed file. Thus,
because this condition is only executed for single-file groups
(group->n_clusters == 1), group->held_files will never be non-empty.
Anyway, even if it was non-empty, head->digest will always be NULL
because held files have digests freed in rm_shred_group_push_file().

Therefore, this condition will never be fulfilled. The only reason why
--hash-unmatched even works at all is due to the broken hardlinked files
condition that matches all single-inode groups. (Another consequence is
that --hash-unmatched is broken if --merge-directories is not used.)
`group->n_inodes == 1` is also true for groups that simply consist of a
single file. This condition will cause all single-file groups to be
hashed if `--merge-directories` is also set.

Additionally, the whole `group->n_inodes == 1` condition is redundant
because not following on the branch means that `group->n_clusters == 1`
and therefore `group->n_inodes == 1`.

Thus, check that we are actually dealing with a group of hardlinks (and
not just a single-file group that did not cause an early return because
we are also doing `--hash-unmatched`).

Fixes sahib#614.
@intelfx intelfx changed the title shredder: do not hash single-file groups unless --hash-uniques Fix --hash-unmatched behavior Mar 9, 2023
@intelfx intelfx marked this pull request as ready for review March 9, 2023 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant