-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new source hashing methods: content_sha256
, content_sha384
, content_sha512
#5277
base: main
Are you sure you want to change the base?
Conversation
CodSpeed Performance ReportMerging #5277 will not alter performanceComparing Summary
|
I think this is cool. It would also work nicely with the new proposal for "rendered recipes" (conda/ceps#74). On that note - should we continue adding features to conda-build without any standardization (e.g. CEP) process? |
I'm planning to submit a CEP. I opened this draft to explore what kind of things are needed for a stable yet robust logic, cross platform. Things like permissions and so on don't translate well to Windows. |
Awesome. Yeah, I also recently looked at a few content hash implementations in Rust but didn't find anything super convincing yet. There are a bunch though (https://crates.io/search?q=content%20hash) |
So far the scheme I followed looks a lot like https://github.com/DrSLDR/dasher?tab=readme-ov-file#hashing-scheme. Things to standardize would be how the tree is sorted, the normalization of the path, the separators (to prevent this), and the allowed algorithms. I've seen a few merkle tree based packages but we don't need all the proof stuff, or leaf querying; just comparing the root hash. Maybe it could be implemented in a recursive way that doesn't involve obtaining the whole file tree beforehand if that increases performance or simplifies implementation elsewhere. IMO this feels like one of those CEPs that does require prototyping first to see which things have to be standardized. |
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
content_sha256
hash checkscontent_md5
, content_sha1
, content_sha256
As we are the ones hashing (and validating the hash) as opposed to take published hash values from somewhere, I don't see a reason to have any hash other than SHA256. In fact, I would be fine to only have a |
I'd rather keep the algorithm suffix. See also #4793 for other algos we might want to support. It takes very little code to add (or remove) algos to the scheme. As long as it's accepted by |
@wolfv one day far into the future we will regret not specifying the hash type in the name of the hash field. I cannot imagine the pain we'd be in right now if we had done that with MD5. 😨 |
Well I still don't see a point in adding additional algorithms beyond "it's easy" - nobody should use them. |
@jaimergp Is it possible to adjust the CEP to allow any hash supported by the python stdlib? Having to vote over and over again on adding stuff seems rather annoying? |
The CEP doesn't even mention which hash should be used. It's a scheme for hashing directories with an algorithm of your choice. |
Also how do we envision this to be used? I have two use cases in mind for rattler-build:
|
My point is that it's trivial to change this at any point during the review process; I'm more concerned with the actual hashing scheme proposed. The keys being added to the |
Hm, true, we could add a little subcommand here to make it easier. Although honestly, I usually run |
I don't think it is wise to add support for two hashing algorithms with known vulnerabilities in 2024. Although it may be unlikely, that smells like an avenue for a supply chain attack to me. I think if we were to support multiple hashing algorithms, we should support algorithms that are still deemed viable and secure, like the other SHA-* bit lengths. |
#4793 will probably land soon. I'll update this branch with it once it reaches main and then remove the weak hashes. If we are that concerned about |
We'll need a lot of advanced notice to deprecate them on conda-forge. For now we should probably add a lint + minimigrator to move to sha256. |
I am not concerned, I just don't understand why we want to add them? IMO it doesn't add value. Having the MD5 hash available for the regular hash makes sense because some pacakges might publish the known-good value (and that can be used in the recipe), but for something that we have invented ourselves I don't see a use-case where it is justified to use anything other than the best available hash. |
content_md5
, content_sha1
, content_sha256
content_sha256
, content_sha384
, content_sha512
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! We should probably vote on the CEP before merging.
Setting as blocked on the CEP vote |
try: | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That second try block isn't needed, since we can catch multiple exception types in the same block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but I want to catch the potential OSError in the except UnicodeDecodeError
arm. Will that raised exception be caught in the try/except block? IOW, will this print "Hello"? I don't think it does:
try:
raise ValueError
except ValueError:
raise RuntimeError
except RuntimeError:
print("Hellow!")
conda_build/utils.py
Outdated
str | ||
The hexdigest of the computed hash, as described above. | ||
""" | ||
log = get_logger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move that to the top of the module per best practice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the practice followed in the other functions, fwiw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I see why. The get_logger
utility is defined in that module, so there's no top-level function to use. If anything, it would go at the bottom of the module? Do you prefer that or shall we leave it in-function?
conda_build/utils.py
Outdated
lines = [] | ||
with open(path) as fh: | ||
for line in fh: | ||
# Accumulate all line-ending normalized lines first | ||
# to make sure the whole file is read. This prevents | ||
# partial updates to the hash with hybrid text/binary | ||
# files (e.g. like the constructor shell installers). | ||
lines.append(line.replace("\r\n", "\n")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that might be a memory hog, depending on how big the files are you're normalizing, it might be best to write this to a temp file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, good point. Didn't the stdlib have a temporary file object that only writes to disk after a certain size? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, SpooledTemporaryFile
. Added it in c192799
(#5277)
conda_build/utils.py
Outdated
except OSError as exc: | ||
log.warning( | ||
"Can't open file %s. Hashing path only...", path.name, exc_info=exc | ||
) | ||
else: | ||
log.warning("Can't detect type for path %s. Hashing path only...", path) | ||
hasher.update(b"?") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not completely following this error state handling, why doesn't this stop the process since it can't read the file? Wouldn't that indicate that the recipe is faulty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know what kind of files a user will have in that directory. They might point path
to something containing a device file or who knows what. Not really common practice, but that doesn't mean that their source is invalid or that we can't verify that the other stuff is actually the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this error out because it's essentially a file we can't verify, and we don't know what it might be hiding. If it causes errors, users can deliberately skip it via the skip
parameter.
@jezdez, the CEP passed, so I guess this review can be dismissed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments.
Description
Checklist - did you ...
news
directory (using the template) for the next release's release notes?