Add new source hashing methods: content_sha256, content_sha384, content_sha512 #5277

Open · wants to merge 36 commits into main from jaimergp:content-hash

Conversation

@jaimergp (Contributor) commented Apr 12, 2024

Description

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@jaimergp jaimergp requested a review from a team as a code owner April 12, 2024 15:41
@jaimergp jaimergp marked this pull request as draft April 12, 2024 15:41
@conda-bot conda-bot added the cla-signed label Apr 12, 2024

codspeed-hq bot commented Apr 12, 2024

CodSpeed Performance Report

Merging #5277 will not alter performance

Comparing jaimergp:content-hash (bcc7ad5) with main (3cf75b6)

Summary

✅ 5 untouched benchmarks

@wolfv (Contributor) commented Apr 15, 2024

I think this is cool. It would also work nicely with the new proposal for "rendered recipes" (conda/ceps#74).

On that note - should we continue adding features to conda-build without any standardization (e.g. CEP) process?

@jaimergp (Contributor, Author)

> should we continue adding features to conda-build without any standardization (e.g. CEP) process?

I'm planning to submit a CEP. I opened this draft to explore what's needed for stable yet robust, cross-platform logic. Things like permissions and so on don't translate well to Windows.

@wolfv (Contributor) commented Apr 15, 2024

Awesome. Yeah, I also recently looked at a few content hash implementations in Rust but didn't find anything super convincing yet. There are a bunch though (https://crates.io/search?q=content%20hash)

@jaimergp (Contributor, Author)

So far the scheme I followed looks a lot like https://github.com/DrSLDR/dasher?tab=readme-ov-file#hashing-scheme. Things to standardize would be how the tree is sorted, the normalization of the path, the separators (to prevent this), and the allowed algorithms.

I've seen a few Merkle-tree-based packages, but we don't need all the proof machinery or leaf querying; we just compare the root hash.

Maybe it could be implemented in a recursive way that doesn't involve obtaining the whole file tree beforehand if that increases performance or simplifies implementation elsewhere. IMO this feels like one of those CEPs that does require prototyping first to see which things have to be standardized.
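
For illustration, here is a minimal sketch of the kind of scheme being discussed (sorted traversal, normalized relative paths, explicit separators between fields, pluggable algorithm). The function name, the type markers, and the NUL separators are assumptions for the example, not the PR's actual implementation or the eventual CEP wording; it needs Python 3.9+ for Path.readlink().

import hashlib
from pathlib import Path

def content_hash(directory: str, algorithm: str = "sha256") -> str:
    """Hash a directory tree deterministically: walk entries in sorted
    order and feed the POSIX-normalized relative path, a type marker,
    and the file contents into a single hasher, using NUL separators."""
    hasher = hashlib.new(algorithm)
    root = Path(directory)
    for path in sorted(root.rglob("*"), key=lambda p: p.relative_to(root).as_posix()):
        rel = path.relative_to(root).as_posix()  # "/" separators on every platform
        hasher.update(rel.encode("utf-8") + b"\x00")
        if path.is_symlink():
            hasher.update(b"L" + str(path.readlink()).encode("utf-8") + b"\x00")
        elif path.is_dir():
            hasher.update(b"D\x00")
        elif path.is_file():
            hasher.update(b"F")
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(8192), b""):
                    hasher.update(chunk)
            hasher.update(b"\x00")
        else:
            hasher.update(b"?\x00")  # device/socket/etc.: hash the path only
    return hasher.hexdigest()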

@jaimergp (Contributor, Author)

pre-commit.ci autofix

@jaimergp jaimergp changed the title add content_sha256 hash checks Add new source hashing methods: content_md5, content_sha1, content_sha256 Nov 20, 2024
@wolfv (Contributor) commented Nov 26, 2024

As we are the ones computing (and validating) the hash, as opposed to taking published hash values from somewhere, I don't see a reason to have any hash other than SHA256.

In fact, I would be fine to only have a content_hash field that does not include the hash type at all.

@jaimergp (Contributor, Author)

I'd rather keep the algorithm suffix. See also #4793 for other algos we might want to support. It takes very little code to add (or remove) algos to the scheme. As long as it's accepted by hashlib.new(), that is.
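
For reference, the algorithm suffix can be resolved through hashlib.new() with very little glue. The key-parsing helper below is purely illustrative (conda-build's actual key handling may differ) and assumes Python 3.9+ for str.removeprefix:

import hashlib

def hasher_for_key(key: str):
    """Map a hypothetical 'content_<algo>' recipe key to a hashlib object,
    rejecting algorithms the local hashlib/OpenSSL build doesn't provide."""
    algo = key.removeprefix("content_")
    if algo not in hashlib.algorithms_available:
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return hashlib.new(algo)

# e.g. hasher_for_key("content_sha256") returns a fresh SHA-256 hasher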

@beckermr (Contributor)

@wolfv one day far into the future we will regret not specifying the hash type in the name of the hash field. I cannot imagine the pain we'd be in right now if we had done that with MD5. 😨

@wolfv (Contributor) commented Nov 26, 2024

Well I still don't see a point in adding additional algorithms beyond "it's easy" - nobody should use them.

@beckermr (Contributor)

@jaimergp Is it possible to adjust the CEP to allow any hash supported by the python stdlib?

Having to vote over and over again on adding stuff seems rather annoying?

@jaimergp (Contributor, Author)

> Is it possible to adjust the CEP to allow any hash supported by the python stdlib?

The CEP doesn't even mention which hash should be used. It's a scheme for hashing directories with an algorithm of your choice.

@wolfv (Contributor) commented Nov 26, 2024

Also how do we envision this to be used?

I have two use cases in mind for rattler-build:

  • After the fact, record the content hash for future re-builds as extra assurance in case the SHA256 of the artifact becomes outdated.
  • We could use it as an input alternative to sha256 in the URL field. However, to create the content hash, I would add a CLI option in rattler-build that would look something like:

rattler-build prepare-source --version 0.1.2 --recipe ./recipe.yaml and then the content hash gets injected or printed to the CLI.

@jaimergp (Contributor, Author)

> Well I still don't see a point in adding additional algorithms beyond "it's easy" - nobody should use them.

My point is that it's trivial to change this at any point during the review process; I'm more concerned with the actual hashing scheme proposed. The keys being added to the meta.yaml are secondary to that. For clarity, I'm fine with just content_sha256 and nothing else. Is there a reason an old system wouldn't have access to sha256? They'd have bigger problems, right?

@jaimergp (Contributor, Author)

> I would add a CLI option in rattler-build that would look something like [...]

Hm, true, we could add a little subcommand here to make it easier. Although honestly, I usually run conda-build, wait for the hashing mismatch error, and then copy the correct one 😬 (That's why I amended the text in the errors hah).

beckermr previously approved these changes Nov 26, 2024
@schuylermartin45 (Contributor) commented Nov 26, 2024

I don't think it is wise to add support for two hashing algorithms with known vulnerabilities (MD5 and SHA-1) in 2024. Although it may be unlikely, that smells like an avenue for a supply chain attack to me.

I think if we were to support multiple hashing algorithms, we should support algorithms that are still deemed viable and secure, like the other SHA-* bit lengths.

@jaimergp (Contributor, Author)

#4793 will probably land soon. I'll update this branch with it once it reaches main and then remove the weak hashes. If we are that concerned about md5 and sha1 being used, we should also study the possibility of deprecating them or at least warning about them in the logs.

@beckermr (Contributor)

We'll need a lot of advance notice to deprecate them on conda-forge. For now we should probably add a lint + minimigrator to move to sha256.

@wolfv (Contributor) commented Nov 26, 2024

I am not concerned; I just don't understand why we want to add them. IMO it doesn't add value. Having the MD5 hash available for the regular hash makes sense because some packages might publish the known-good value (and that can be used in the recipe), but for something that we have invented ourselves I don't see a use case where it is justified to use anything other than the best available hash.

@jaimergp jaimergp changed the title Add new source hashing methods: content_md5, content_sha1, content_sha256 Add new source hashing methods: content_sha256, content_sha384, content_sha512 Nov 26, 2024
beckermr previously approved these changes Nov 27, 2024
@beckermr (Contributor) left a comment

LGTM! We should probably vote on the CEP before merging.

@jezdez (Member) commented Nov 27, 2024

Setting as blocked on the CEP vote

Comment on lines +2056 to +2057
try:
    try:
Member:

That second try block isn't needed, since we can catch multiple exception types in the same block

Contributor Author:

Sure, but I want to catch the potential OSError in the except UnicodeDecodeError arm. Will that raised exception be caught in the try/except block? IOW, will this print "Hello"? I don't think it does:

try:
  raise ValueError
except ValueError:
  raise RuntimeError
except RuntimeError:
  print("Hellow!")

    str
        The hexdigest of the computed hash, as described above.
    """
    log = get_logger(__name__)
Member:

Let's move that to the top of the module per best practice

Contributor Author:

I was following the convention used in the other functions, fwiw.

Contributor Author:

Ah, now I see why. The get_logger utility is defined in that same module, so it can't be called at the top of the module. If anything, it would go at the bottom of the module? Do you prefer that, or shall we leave it in-function?

Comment on lines 2058 to 2065
lines = []
with open(path) as fh:
    for line in fh:
        # Accumulate all line-ending normalized lines first
        # to make sure the whole file is read. This prevents
        # partial updates to the hash with hybrid text/binary
        # files (e.g. like the constructor shell installers).
        lines.append(line.replace("\r\n", "\n"))
Member:

Hmm, that might be a memory hog depending on how big the files you're normalizing are; it might be best to write this to a temp file.

Contributor Author:

Hm, good point. Didn't the stdlib have a temporary file object that only writes to disk after a certain size? 🤔

Contributor Author:

Yea, SpooledTemporaryFile. Added it in c192799 (#5277)
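
For context, a minimal sketch of how tempfile.SpooledTemporaryFile can buffer the normalized content in memory and only spill to disk past a size threshold. The 5 MiB threshold and the function name are assumptions for the example, not necessarily what the referenced commit does:

import hashlib
import tempfile

def hash_normalized_text(path, algorithm="sha256"):
    """Hash a text file with CRLF normalized to LF, buffering up to
    ~5 MiB in memory and spilling to a temporary file beyond that."""
    hasher = hashlib.new(algorithm)
    with tempfile.SpooledTemporaryFile(max_size=5 * 1024 * 1024, mode="w+b") as tmp:
        with open(path, newline="") as fh:  # keep raw line endings
            for line in fh:
                tmp.write(line.replace("\r\n", "\n").encode())
        tmp.seek(0)
        for chunk in iter(lambda: tmp.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()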

Comment on lines 2073 to 2079
except OSError as exc:
    log.warning(
        "Can't open file %s. Hashing path only...", path.name, exc_info=exc
    )
else:
    log.warning("Can't detect type for path %s. Hashing path only...", path)
    hasher.update(b"?")
Member:

I'm not completely following this error-state handling. Why doesn't this stop the process, since it can't read the file? Wouldn't that indicate that the recipe is faulty?

Contributor Author:

We don't know what kind of files a user will have in that directory. They might point path to something containing a device file or who knows what. Not really common practice, but that doesn't mean that their source is invalid or that we can't verify that the other stuff is actually the same.

Contributor Author:

I made this case error out because it's essentially a file we can't verify, and we don't know what it might be hiding. If it causes errors, users can deliberately skip it via the skip parameter.

@jaimergp (Contributor, Author)

@jezdez, the CEP passed, so I guess this review can be dismissed now.

@jaimergp jaimergp requested a review from a team December 21, 2024 10:54
@beckermr (Contributor) left a comment

Some comments.

Labels: cla-signed ([bot] added once the contributor has signed the CLA)
Projects: Status: 🏗️ In Progress
Development: Successfully merging this pull request may close the issue "Better hashing of sources"
7 participants