Add new source hashing methods: `content_sha256`, `content_sha384`, `content_sha512` #5277

jaimergp · 2024-04-12T15:40:59Z

Description

Closes #4762 (Better hashing of sources).
Relevant CEP submitted at conda/ceps#100 (Standardize algorithm for directory hashing).

Checklist - did you ...

Add a file to the news directory (using the template) for the next release's release notes?
Add / update necessary tests?
Add / update outdated documentation?

codspeed-hq · 2024-04-12T16:00:12Z

CodSpeed Performance Report

Merging #5277 will not alter performance

Comparing jaimergp:content-hash (bcc7ad5) with main (3cf75b6)

Summary

✅ 5 untouched benchmarks

wolfv · 2024-04-15T09:13:49Z

I think this is cool. It would also work nicely with the new proposal for "rendered recipes" (conda/ceps#74).

On that note - should we continue adding features to conda-build without any standardization (e.g. CEP) process?

jaimergp · 2024-04-15T13:52:53Z

> should we continue adding features to conda-build without any standardization (e.g. CEP) process?

I'm planning to submit a CEP. I opened this draft to explore what is needed for logic that is stable and robust across platforms. Things like permissions don't translate well to Windows.

wolfv · 2024-04-15T13:59:30Z

Awesome. Yeah, I also recently looked at a few content hash implementations in Rust but didn't find anything super convincing yet. There are a bunch though (https://crates.io/search?q=content%20hash)

jaimergp · 2024-04-15T16:23:46Z

So far the scheme I followed looks a lot like https://github.com/DrSLDR/dasher?tab=readme-ov-file#hashing-scheme. Things to standardize would be how the tree is sorted, the normalization of the path, the separators (to prevent this), and the allowed algorithms.

I've seen a few merkle tree based packages but we don't need all the proof stuff, or leaf querying; just comparing the root hash.

Maybe it could be implemented in a recursive way that doesn't involve obtaining the whole file tree beforehand if that increases performance or simplifies implementation elsewhere. IMO this feels like one of those CEPs that does require prototyping first to see which things have to be standardized.
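
To make the points that need standardizing concrete, here is a minimal sketch of a root-hash-only directory hash. It is not the scheme from this PR or the CEP: the function name `hash_directory`, the NUL separator, and the `D`/`F`/`?` type markers are all assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def hash_directory(root, algorithm="sha256"):
    """Return one hex digest for a directory tree (illustrative sketch only)."""
    root = Path(root)
    hasher = hashlib.new(algorithm)
    # Sort by POSIX-style relative path so the order does not depend on the
    # platform's path separator or on directory listing order.
    entries = sorted(root.rglob("*"), key=lambda p: p.relative_to(root).as_posix())
    for path in entries:
        rel = path.relative_to(root).as_posix()
        hasher.update(rel.encode("utf-8"))
        # Hypothetical separator: without one, file "a" containing "bc" and
        # file "ab" containing "c" would feed identical bytes to the hasher.
        hasher.update(b"\x00")
        if path.is_dir():
            hasher.update(b"D")
        elif path.is_file():
            hasher.update(b"F")
            hasher.update(path.read_bytes())
        else:
            hasher.update(b"?")
        hasher.update(b"\x00")
    return hasher.hexdigest()
```

Only the final digest is compared, so none of the proof or leaf-query machinery of a Merkle tree is needed here.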

jaimergp · 2024-11-20T00:33:45Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

wolfv · 2024-11-26T12:09:49Z

As we are the ones hashing (and validating the hash), as opposed to taking published hash values from somewhere, I don't see a reason to have any hash other than SHA256.

In fact, I would be fine to only have a content_hash field that does not include the hash type at all.

jaimergp · 2024-11-26T12:12:51Z

I'd rather keep the algorithm suffix. See also #4793 for other algos we might want to support. It takes very little code to add (or remove) algos to the scheme. As long as it's accepted by hashlib.new(), that is.
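
As a rough illustration of the `hashlib.new()` criterion, the check below accepts whatever algorithm names the local Python build accepts. The helper name `is_supported` is hypothetical and not part of conda-build.

```python
import hashlib


def is_supported(algorithm):
    """Return True if hashlib.new() accepts this algorithm name."""
    try:
        hashlib.new(algorithm)
    except ValueError:
        # hashlib.new() raises ValueError for unknown algorithm names.
        return False
    return True


# Hex digest lengths for the three algorithms named in the PR title.
digest_lengths = {
    name: len(hashlib.new(name, b"example").hexdigest())
    for name in ("sha256", "sha384", "sha512")
}
```

With this approach the set of accepted names depends on the Python build, which is one reason a key suffix such as `content_sha256` keeps the choice explicit.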

beckermr · 2024-11-26T12:13:17Z

@wolfv one day far into the future we will regret not specifying the hash type in the name of the hash field. I cannot imagine the pain we'd be in right now if we had done that with MD5. 😨

wolfv · 2024-11-26T12:14:20Z

Well I still don't see a point in adding additional algorithms beyond "it's easy" - nobody should use them.

beckermr · 2024-11-26T12:14:21Z

@jaimergp Is it possible to adjust the CEP to allow any hash supported by the python stdlib?

Having to vote over and over again on adding stuff seems rather annoying?

jaimergp · 2024-11-26T12:16:10Z

> Is it possible to adjust the CEP to allow any hash supported by the python stdlib?

The CEP doesn't even mention which hash should be used. It's a scheme for hashing directories with an algorithm of your choice.

wolfv · 2024-11-26T12:17:00Z

Also how do we envision this to be used?

I have two use cases in mind for rattler-build:

1. After the fact, record the content-hash for future re-builds as extra-assurance in case the SHA256 of the artifact becomes outdated.
2. We could use it as an input alternative to sha256 in the URL field. However, to create the content hash, I would add a CLI option in rattler-build that would look something like:

`rattler-build prepare-source --version 0.1.2 --recipe ./recipe.yaml` and then the content hash gets injected or printed to the CLI.

jaimergp · 2024-11-26T12:19:11Z

> Well I still don't see a point in adding additional algorithms beyond "it's easy" - nobody should use them.

My point is that it's trivial to change this at any point during the review process; I'm more concerned with the actual hashing scheme proposed. The keys being added to the meta.yaml are secondary to that. For clarity, I'm fine with just content_sha256 and nothing else. Is there a reason an old system wouldn't have access to sha256? They'd have bigger problems, right?

jaimergp · 2024-11-26T12:21:13Z

> I would add a CLI option in rattler-build that would look something like [...]

Hm, true, we could add a little subcommand here to make it easier. Although honestly, I usually run conda-build, wait for the hashing mismatch error, and then copy the correct one 😬 (That's why I amended the text in the errors hah).

schuylermartin45 · 2024-11-26T14:27:31Z

I don't think it is wise to add support for two hashing algorithms with known vulnerabilities in 2024. Although it may be unlikely, that smells like an avenue for a supply chain attack to me.

I think if we were to support multiple hashing algorithms, we should support algorithms that are still deemed viable and secure, like the other SHA-* bit lengths.

jaimergp · 2024-11-26T14:48:57Z

#4793 will probably land soon. I'll update this branch with it once it reaches main and then remove the weak hashes. If we are that concerned about md5 and sha1 being used, we should also study the possibility of deprecating them or at least warning about them in the logs.

beckermr · 2024-11-26T14:54:46Z

We'll need a lot of advance notice to deprecate them on conda-forge. For now we should probably add a lint + minimigrator to move to sha256.

wolfv · 2024-11-26T14:59:46Z

I am not concerned, I just don't understand why we want to add them? IMO it doesn't add value. Having the MD5 hash available for the regular hash makes sense because some packages might publish the known-good value (and that can be used in the recipe), but for something that we have invented ourselves I don't see a use-case where it is justified to use anything other than the best available hash.

beckermr

LGTM! We should probably vote on the CEP before merging.

jezdez · 2024-11-27T21:23:53Z

Setting as blocked on the CEP vote

jezdez · 2024-11-27T21:27:01Z

conda_build/utils.py

+            try:
+                try:


That second try block isn't needed, since we can catch multiple exception types in the same block

jaimergp · Nov 27, 2024

Sure, but I want to catch the potential OSError in the except UnicodeDecodeError arm. Will that raised exception be caught in the try/except block? IOW, will this print "Hello"? I don't think it does:

    try:
        raise ValueError
    except ValueError:
        raise RuntimeError
    except RuntimeError:
        print("Hello")
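
The behavior in question can be checked with a small self-contained comparison. The function names `flat` and `nested` are made up for this illustration. An exception raised inside an `except` arm is not handled by sibling arms of the same `try` statement, so it propagates unless an enclosing `try` catches it.

```python
def flat():
    """Sibling handlers: the RuntimeError arm never sees the new exception."""
    try:
        raise ValueError("first")
    except ValueError:
        raise RuntimeError("raised inside handler")
    except RuntimeError:
        return "caught by sibling handler"


def nested():
    """An enclosing try statement does catch what the inner handler raises."""
    try:
        try:
            raise ValueError("first")
        except ValueError:
            raise RuntimeError("raised inside handler")
    except RuntimeError:
        return "caught by outer handler"


try:
    flat()
except RuntimeError as exc:
    result = f"propagated: {exc}"
```

This is why the second `try` block is kept when the handler itself may raise an exception that also needs handling.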

jezdez · 2024-11-27T21:28:05Z

conda_build/utils.py

+    str
+        The hexdigest of the computed hash, as described above.
+    """
+    log = get_logger(__name__)


Let's move that to the top of the module per best practice

jaimergp · Nov 27, 2024

I was following the practice followed in the other functions, fwiw.

jaimergp · Nov 27, 2024

Ah, now I see why. The get_logger utility is defined in that module, so there's no top-level function to use. If anything, it would go at the bottom of the module? Do you prefer that or shall we leave it in-function?

jezdez · 2024-11-27T21:34:26Z

conda_build/utils.py

+                    lines = []
+                    with open(path) as fh:
+                        for line in fh:
+                            # Accumulate all line-ending normalized lines first
+                            # to make sure the whole file is read. This prevents
+                            # partial updates to the hash with hybrid text/binary
+                            # files (e.g. like the constructor shell installers).
+                            lines.append(line.replace("\r\n", "\n"))


Hmm, that might be a memory hog, depending on how big the files you're normalizing are. It might be best to write this to a temp file.

jaimergp · Nov 27, 2024

Hm, good point. Didn't the stdlib have a temporary file object that only writes to disk after a certain size? 🤔

jaimergp · Nov 27, 2024

Yea, SpooledTemporaryFile. Added it in c192799 (#5277)
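
A minimal sketch of the idea, not the code from c192799: normalized lines are buffered in a `tempfile.SpooledTemporaryFile`, which stays in memory up to `max_size` and rolls over to disk beyond that, and the hash is only updated once the whole file has been read. The function name, the 10MB constant name, the UTF-8 encoding, and `newline=""` are assumptions for this example.

```python
import hashlib
import tempfile

MAX_IN_MEMORY = 10 * 1024 * 1024  # 10MB before spilling to disk (assumed value)


def hash_text_normalized(path, algorithm="sha256"):
    """Hash a text file with CRLF line endings normalized to LF."""
    hasher = hashlib.new(algorithm)
    with tempfile.SpooledTemporaryFile(max_size=MAX_IN_MEMORY) as buffer:
        # newline="" keeps the original line endings so they can be replaced
        # explicitly; a decoding error here leaves the hasher untouched.
        with open(path, encoding="utf-8", newline="") as fh:
            for line in fh:
                buffer.write(line.replace("\r\n", "\n").encode("utf-8"))
        buffer.seek(0)
        # Feed the hasher in chunks only after the full file was read.
        for chunk in iter(lambda: buffer.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Because nothing reaches the hasher until reading succeeds, a caller could fall back to hashing the raw bytes when a `UnicodeDecodeError` is raised partway through a hybrid text/binary file.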

jezdez · 2024-11-27T21:36:57Z

conda_build/utils.py

+            except OSError as exc:
+                log.warning(
+                    "Can't open file %s. Hashing path only...", path.name, exc_info=exc
+                )
+        else:
+            log.warning("Can't detect type for path %s. Hashing path only...", path)
+            hasher.update(b"?")


I'm not completely following this error state handling, why doesn't this stop the process since it can't read the file? Wouldn't that indicate that the recipe is faulty?

jaimergp · Nov 27, 2024

We don't know what kind of files a user will have in that directory. They might point path to something containing a device file or who knows what. Not really common practice, but that doesn't mean that their source is invalid or that we can't verify that the other stuff is actually the same.

jaimergp · Nov 27, 2024

I made this error out because it's essentially a file we can't verify, and we don't know what it might be hiding. If it causes errors, users can deliberately skip it via the skip parameter.
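
The resulting behavior (fail on anything that cannot be verified, unless it is deliberately skipped) could look roughly like the sketch below. The function name `entries_to_hash` and the glob-pattern semantics of `skip` are assumptions. The PR only states that a skip parameter exists.

```python
import fnmatch
from pathlib import Path


def entries_to_hash(root, skip=()):
    """Yield (relative_path, kind) pairs, raising on unverifiable entries."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        # Assumed semantics: skip holds glob patterns on the relative path.
        if any(fnmatch.fnmatch(rel, pattern) for pattern in skip):
            continue
        if path.is_dir():
            yield rel, "dir"
        elif path.is_file():
            yield rel, "file"
        else:
            # Device files, FIFOs, sockets and similar cannot be verified.
            raise RuntimeError(
                f"Can't detect type for path {rel}; skip it explicitly to ignore it"
            )
```

Raising by default means an unverifiable entry is reported rather than silently hashed by path only.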

jaimergp · 2024-12-21T10:54:06Z

@jezdez, the CEP passed, so I guess this review can be dismissed now.

beckermr

Some comments.

add content_sha256 hash checks

eac67a9

jaimergp requested a review from a team as a code owner April 12, 2024 15:41

jaimergp marked this pull request as draft April 12, 2024 15:41

conda-bot added the cla-signed label Apr 12, 2024

jaimergp added 7 commits April 12, 2024 18:41

fix algo id

af571af

pre-commit

08d7691

extend tests and include path, type and executable bit in the hash

19235f1

make it cross-platform

704ba21

add news

0426db2

use dash separator

47fe18d

update hashes

ab810a4

jaimergp added 7 commits June 18, 2024 20:46

Merge branch 'main' into content-hash

91d3a4d

Update source.py

4e0f6dd

Merge branch 'main' of github.com:conda/conda-build into content-hash

002b309

change algorithm a bit and update tests

1439e4e

move to Path.rglob() and allow skips

4f4178b

register new keys

190e120

update recipe

4b3d56d

[pre-commit.ci] auto fixes from pre-commit.com hooks

f513069

for more information, see https://pre-commit.ci

jaimergp mentioned this pull request Nov 20, 2024

Standardize algorithm for directory hashing conda/ceps#100

Merged

jaimergp changed the title ~~add content_sha256 hash checks~~ Add new source hashing methods: content_md5, content_sha1, content_sha256 Nov 20, 2024

jaimergp added 3 commits November 20, 2024 01:51

add docs

c409505

pre-commit

5327e4a

normalize line endings

27b9eaf

pre-commit

d97d081

beckermr previously approved these changes Nov 26, 2024

jaimergp added 2 commits November 26, 2024 16:02

Merge branch 'main' of github.com:conda/conda-build into content-hash

36f23c3

drop content_{md5,sha1} and add content_{sha384,sha512}

2afa293

jaimergp dismissed beckermr’s stale review via 2afa293 November 26, 2024 15:09

add here too

98d8813

jaimergp changed the title ~~Add new source hashing methods: content_md5, content_sha1, content_sha256~~ Add new source hashing methods: content_sha256, content_sha384, content_sha512 Nov 26, 2024

beckermr previously approved these changes Nov 27, 2024

jezdez requested changes Nov 27, 2024

use a 10MB SpooledTemporaryFile

c192799

jaimergp dismissed beckermr’s stale review via c192799 November 27, 2024 22:05

jaimergp added 2 commits November 27, 2024 23:07

pre-commit

5edfb20

do error on unreadable files and unknown types

bcc7ad5

jaimergp requested a review from a team December 21, 2024 10:54

beckermr approved these changes Dec 21, 2024
