Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize algorithm for directory hashing #100

Merged
merged 11 commits into from
Dec 19, 2024

Conversation

jaimergp
Copy link
Contributor

@jaimergp jaimergp commented Nov 20, 2024

Standardizes findings in conda/conda-build#5277.

🔎 Preview in Markdown

@jaimergp jaimergp marked this pull request as draft November 20, 2024 00:40
@jaimergp
Copy link
Contributor Author

cc @wolfv

@baszalmstra
Copy link
Contributor

In pixi we often compute these hashes in a similar way. We also normalize line endings in text files because on windows git often normalizes these as well.

@jaimergp
Copy link
Contributor Author

In pixi we often compute these hashes in a similar way. We also normalize line endings in text files because on windows git often normalizes these as well.

Nice that you mentioned this because I think I just run into that very issue: https://github.com/conda/conda-build/actions/runs/11924575237/job/33235172020?pr=5277#step:11:424

cep-00??.md Outdated

## Specification

Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry in the contents table, compute the hash for the concatenation of:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to better specify what "sort" means, and in particular, what collation ordering we want to use. Because:

$ echo -e "aa\na-a\nab\na-b" | LC_ALL=en_US.UTF-8 sort
a-a
aa
a-b
ab

$ echo -e "aa\na-a\nab\na-b" | LC_ALL=C.UTF-8 sort
a-a
a-b
aa
ab

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like Python's sort([str, ...]), it's locale agnostic. See 378d1fd (#100)

@jaimergp
Copy link
Contributor Author

jaimergp commented Nov 25, 2024

I'm happy with how this is looking now. The implementation in conda-build is passing the necessary tests. Any thoughts about this @wolfv @baszalmstra? Otherwise I'll probably start a vote soon once the @conda/builds-tools team is happy with the PR.

@baszalmstra
Copy link
Contributor

We have not started an implementation yet but it seems straightforward enough. Id be happy to vote on this now.

@jaimergp
Copy link
Contributor Author

@conda/steering-council, I'd like to start a vote starting December the 4th (so we can pair it up with the ABI3 one).

So, if you'd like to make any CEP-altering comments, this is your chance. Once the vote starts, we should only make editorial corrections. Thanks!

@jaimergp jaimergp marked this pull request as ready for review November 27, 2024 15:51

## Motivation

Build tools like `conda-build` and `rattler-build` need to fetch the source of the project being packaged. The integrity of the download is checked by comparing its known hash (usually SHA256) against the obtained file. If they don't match, an error is raised.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they don't match, an error is raised.

This doesn't say what build tools should do when the files can't be read or an error is raised during computation, and the implementation in https://github.com/conda/conda-build/pull/5277/files#r1861269213 shows that it ignores file read errors (which can have many reasons) and only creates the content hash with the name of the file. I'd think that's a failure state for a content hash algorithm that aims to be consistent since it would expose it to file permission attacks etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow how this could be weaponized as an attack vector. Let's say we have a directory with three files with (name, content):

  • ('file1.txt', '123')
  • ('file2.txt', '456')
  • ('file3.txt', '789')

We end up computing the hash of these Utf-8 bytes:

file1.txtF123-file2.txtF456-file3.txtF789-
>>> import hashlib
>>> hashlib.md5(b"file1.txtF123-file2.txtF456-file3.txtF789-").hexdigest()
'54866bc311f08b2e082466b090cbe560'

If let's say file1.txt changes permissions and becomes unreadable, then the string would be:

file1.txtF-file2.txtF456-file3.txtF789-

, which has a different hash.

If file1.txt becomes a directory it would then be:

file1.txtD-file2.txtF456-file3.txtF789-

Same for an unknown file type (but ? instead of D).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, let's say that we had an empty file, and then a new source adds some malicious content there, but makes it unreadable. They would have the same hash because we are not marking errors... The build script would just need to make it readable with chmod later and then it can be weaponized. Ok, now I get it: we need to mark errors with a different separator: maybe E. Right?

Thank you for reading along!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe we do need to error out, because an unreadable file might change contents arbitrarily and its hash would never change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 77e35c0

@jaimergp
Copy link
Contributor Author

jaimergp commented Dec 4, 2024

Hello @conda/steering-council. Requesting a vote for this CEP.

This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy,
please vote and/or comment on this proposal at your earliest convenience.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please use the following form to vote.

This vote will end on 2024-12-18, End of Day, Anywhere on Earth (AoE).

@xhochy (Uwe Korn)

  • yes
  • no
  • abstain

@CJ-Wright (Christopher J. 'CJ' Wright)

  • yes
  • no
  • abstain

@mariusvniekerk (Marius van Niekerk)

  • yes
  • no
  • abstain

@chenghlee (Cheng H. Lee)

  • yes
  • no
  • abstain

@ocefpaf (Filipe Fernandes)

  • yes
  • no
  • abstain

@marcelotrevisani (Marcelo Duarte Trevisani)

  • yes
  • no
  • abstain

@msarahan (Michael Sarahan)

  • yes
  • no
  • abstain

@mbargull (Marcel Bargull)

  • yes
  • no
  • abstain

@jakirkham (John Kirkham)

  • yes
  • no
  • abstain

@jezdez (Jannis Leidel)

  • yes
  • no
  • abstain

@wolfv (Wolf Vollprecht)

  • yes
  • no
  • abstain

@jaimergp (Jaime Rodríguez-Guerra)

  • yes
  • no
  • abstain

@kkraus14 (Keith Kraus)

  • yes
  • no
  • abstain

@baszalmstra (Bas Zalmstra)

  • yes
  • no
  • abstain

@beckermr (Matthew R. Becker)

  • yes
  • no
  • abstain

@Hind-M (Hind Montassif)

  • yes
  • no
  • abstain

@trallard (Tania Allard)

  • yes
  • no
  • abstain

@mariusvniekerk
Copy link

How much value is there in the utf8 and text normalization pieces.

Presumable the hash of a prefix for different platforms will basically never be the same due to differences in packages.

@jaimergp
Copy link
Contributor Author

jaimergp commented Dec 4, 2024

Presumable the hash of a prefix for different platforms will basically never be the same due to differences in packages.

This is mostly directed at hashing the sources we fetch from remotes. It's usually the same source for all platforms. Not sure how useful this could be for hashing $PREFIX itself, but that's not planned anyway.

@jaimergp jaimergp added the vote Voting following governance policy label Dec 4, 2024
@jaimergp
Copy link
Contributor Author

jaimergp commented Dec 19, 2024

The vote is closed, and we have the following result:

  • Total possible voters: 17.
  • Valid votes: 10 / 17 = 58.8%.

Yes votes (10):

  1. @xhochy
  2. @chenghlee
  3. @ocefpaf
  4. @jezdez
  5. @wolfv
  6. @jaimergp
  7. @baszalmstra
  8. @beckermr
  9. @Hind-M
  10. @trallard

No votes (0).

Abstain votes (0).

Not voted (7):

  1. @CJ-Wright
  2. @jakirkham
  3. @mariusvniekerk
  4. @marcelotrevisani
  5. @msarahan
  6. @mbargull
  7. @kkraus14

Invalid votes (0).

10/17 votes = 58.8% does not meet the minimum quorum (60%), but timeout rules can be invoked because:

  • Vote was open for at least two weeks
  • Three reminders were sent (start, middle, end) via chat room and email
  • It was presented and discussed in at least one community meeting

Thus, the affirmative votes are computed against the valid votes, which results in 10/10 YES votes. This is 100%, which is more than the required 60%.

As per the governance, this means the vote passed the moment another @conda/steering-council member confirms that timeout quorum rules were met.

@beckermr beckermr merged commit 3cf99ab into conda:main Dec 19, 2024
1 check failed
@jaimergp jaimergp mentioned this pull request Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
vote Voting following governance policy
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants