Proposal: DOWNLOAD - a new command for cache-aware downloads #3948

Open
idodod opened this issue Mar 27, 2024 · 5 comments
Labels
type:proposal A proposal for a new feature

Comments

@idodod
Contributor

idodod commented Mar 27, 2024

Use case

As a developer, I often need to download files as part of my build.
This can be done by executing curl or wget (RUN curl <url>).
However, the URL may stay constant while the file on the remote server changes. Because the RUN command is cached by default, the build might not use the most recent version of the file, contrary to what the developer intends.

An alternative is to always force the download by using RUN --no-cache. However, this is inefficient, since the file can be quite large and the cache gets busted for all subsequent steps in the target.

It would be good to introduce functionality in the Earthfile syntax that supports cache-aware downloads out of the box, similar to how Docker image pulls are aware of digests in a FROM <image> statement, how GIT CLONE is aware of commit hashes, and how COPY can tell whether a file in the build context has changed.

Expected Behavior

To accomplish the above, we can introduce a new command, DOWNLOAD, which, under the hood, can use the If-Modified-Since or If-None-Match request headers to issue a conditional GET request (provided the server returns the corresponding Last-Modified or ETag response headers, referred to below as tags).

If the server does not maintain these tags, a warning will be printed to the user and the cache behavior will fall back to "always cached", as happens today with RUN curl as described above (see the --no-cache flag description below for changing this behavior).

Additionally, running earthly with the --verbose flag should display the tag values in the request and the response.
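
To make the conditional GET flow concrete, here is a minimal Python sketch of the header logic only. It is not how Earthly or BuildKit implements anything; the names Validators, conditional_headers and interpret_response, and the returned action strings, are hypothetical and chosen for illustration. It assumes the validators from the previous download were stored next to the cached file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Validators:
    """Cache validators remembered from the previous successful download."""
    etag: Optional[str] = None           # value of the ETag response header
    last_modified: Optional[str] = None  # value of the Last-Modified response header


def conditional_headers(prev: Validators) -> dict:
    """Build the request headers that turn a plain GET into a conditional GET."""
    headers = {}
    if prev.etag:
        headers["If-None-Match"] = prev.etag
    if prev.last_modified:
        headers["If-Modified-Since"] = prev.last_modified
    return headers


def interpret_response(status: int, response_headers: dict) -> str:
    """Map the server's reply to a (hypothetical) cache action."""
    if status == 304:
        # Not Modified: the cached file is still current.
        return "reuse-cache"
    if status != 200:
        return "error"
    # Header names are case-insensitive, so compare in lower case.
    names = {name.lower() for name in response_headers}
    if "etag" in names or "last-modified" in names:
        return "store"
    # No validators: the file is stored, but future runs cannot revalidate it,
    # which is the case where the proposal prints a warning.
    return "store-and-warn"
```

For example, conditional_headers(Validators(etag='"abc"')) yields {"If-None-Match": '"abc"'}, and a 304 reply maps to "reuse-cache".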

For example (How to use aws-cli):

VERSION 0.8

run-aws:
    FROM alpine
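    # DOWNLOAD is the proposed command; the file keeps its remote name.
    DOWNLOAD https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip
    RUN unzip awscli-exe-linux-x86_64.zip
    # The alpine image runs as root by default, so sudo is not needed.
    RUN ./aws/install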
    RUN aws-cli ....

run-aws-old:
    FROM alpine
    RUN apk add curl
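    # -O saves the file under its remote name.
    RUN curl -O https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip
    RUN unzip awscli-exe-linux-x86_64.zip
    # The alpine image runs as root by default, so sudo is not needed.
    RUN ./aws/install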
    RUN aws-cli ....

When the developer builds +run-aws, the DOWNLOAD command will ensure the file is downloaded only if it has not been downloaded before or if it has changed since the last download. As a result, RUN aws-cli will be invalidated only when a new version of the executable is available.

For comparison, the +run-aws-old cache will not be invalidated unless --no-cache is used, so RUN aws-cli may keep running against a stale copy of the executable.
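
The difference between the two targets can be illustrated with a deliberately simplified model of step cache keys. This is a sketch under stated assumptions, not BuildKit's actual cache key algorithm: step_key, content_digest and the two chain functions are hypothetical names, and a key is modeled as a hash of the parent key, the command text, and an optional input digest.

```python
import hashlib

URL = "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"


def step_key(parent_key: str, command: str, input_digest: str = "") -> str:
    """Simplified cache key: parent key + command text + digest of any inputs."""
    material = "\n".join([parent_key, command, input_digest])
    return hashlib.sha256(material.encode()).hexdigest()


def content_digest(data: bytes) -> str:
    """Digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def run_curl_chain(remote_bytes: bytes) -> str:
    """Key of the final step in +run-aws-old: the remote bytes never enter the key."""
    download = step_key("alpine", f"RUN curl -O {URL}")
    return step_key(download, "RUN aws-cli ....")


def download_chain(remote_bytes: bytes) -> str:
    """Key of the final step in +run-aws: the file content feeds the key."""
    download = step_key("alpine", f"DOWNLOAD {URL}", content_digest(remote_bytes))
    return step_key(download, "RUN aws-cli ....")
```

In this model, run_curl_chain returns the same key whether the server holds version 1 or version 2 of the file, while download_chain changes only when the bytes change, which is the invalidation behavior described above.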

An additional possible benefit of this new feature is that the developer would not need to install a tool like curl before attempting to download a file.

Some future/nice-to-have flags:

  1. --token <your-token> - pass an authentication token in the request header for private servers.
  2. --output <file-path> - where the file should be downloaded to (default: working directory).
  3. --chmod - it's common for a downloaded file to be executed. With this flag, no additional RUN chmod <your-file> command is needed.
  4. --no-cache - force a redownload of the file, similar to how this flag works in a RUN command.

Open Questions:

  1. Similar behavior might be desired for uploading files. Should this command support uploads as well (and be named accordingly), or should we introduce a separate command?
  2. Should a --no-cache flag always force redownloading of a file, or only in case the server does not support relevant tags?
  3. Should we let the developer decide which type of tag the conditional download request is based on, or should it be inferred?
@idodod added the type:proposal label (A proposal for a new feature) on Mar 27, 2024
@vladaionescu
Member

Incidentally, I think that BuildKit has some sort of support for this kind of "source" (the term that BuildKit uses for cache-aware inputs like this), because Dockerfiles have the ADD command, which can download files from a URL and unpack local archives. So implementing this shouldn't be too difficult.

@jmgilman

jmgilman commented Mar 27, 2024

Feedback on the questions:

  1. I would rate this as a lower priority, but if it turns out to be technically similar to downloading then it's probably worth throwing in at the same time.
  2. I would vote for the former: always redownloading. As a user, I would find it bizarre if I passed that flag and it didn't download the file (as the cache would be the only thing in the way).
  3. I think we identify the possible ways to infer a change, pick one as the default, and then add flags for the other types. Automatic detection is nice, but also sort of mystical, and I think being explicit here has benefits.
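
One way to read the "pick a default, add flags for the others" suggestion is sketched below in Python. The mode values "etag" and "last-modified" and the function pick_validator are hypothetical, not an existing Earthly flag or API; the sketch only shows an explicit choice with no automatic detection.

```python
from typing import Optional, Tuple

# Hypothetical mode values a flag could accept; "etag" is assumed as the default.
SUPPORTED_MODES = ("etag", "last-modified")


def pick_validator(response_headers: dict, mode: str = "etag") -> Optional[Tuple[str, str]]:
    """Return the (request header, value) pair for the next conditional GET.

    Returns None when the server did not send the validator that was asked for,
    so the caller can warn instead of silently switching to another one.
    """
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Header names are case-insensitive, so normalize before the lookup.
    headers = {name.lower(): value for name, value in response_headers.items()}
    value = headers.get(mode)
    if value is None:
        return None
    request_header = "If-None-Match" if mode == "etag" else "If-Modified-Since"
    return (request_header, value)
```

With this shape, a server that sends only Last-Modified yields None under the default mode, which makes the mismatch visible to the user rather than hiding it behind inference.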

@vladaionescu changed the title from "A new command for cache-aware downloads" to "Proposal: DOWNLOAD - a new command for cache-aware downloads" on Mar 27, 2024
@idodod
Contributor Author

idodod commented Mar 28, 2024

Feedback on the questions:

  2. I would vote for the former: always redownloading. As a user, I would find it bizarre if I passed that flag and it didn't download the file (as the cache would be the only thing in the way).

I agree. What I meant is that the name and functionality of the flag might differ depending on the behavior.
For example, it could be something like --cache-mode=yes/no/depends (obviously with better names).
Essentially, I can see some users wanting DOWNLOAD to be cache-aware but to fall back to the RUN curl or RUN --no-cache curl behavior if the server does not send the tags in the response, plus a third option that always disables the cache-aware behavior (even if the server sends the required tags).
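
The three behaviors can be sketched as a small decision function. This is one possible reading, not a settled design: the CacheMode names and the plan function are hypothetical, and the third mode is assumed here to mean "always redownload".

```python
from enum import Enum


class CacheMode(Enum):
    # Hypothetical names standing in for the yes/no/depends placeholders.
    AWARE = "aware"    # revalidate; without tags, keep the cached copy (like RUN curl)
    STRICT = "strict"  # revalidate; without tags, redownload (like RUN --no-cache curl)
    OFF = "off"        # never revalidate; assumed to always redownload


def plan(mode: CacheMode, have_cached_copy: bool, server_has_tags: bool) -> str:
    """Return the action to take for one DOWNLOAD invocation."""
    if not have_cached_copy or mode is CacheMode.OFF:
        return "download"
    if server_has_tags:
        # Both cache-aware modes behave the same when the server cooperates.
        return "conditional-get"
    # The server sends no tags: the two cache-aware modes differ only here.
    return "use-cache" if mode is CacheMode.AWARE else "download"
```

The modes only diverge when a cached copy exists and the server sends no tags, which suggests the flag is really choosing the fallback rather than the main behavior.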

@eliottwiener

I would probably use this if it were included. I like how Bazel's HTTP rules handle this. The cache key is effectively a combination of the "canonical ID" and the file checksum. The "canonical ID" may be set explicitly, and defaults to the URL. It is typical and recommended to include the expected checksum for the file to ensure the result is deterministic, and to mitigate the security risk. I would like it if DOWNLOAD provided similar parameters.
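
As a rough illustration of the idea (not Bazel's actual implementation), a cache key built from a canonical ID plus the expected checksum, together with a checksum check on the downloaded bytes, could look like the following Python sketch. The function names cache_key and verify are hypothetical.

```python
import hashlib
from typing import Optional


def cache_key(url: str, sha256: str, canonical_id: Optional[str] = None) -> str:
    """Combine the canonical ID (defaulting to the URL) with the expected checksum.

    Hashing the pair gives a fixed-length key that is safe to use as a file name.
    """
    cid = canonical_id if canonical_id is not None else url
    return hashlib.sha256(f"{cid}\n{sha256.lower()}".encode()).hexdigest()


def verify(data: bytes, expected_sha256: str) -> None:
    """Raise ValueError if the downloaded bytes do not match the expected checksum."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
```

Because the checksum is part of the key, bumping the expected checksum forces a fresh download even when the URL is unchanged, and the result stays deterministic.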

@eliottwiener

Here's an example of an Earthly target that implements something like http_file:

VERSION 0.8

# http-download downloads a file from the given $URL via http/https.
# The file will be verified using the given $SHA256 checksum.
# This target maintains a cache to avoid re-downloading files.
# The download is cached using a combination of the $CANONICAL_ID (which
# defaults to $URL if not provided), and the given $SHA256 checksum.
# The download is performed with curl. You may use $CURL_FLAGS to modify the
# download behavior as needed.
http-download:
	FROM alpine:3
	CACHE /var/cache/apk
	RUN apk add curl
	CACHE --persist /download_cache
	ARG --required URL
	ARG --required SHA256
	ARG CANONICAL_ID="$URL"
	ARG CURL_FLAGS="--silent --fail --location --retry 3 --max-time 300"
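	# base64 output may contain "/" and "+"; map them so the result is a single valid file name.
	LET CACHE_PATH="/download_cache/$(printf '%s' "$CANONICAL_ID" | base64 -w 0 | tr '/+' '_-')$SHA256"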
	IF --no-cache ! test -f "$CACHE_PATH"
		LET TEMP_PATH="/downloaded_file"
		RUN --no-cache curl $CURL_FLAGS --output "$TEMP_PATH" "$URL"
		RUN --no-cache printf '%s  %s' "$SHA256" "$TEMP_PATH" | sha256sum -cw
		RUN --no-cache mv "$TEMP_PATH" "$CACHE_PATH"
	END
	SAVE ARTIFACT "$CACHE_PATH" /file

example-usage:
	FROM scratch
	COPY (+http-download/file --URL="https://github.com/earthly/earthly/releases/download/v0.8.11/earthly-linux-amd64" --SHA256=1515844da174e77f3c31d68634397c6c66812b47f9cbe22a73a43dec1dc48c98) /earthly-0-8-11
	COPY (+http-download/file --URL="https://github.com/earthly/earthly/releases/download/v0.8.10/earthly-linux-amd64" --SHA256=f07d640d49ebc2e50336068443f4eb5f123aa6e5eedd3acad859b1d7d2690a85) /earthly-0-8-10
