
New storage backend model #1214

Open
jrosdahl opened this issue Nov 7, 2022 · 9 comments
Labels
feature New or improved feature

Comments

@jrosdahl
Member

jrosdahl commented Nov 7, 2022

Background

Ccache currently has file, HTTP and Redis remote storage backends. The file and HTTP backends do not depend on any external libraries. The Redis backend depends on the small and ubiquitous Hiredis library. The remote storage backends are part of the ccache source tree and are compiled and linked statically with the ccache executable.

It would be very nice to support more protocols, like HTTPS (#890, #894), Redis over TLS (#902), Azure Blob Storage (#1152), AWS S3 (#1201), Google Cloud Storage (in case the HTTP/HTTPS backend does not suffice) and other cloud services and custom backends.

My approach has been to start out with only bundling backends that have no external dependencies, or only external dependencies that are ubiquitous and small enough. One reason for this is that I want to keep the startup of the ccache executable fast. For instance, linking with libcurl makes the startup a factor 4 slower on my system. Another aspect is that I would prefer not to have to maintain code that I can't easily test myself, such as backends for various cloud services. It would be much better if the people who are interested in a backend are the ones who maintain the code. (It's currently only me who is maintaining ccache and my spare time is not exactly abundant.) And a third aspect is one of distribution: I would like to be able to distribute a ccache package (for instance as part of a Linux distribution) that does not depend on libraries that are not needed for the basic use case (i.e., not using remote storage) and then have support for different remote storage backends in optional add-on packages. This is partly why I have been reluctant to add optional (at compile time) HTTPS support since I want a solution that does not depend on compile-time choices.

Another problem with the current backend framework is that ccache can't keep connections alive, and thus can't reuse sessions that are costly to set up.

Proposal

As mentioned in #894 (reply in thread), I propose that we make ccache automatically start a long-lived protocol-specific helper process (if not already started) and communicate with it over a Unix socket.

Here is a rough design sketch of how it could work, taking HTTPS as an example protocol:

  1. Say that ccache has been configured with remote_storage = https://user:[email protected]/path|param=value.
  2. Ccache connects to a Unix socket named something like ${CACHE_TEMPDIR}/backend-<name>.sock where <name> is a unique hash of the URL and applicable parameters.
    • This makes it easy to handle configuration changes: a new helper process will be started for the new configuration and the old helper process will terminate itself after a while.
  3. If the connection is refused or the socket doesn't exist:
    1. Ccache looks for an executable called ccache-backend-https in some (configurable) libexec location. Maybe also check in $PATH?
    2. Ccache starts ccache-backend-https as a background (daemon) process and passes the socket path, the URL and other configuration as environment variables.
    3. The helper process creates the socket and starts accepting connections to it.
    4. The helper process exits when it has been idle for some time. (10 minutes? 1? Could be configurable.)
  4. Ccache communicates with the helper process over the socket using some yet to be defined but simple protocol.
    • The protocol should ideally be simple enough that no special library is required. This makes it possible to implement the backend in any language.
    • Unix sockets are also supported on Windows these days.
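Steps 2–3 above could be sketched roughly as follows. This is purely illustrative, not ccache's actual code: the hashing scheme, the environment variable names, the executable lookup and the retry loop are all assumptions.

```python
import hashlib
import os
import socket
import subprocess
import time

def socket_path(cache_tempdir, url, params):
    # Step 2: derive a unique socket name from the URL and parameters so that
    # a configuration change transparently targets a fresh helper process.
    key = url + "|" + "|".join(sorted(params))
    name = hashlib.sha1(key.encode()).hexdigest()[:16]
    return os.path.join(cache_tempdir, "backend-%s.sock" % name)

def _connect(path):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    return s

def connect_or_spawn(cache_tempdir, scheme, url, params):
    # Step 3: try to connect; on failure, start ccache-backend-<scheme> as a
    # daemonized helper and retry while it creates the socket.
    path = socket_path(cache_tempdir, url, params)
    try:
        return _connect(path)
    except OSError:
        pass
    env = dict(os.environ,
               CCACHE_BACKEND_SOCKET=path,  # hypothetical variable names
               CCACHE_BACKEND_URL=url)
    subprocess.Popen(["ccache-backend-" + scheme], env=env,
                     start_new_session=True)
    for _ in range(50):  # ~5 s grace period for the helper to come up
        try:
            return _connect(path)
        except OSError:
            time.sleep(0.1)
    raise RuntimeError("backend helper did not start")
```

Because the socket name is a pure function of the configuration, no coordination is needed between ccache processes: they all compute the same path, and stale helpers for old configurations simply idle out.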

Advantages:

  • The helper process can keep connections alive, thus amortizing the session setup cost and avoiding flooding the server with one connection per compilation.
  • The startup of the ccache executable is kept fast.
  • Backends can depend on any libraries since the time to start a helper process is not very important.
  • Backend implementations can be part of ccache's code tree, or part of another project with a different release cycle or maintainership.
  • Backends can be built, packaged and distributed separately.
  • Backends can be implemented in any language.
  • Apart from installing the ccache-backend-<protocol> executable, there is no need to install, configure, start and monitor a separate daemon process. Things will Just Work with the same remote_storage configuration as before.

If this is implemented, the existing HTTP and Redis backends would be converted to the new mechanism. The file backend would still be kept as is.
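To make the "yet to be defined but simple" protocol from step 4 concrete, here is one hypothetical shape for the framing: a single opcode byte followed by a length-prefixed payload, which any language can emit without a serialization library. The opcodes and layout are invented for illustration only.

```python
import struct

# Hypothetical opcodes: get, put, hit, miss, ok (not ccache's actual protocol).
OP_GET, OP_PUT, OP_HIT, OP_MISS, OP_OK = b"G", b"P", b"H", b"M", b"K"

def encode(op, payload=b""):
    # One opcode byte, a 4-byte big-endian payload length, then the payload.
    return op + struct.pack(">I", len(payload)) + payload

def decode(buf):
    # Returns (opcode, payload, remaining bytes not yet consumed).
    op = buf[0:1]
    (length,) = struct.unpack(">I", buf[1:5])
    return op, buf[5:5 + length], buf[5 + length:]
```

A framing this trivial keeps the door open for helper implementations in shell-script-adjacent languages, while still carrying binary cache objects without base64 overhead.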

@jrosdahl jrosdahl added the feature New or improved feature label Nov 7, 2022
This was referenced Nov 7, 2022
@afbjorklund
Contributor

If this is implemented, the existing HTTP and Redis backends would be converted to the new mechanism. The file backend would still be kept as is.

I think the HTTP and Redis backend should also be kept as-is (due to not requiring external libraries), but that the HTTPS and Rediss (TLS) backends would have a different way of loading so that they can depend on OpenSSL and similar libraries...

But I thought that it would use dlopen (.so files) for this, rather than RPC?

There were some thoughts about cmake implementation in #894 (comment)

@jrosdahl
Member Author

jrosdahl commented Nov 8, 2022

I think the HTTP and Redis backend should also be kept as-is (due to not requiring external libraries)

Could you expand a bit on why you think that would be a good idea?

I'm thinking that it would be better to focus on a unified http+https implementation and a unified redis+rediss implementation.

From my point of view, my proposal would solve all issues I'm aware of with the current framework (and I wish that I had thought of that approach in #414). But it's of course so far only an untested idea, so it will need some testing to see if it flies.

but that the HTTPS and Rediss (TLS) backends would have a different way of loading so that they can depend on OpenSSL and similar libraries...

But I thought that it would use dlopen (.so files) for this, rather than RPC?

Since dynamically loading code won't solve the problem with keeping sessions alive, I don't think that there is a need for a dlopen-based plugin system. What advantages do you see with doing it that way?

@afbjorklund
Contributor

I think the HTTP and Redis backend should also be kept as-is (due to not requiring external libraries)

Could you expand a bit on why you think that would be a good idea?

I would like to see them "included" by default, otherwise I think they will just be unconfigured and uninstalled...

But I suppose that is already happening*, so it wouldn't change much from the current situation either way?

* i.e. when using REDIS_STORAGE_BACKEND=OFF to disable the feature 😔

It seems unlikely that anything will replace NFS, at least for the enterprise.

Since dynamically loading code won't solve the problem with keeping sessions alive, I don't think that there is a need for a dlopen-based plugin system.

What advantages do you see with doing it that way?

It seemed like a simpler solution, even if it only solved half the problem (making life easier when not using it)

My thinking was that there was room for both options: loading some plugins and setting up a storage backend proxy...

The current workaround was defining different backends in different binaries.

i.e. ccache had one set (small), and rpc-server had one set (loaded statically).

@jrosdahl
Member Author

I would like to see them "included" by default, otherwise I think they will just be unconfigured and uninstalled...

But I suppose that is already happening*, so it wouldn't change much from the current situation either way?

Yes. As long as HTTP and Redis backends are kept in the ccache source tree, http and redis support would be just as enabled or disabled as they are with the current backend model.

It seems unlikely that anything will replace NFS, at least for the enterprise.

Why do you believe that? And do you mean that this has any implications on how non-file ccache backends should work?

It seemed like a simpler solution, even if it only solved half the problem (making life easier when not using it)

My thinking was that there was room for both options: loading some plugins and setting up a storage backend proxy...

OK. I think that sounds more complex than my proposal, not simpler.

The current workaround was defining different backends in different binaries.

i.e. ccache had one set (small), and rpc-server had one set (loaded statically).

Right. Just to be clear: what I'm trying to describe in this issue is a design that I feel would be a "real" solution, not a workaround.

@afbjorklund

This comment was marked as off-topic.

GMNGeoffrey added a commit to iree-org/iree that referenced this issue Nov 30, 2022
This connects our CMake builds to a [ccache](https://ccache.dev/)
hosted in a GCS bucket. `ccache` newly (ish) supports using remote
storage for the cache! Currently it only supports Redis, FTP, and HTTP.
HTTPS is *not* supported right now, but there are plans to add an HTTPS
backend, as well as potentially a direct GCS backend (see
ccache/ccache#1214).

I think this adds a little bit of overhead for the network requests,
potentially increasing the time for building with a completely cold 
cache. 

An example `build_all` job with a completely cold cache took 13.2
minutes for the entire job, 10 minutes for just the build step, of
which 6.1 minutes was spent in the actual `cmake --build` command (not
including builds of the `install` or `iree-test-deps` targets, which
don't involve building C++):
https://github.com/iree-org/iree/actions/runs/3562697821/jobs/5984663663

Going through that commit's ancestors on the main branch, this looks
like it's adding about 30±30 seconds to the build, using the
statistical technique of "eyeballing".

We get wins on the flip side though, where with a fully cached build,
the times are 6.3m, 3.8m, 1.6m.

The impact is even bigger with asan, where we see the same ~50%
improvement on the already-slower build.

Unfortunately, since ccache is a language-specific cache, we can't
do the same trick with all the test artifacts.

The lack of HTTPS support does present somewhat of a problem because
GCP doesn't allow using unsecured HTTP for many API access scopes. I
ran into trouble with this when trying to get things to work locally
because the local gcloud credentials for a user account usually have
very broad scope (see discussion in
ccache/ccache#1001). But it *does* work fine on
our GCP VMs since those service accounts have much more limited
permissions. Luckily, we don't actually want users writing to the
cache, so this mostly just impacted me setting it up.

I also tried
[sccache](https://github.com/mozilla/sccache), which has a GCS backend,
but configuring the backend locally was pretty janky (see
mozilla/sccache#144 (comment)).
I ultimately went with ccache since it's the much more established
project and it seems like there's quite a bit of design work going in
to making it work well.

ccache also supports two caching layers (indeed this is the standard
setup), so devs could make use of the remote cache by setting a
single config/env variable to point at it and continue using their
local ccache as well. This will of course only work as long as their
local machine is sufficiently similar to the docker containers or they
choose to build within docker containers.

Co-authored-by: Scott Todd <[email protected]>
@afbjorklund
Contributor

afbjorklund commented Jan 23, 2023

Ccache communicates with the helper process over the socket using some yet to be defined but simple protocol.

  • The protocol should ideally be simple enough that no special library is required.
    This makes it possible to implement the backend in any language.

My suggestion was to use msgpack, as an alternative to jsonrpc or protobuf.

  1. msgpack in "rpclib": https://github.com/rpclib/rpclib

    • not as well known, but used by redis and fluentd
  2. json in "packio": https://github.com/qchateau/packio

    • binary data is awkward (base64?); requires C++17
  3. protobuf in "grpc": https://github.com/grpc/grpc

    • requires code generation and a complex toolchain

It didn't require any special library, just a header-only implementation (see the list of supported languages).

By defining custom serialization for the ccache classes*, it was quite efficient to use.

  • Digest

  • util::Bytes

  • nonstd::span<const uint8_t>


  • Unix sockets are also supported on Windows these days.

It seems like using boost::asio is the standard solution for the actual local sockets?

Looking forward to seeing the new implementation, here was the old PoC one that I did:

@tru
Contributor

tru commented Mar 21, 2023

Hi!

I came upon this issue when thinking about something similar. I recently added support for our internal CDN/cache to ccache, which works just fine, but it is not something we could or would upstream. There is also the problem that our init cost is pretty high.

So I was thinking that something in line with the suggestion above would solve both of my problems:

  • I could create an external integration to our CDN without having to maintain a ccache fork.
  • We could just init once for a build instead of once per ccache invocation.

For Windows: it seems like AF_UNIX is supported in the latest versions of Windows 10 and in Windows 11. That seems like it should be usable, but I have never used it myself. Named pipes are the other option.

I can vouch for rpclib though; it was initially developed by a former co-worker of mine and I know it's pretty solid. The upside of this approach is that the socket abstraction lives in rpclib instead of us having to implement all of that ourselves, which can get pretty messy.

Let me know if there is anything I can do to help push this initiative forward.

@tru
Contributor

tru commented Apr 6, 2023

A small update from our side here:

I wrote a small webserver using httplib that integrates with our internal CDN. With a few simple methods, I now have basically what we outlined above, but over TCP/HTTP instead of a Unix socket.

This works fine, and performance seems to be decent since the big latency in my case is pulling from the CDN anyway. Just an option to consider, since it was really easy to do, and a "skeleton" using httplib could easily be developed for this purpose.
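The shape of such an HTTP helper can be sketched with nothing but a standard library. This is an illustrative toy, not tru's actual httplib server: the class names are invented, and an in-memory dict stands in for the real CDN that the helper would forward to.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory "remote storage" standing in for the real CDN.
STORE = {}

class CacheHandler(BaseHTTPRequestHandler):
    """Serves a plain GET/PUT interface keyed on the request path."""

    def do_GET(self):
        data = STORE.get(self.path)
        if data is None:
            self.send_response(404)  # cache miss
            self.end_headers()
        else:
            self.send_response(200)  # cache hit: return the stored object
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        STORE[self.path] = self.rfile.read(length)
        self.send_response(201)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging to stderr
```

Running `HTTPServer(("127.0.0.1", 8080), CacheHandler).serve_forever()` and pointing remote_storage at http://127.0.0.1:8080 would give the proxy shape described above, with the helper free to keep its own expensive CDN sessions alive across requests.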

@enihcam

This comment was marked as off-topic.
