
New storage backend model #1214

Open
jrosdahl opened this issue Nov 7, 2022 · 9 comments
Labels
feature New or improved feature

Comments

@jrosdahl
Member

jrosdahl commented Nov 7, 2022

Background

Ccache currently has file, HTTP and Redis remote storage backends. The file and HTTP backends do not depend on any external libraries. The Redis backend depends on the small and ubiquitous Hiredis library. The remote storage backends are part of the ccache source tree and are compiled and linked statically with the ccache executable.

It would be very nice to support more protocols, like HTTPS (#890, #894), Redis over TLS (#902), Azure Blob Storage (#1152), AWS S3 (#1201), Google Cloud Storage (in case the HTTP/HTTPS backend does not suffice) and other cloud services and custom backends.

My approach has been to start out with only bundling backends that have no external dependencies, or only external dependencies that are ubiquitous and small enough. One reason for this is that I want to keep the startup of the ccache executable fast. For instance, linking with libcurl makes the startup a factor 4 slower on my system. Another aspect is that I would prefer not to have to maintain code that I can't easily test myself, such as backends for various cloud services. It would be much better if the people who are interested in a backend are the ones who maintain the code. (It's currently only me who is maintaining ccache and my spare time is not exactly abundant.) And a third aspect is one of distribution: I would like to be able to distribute a ccache package (for instance as part of a Linux distribution) that does not depend on libraries that are not needed for the basic use case (i.e., not using remote storage) and then have support for different remote storage backends in optional add-on packages. This is partly why I have been reluctant to add optional (at compile time) HTTPS support since I want a solution that does not depend on compile-time choices.

Another problem with the current backend framework is that ccache can't keep connections alive, and thus can't reuse sessions that are costly to set up.

Proposal

As mentioned in #894 (reply in thread), I propose that we make ccache automatically start a long-lived protocol-specific helper process (if not already started) and communicate with it over a Unix socket.

Here is a rough design sketch of how it could work, taking HTTPS as an example protocol:

  1. Say that ccache has been configured with remote_storage = https://user:[email protected]/path|param=value.
  2. Ccache connects to a Unix socket named something like ${CACHE_TEMPDIR}/backend-<name>.sock where <name> is a unique hash of the URL and applicable parameters.
    • This makes it easy to handle configuration changes: a new helper process will be started for the new configuration and the old helper process will terminate itself after a while.
  3. If the connection is refused or the socket doesn't exist:
    1. Ccache looks for an executable called ccache-backend-https in some (configurable) libexec location. Maybe also check in $PATH?
    2. Ccache starts ccache-backend-https as a background (daemon) process and passes the socket path, the URL and other configuration as environment variables.
    3. The helper process creates the socket and starts accepting connections to it.
    4. The helper process exits when it has been idle for some time. (10 minutes? 1? Could be configurable.)
  4. Ccache communicates with the helper process over the socket using some yet to be defined but simple protocol.
    • The protocol should ideally be simple enough that no special library is required. This makes it possible to implement the backend in any language.
    • Unix sockets are also supported on Windows these days.
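Steps 2–3 above could be sketched roughly as follows. This is purely illustrative, not ccache's actual code: the hashing scheme, the environment variable names, the executable lookup and the retry loop are all assumptions.

```python
import hashlib
import os
import socket
import subprocess
import time

def socket_path(cache_tempdir, url, params):
    # Step 2: derive a unique socket name from the URL and parameters so that
    # a configuration change transparently targets a fresh helper process.
    key = url + "|" + "|".join(sorted(params))
    name = hashlib.sha1(key.encode()).hexdigest()[:16]
    return os.path.join(cache_tempdir, "backend-%s.sock" % name)

def _connect(path):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    return s

def connect_or_spawn(cache_tempdir, scheme, url, params):
    # Step 3: try to connect; on failure, start ccache-backend-<scheme> as a
    # daemonized helper and retry while it creates the socket.
    path = socket_path(cache_tempdir, url, params)
    try:
        return _connect(path)
    except OSError:
        pass
    env = dict(os.environ,
               CCACHE_BACKEND_SOCKET=path,  # hypothetical variable names
               CCACHE_BACKEND_URL=url)
    subprocess.Popen(["ccache-backend-" + scheme], env=env,
                     start_new_session=True)
    for _ in range(50):  # ~5 s grace period for the helper to come up
        try:
            return _connect(path)
        except OSError:
            time.sleep(0.1)
    raise RuntimeError("backend helper did not start")
```

Because the socket name is a pure function of the configuration, no coordination is needed between ccache processes: they all compute the same path, and stale helpers for old configurations simply idle out.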

Advantages:

  • The helper process can keep connections alive, thus amortizing the session setup cost and avoiding flooding the server with one connection per compilation.
  • The startup of the ccache executable is kept fast.
  • Backends can depend on any libraries since the time to start a helper process is not very important.
  • Backend implementations can be part of ccache's code tree, or part of another project with a different release cycle or maintainership.
  • Backends can be built, packaged and distributed separately.
  • Backends can be implemented in any language.
  • Apart from installing the ccache-backend-<protocol> executable, there is no need to install, configure, start and monitor a separate daemon process. Things will Just Work with the same remote_storage configuration as before.

If this is implemented, the existing HTTP and Redis backends would be converted to the new mechanism. The file backend would still be kept as is.
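To make the "yet to be defined but simple" protocol from step 4 concrete, here is one hypothetical shape for the framing: a single opcode byte followed by a length-prefixed payload, which any language can emit without a serialization library. The opcodes and layout are invented for illustration only.

```python
import struct

# Hypothetical opcodes: get, put, hit, miss, ok (not ccache's actual protocol).
OP_GET, OP_PUT, OP_HIT, OP_MISS, OP_OK = b"G", b"P", b"H", b"M", b"K"

def encode(op, payload=b""):
    # One opcode byte, a 4-byte big-endian payload length, then the payload.
    return op + struct.pack(">I", len(payload)) + payload

def decode(buf):
    # Returns (opcode, payload, remaining bytes not yet consumed).
    op = buf[0:1]
    (length,) = struct.unpack(">I", buf[1:5])
    return op, buf[5:5 + length], buf[5 + length:]
```

A framing this trivial keeps the door open for helper implementations in shell-script-adjacent languages, while still carrying binary cache objects without base64 overhead.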

@jrosdahl jrosdahl added the feature New or improved feature label Nov 7, 2022
This was referenced Nov 7, 2022
@afbjorklund
Contributor

If this is implemented, the existing HTTP and Redis backends would be converted to the new mechanism. The file backend would still be kept as is.

I think the HTTP and Redis backend should also be kept as-is (due to not requiring external libraries), but that the HTTPS and Rediss (TLS) backends would have a different way of loading so that they can depend on OpenSSL and similar libraries...

But I thought that it would use dlopen (.so files) for this, rather than RPC?

There were some thoughts about cmake implementation in #894 (comment)

@jrosdahl
Member Author

jrosdahl commented Nov 8, 2022

I think the HTTP and Redis backend should also be kept as-is (due to not requiring external libraries)

Could you expand a bit on why you think that would be a good idea?

I'm thinking that it would be better to focus on a unified http+https implementation and a unified redis+rediss implementation.

From my point of view, my proposal would solve all issues I'm aware of with the current framework (and I wish that I had thought of that approach in #414). But it's of course so far only an untested idea, so it will need some testing to see if it flies.

but that the HTTPS and Rediss (TLS) backends would have a different way of loading so that they can depend on OpenSSL and similar libraries...

But I thought that it would use dlopen (.so files) for this, rather than RPC?

Since dynamically loading code won't solve the problem with keeping sessions alive, I don't think that there is a need for a dlopen-based plugin system. What advantages do you see with doing it that way?

@afbjorklund
Contributor

I think the HTTP and Redis backend should also be kept as-is (due to not requiring external libraries)

Could you expand a bit on why you think that would be a good idea?

I would like to see them "included" by default, otherwise I think they will just be unconfigured and uninstalled...

But I suppose that is already happening*, so it wouldn't change much from the current situation either way?

* i.e. when using REDIS_STORAGE_BACKEND=OFF to disable the feature 😔

It seems unlikely that anything will replace NFS, at least for the enterprise.

Since dynamically loading code won't solve the problem with keeping sessions alive, I don't think that there is a need for a dlopen-based plugin system.

What advantages do you see with doing it that way?

It seemed like a simpler solution, even if it only solved half the problem (making life easier when not using it)

My thinking was that there was room for both options: loading some plugins and setting up a storage backend proxy...

The current workaround was defining different backends in different binaries.

i.e. ccache had one set (small), and rpc-server had one set (loaded statically).

@jrosdahl
Member Author

I would like to see them "included" by default, otherwise I think they will just be unconfigured and uninstalled...

But I suppose that is already happening*, so it wouldn't change much from the current situation either way?

Yes. As long as HTTP and Redis backends are kept in the ccache source tree, http and redis support would be just as enabled or disabled as they are with the current backend model.

It seems unlikely that anything will replace NFS, at least for the enterprise.

Why do you believe that? And do you mean that this has any implications on how non-file ccache backends should work?

It seemed like a simpler solution, even if it only solved half the problem (making life easier when not using it)

My thinking was that there was room for both options: loading some plugins and setting up a storage backend proxy...

OK. I think that sounds more complex than my proposal, not simpler.

The current workaround was defining different backends in different binaries.

i.e. ccache had one set (small), and rpc-server had one set (loaded statically).

Right. Just to be clear: what I'm trying to describe in this issue is a design that I feel would be a "real" solution, not a workaround.

@afbjorklund

This comment was marked as off-topic.

GMNGeoffrey added a commit to iree-org/iree that referenced this issue Nov 30, 2022
This connects our CMake builds to a [ccache](https://ccache.dev/)
hosted in a GCS bucket. `ccache` newly (ish) supports using remote
storage for the cache! Currently it only supports Redis, FTP, and HTTP.
HTTPS is *not* supported right now, but there are plans to add an HTTPS
backend, as well as potentially a direct GCS backend (see
ccache/ccache#1214).

I think this adds a little bit of overhead for the network requests,
potentially increasing the time for building with a completely cold 
cache. 

An example `build_all` job with a completely cold cache took 13.2
minutes for the entire job, 10 minutes for just the build step, of
which 6.1 minutes was spent in the actual `cmake --build` command (not
including builds of the `install` or `iree-test-deps` targets, which
don't involve building C++):
https://github.com/iree-org/iree/actions/runs/3562697821/jobs/5984663663

Going through that commit's ancestors on the main branch, this looks
like it's adding about 30±30 seconds to the build, using the
statistical technique of "eyeballing".

We get wins on the flip side though, where with a fully cached build,
the times are 6.3m, 3.8m, 1.6m.

The impact is even bigger with asan, where we see the same ~50%
improvement on the already-slower build.

Unfortunately, since ccache is a language-specific cache, we can't
do the same trick with all the test artifacts.

The lack of HTTPS support does present somewhat of a problem because
GCP doesn't allow using unsecured HTTP for many API access scopes. I
ran into trouble with this when trying to get things to work locally
because the local gcloud credentials for a user account usually have
very broad scope (see discussion in
ccache/ccache#1001). But it *does* work fine on
our GCP VMs since those service accounts have much more limited
permissions. Luckily, we don't actually want users writing to the
cache, so this mostly just impacted me setting it up.

I also tried
[sccache](https://github.com/mozilla/sccache), which has a GCS backend,
but configuring the backend locally was pretty janky (see
mozilla/sccache#144 (comment)).
I ultimately went with ccache since it's the much more established
project and it seems like there's quite a bit of design work going in
to making it work well.

ccache also supports two caching layers (indeed this is the standard
setup), so devs could make use of the remote cache by setting a
single config/env variable to point at it and continue using their
local ccache as well. This will of course only work as long as their
local machine is sufficiently similar to the docker containers or they
choose to build within docker containers.

Co-authored-by: Scott Todd <[email protected]>
@afbjorklund
Contributor

afbjorklund commented Jan 23, 2023

Ccache communicates with the helper process over the socket using some yet to be defined but simple protocol.

  • The protocol should ideally be simple enough that no special library is required.
    This makes it possible to implement the backend in any language.

My suggestion was to use msgpack, as an alternative to jsonrpc or protobuf.

  1. msgpack in "rpclib": https://github.com/rpclib/rpclib

    • not as well known, but used by redis and fluentd
  2. json in "packio": https://github.com/qchateau/packio

    • binary data is awkward (base64?); requires C++17
  3. protobuf in "grpc": https://github.com/grpc/grpc

    • requires code generation and a complex toolchain

It didn't require any special library, just a header-only implementation (see the list of supported languages).

By defining custom serialization for the ccache classes*, it was quite efficient to use.

  • Digest

  • util::Bytes

  • nonstd::span<const uint8_t>


  • Unix sockets are also supported on Windows these days.

It seems like using boost::asio is the standard solution for the actual local sockets?

Looking forward to seeing the new implementation, here was the old PoC one that I did:

@tru
Contributor

tru commented Mar 21, 2023

Hi!

I came upon this issue when thinking about something similar. I recently added support for our internal CDN/cache to ccache, which works just fine, but it is not something we could or would upstream. There is also the problem that our init cost is pretty high.

So I was thinking that something in line with the suggestion above would solve both of my problems:

  • I could create an external integration to our CDN without having to maintain a ccache fork.
  • We could just init once for a build instead of once per ccache invocation.

For Windows: it seems like AF_UNIX is supported in the latest versions of Windows 10 and in Windows 11. That seems like it should be usable, but I have never used it myself. Named pipes are the other option.

I can vouch for rpclib though; it was initially developed by a former co-worker of mine and I know it's pretty solid. The upside of this approach is that the socket abstraction lives in rpclib instead of us having to implement all of that ourselves, which can get pretty messy.

Let me know if there is anything I can do to help push this initiative forward.

@tru
Contributor

tru commented Apr 6, 2023

A small update from our side here:

I wrote a small webserver using httplib that integrates with our internal CDN. With a few simple methods, I now have basically what we outlined above, but over TCP/HTTP instead of a Unix socket.

This works fine, and performance seems to be decent since the big latency in my case is pulling from the CDN anyway. Just an option to consider, since it was really easy to do, and a "skeleton" using httplib could easily be developed for this purpose.
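The shape of such an HTTP helper can be sketched with nothing but a standard library. This is an illustrative toy, not tru's actual httplib server: the class names are invented, and an in-memory dict stands in for the real CDN that the helper would forward to.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory "remote storage" standing in for the real CDN.
STORE = {}

class CacheHandler(BaseHTTPRequestHandler):
    """Serves a plain GET/PUT interface keyed on the request path."""

    def do_GET(self):
        data = STORE.get(self.path)
        if data is None:
            self.send_response(404)  # cache miss
            self.end_headers()
        else:
            self.send_response(200)  # cache hit: return the stored object
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        STORE[self.path] = self.rfile.read(length)
        self.send_response(201)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging to stderr
```

Running `HTTPServer(("127.0.0.1", 8080), CacheHandler).serve_forever()` and pointing remote_storage at http://127.0.0.1:8080 would give the proxy shape described above, with the helper free to keep its own expensive CDN sessions alive across requests.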

@enihcam

This comment was marked as off-topic.
