
Use cache before running "go list" #1511

Closed
linzhp opened this issue Dec 24, 2019 · 19 comments

@linzhp
Contributor

linzhp commented Dec 24, 2019

Is your feature request related to a problem? Please describe.
When Athens handles "list" or "latest" requests, it always runs "go list -m -versions" to get the list of versions (https://github.com/gomods/athens/blob/master/pkg/module/go_vcs_lister.go#L44). The "go list" command in turn reaches out to the VCS to get that list. This is not optimal because:

  1. It provides no protection in case the VCS is down.
  2. It introduces unnecessary traffic to the VCS. Versions are released infrequently, on the order of weeks for most packages, so most calls to list versions return the same result as the previous one.
  3. It slows down Athens: git ls-remote operations are often slow.

Describe the solution you'd like
We can have Athens cache the list of versions for each module. Upon receiving a list request for a specific module, Athens would always look up the version cache first, and return the list from the cache if the versions for that module were updated recently enough. Athens would only run go list -m -versions if there is no version cache for that module, or if the cache is older than a configurable age. Refreshing the list of versions could be configured as an async operation, i.e., it happens after returning the old list from the cache. When a VCS server is down, Athens would still serve a reasonably up-to-date list of versions to the client, while logging the unavailability of the VCS.

Describe alternatives you've considered
Alternatively, we can populate the module cache with all semantic versions of a module that "go list -m -versions" returns, and record the timestamp of the most recent update for the module. Then storage.List() would serve as the version cache. The downside is that the module cache may store some versions that users never requested.

Additional context
Privately hosted VCS servers may not be able to handle as much traffic as public hosting sites like GitHub. In addition, we need protection from any VCS issues.

We may be able to help implement this feature.

@marwan-at-work
Contributor

@linzhp that sounds good to me.

A couple of implementations I have in mind:

  1. The cache can live in the storage itself
  2. The cache can just be in memory using something like https://github.com/dgraph-io/ristretto (since we can memory-bind it)
  3. The cache can be a new cache-oriented storage type such as Redis or groupcache/memcache

Option 1 is the most straightforward but requires updating our storage interface and all of its implementations.

Option 2 sounds the easiest and also logical since list-data is discardable. However, memory management becomes an issue.

Option 3 is probably my least favorite although I'd be curious to know if anyone thinks that it has some benefits that are not obvious.

@xytan0056
Contributor

> The cache can just be in memory using something like https://github.com/dgraph-io/ristretto (since we can memory-bind it)

One downside of this is that it won't work in a distributed environment. Because an in-memory cache doesn't consult its peers, there's a chance that users see different results for the same request.

@linzhp
Contributor Author

linzhp commented Jan 14, 2020

That's a good point. Also, Option 2 means each node needs to request its own copy of the list from the VCS, which involves more VCS traffic than Option 1. However, updating the storage interface and all its implementations sounds scary...

@marwan-at-work
Contributor

One thing to realize is the following:

The "go list" command only runs in two endpoints: <import-path>/@v/list and <import-path>/@latest.

In other words, if you have a resolved go.mod file and you run go build/get/etc., go list should never be called.

Those two endpoints are only ever called when you are onboarding a new module to your system.

In a CI/CD environment, go list might only be called if you're installing a Go binary on the fly, something like go get github.com/static/analysis/vet -- but IMHO, even in those cases you should probably pin a specific version, such as go get github.com/static/analysis/[email protected].

Therefore, the remaining use case, as far as I can see, is when you're adding a new module during development, and in that case you might want the most up-to-date list, not an outdated-but-cached one.

This has mainly been the reason go list is not cached yet. At least personally, I rarely need go list to be cached, and by the time I do need go list, I certainly hope that it's not cached.

Furthermore, the TTL on go list has to be considerably short (no more than 5 minutes), because it can be a really annoying experience to introduce a new version of your module and not have it show up for more than 5 minutes.

I do recognize that it's a slow operation and caching it can lead to a nicer experience. But I'd love to hear a compelling reason for why speeding up this endpoint is a significant win.

@linzhp
Contributor Author

linzhp commented Jan 15, 2020

A bit of context:

We generate Go code on the fly at build time. When we generate Go packages from the proto files of Apache Mesos, the import path becomes github.com/apache/mesos/include/mesos. With Go 1.12, go mod tidy will download the latest version of github.com/apache/mesos without adding it to the go.mod file, because it doesn't contain any Go files. The next time we run go mod tidy, it still looks for the latest version. As a result, Athens keeps calling go list again and again. Because we run go mod tidy on CI jobs to check the sanity of our go.mod file, this results in heavy traffic to the VCS.

With Go 1.13, go mod tidy returns an error for github.com/apache/mesos/include/mesos, so we have to figure out a solution for handling these generated packages. Once we do, go mod tidy will no longer request the latest version of github.com/apache/mesos. We can wait until we upgrade to Go 1.13 and deploy our solution, then monitor the VCS traffic.

> Furthermore, the TTL on go list has to be considerably short (no more than 5 minutes), because it can be a really annoying experience to introduce a new version of your module and not have it show up for more than 5 minutes.

A new commit can take quite a while, maybe an hour, to become available from proxy.golang.org though. We can also make this cache feature configurable, so users can turn it off in their Athens instances.

@marwan-at-work
Contributor

@linzhp thanks for the context. The go mod tidy situation you described is very interesting. I don't have a full grasp of what's going on there, but I'd love to dig in a bit more, probably outside the context of this issue.

But to stay on topic, yes I imagine the cache will be configurable and most likely opt-in. What we can do is introduce it as an optional interface, similar to how http.ResponseWriter might also be an http.Flusher.

This way, we can check if a storage does implement a ListCacher and if so, we use it. If not, and the user has opted-in, we can fail or warn.
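The optional-interface check could look roughly like this (a sketch; the simplified Storage interface and the ListCacher name are assumptions for illustration, not Athens' actual types):

```go
package main

import "fmt"

// Storage is a simplified stand-in for Athens' storage backend interface.
type Storage interface {
	List(module string) ([]string, error)
}

// ListCacher is the hypothetical optional interface discussed above,
// analogous to how an http.ResponseWriter may also be an http.Flusher.
type ListCacher interface {
	CachedList(module string) ([]string, bool)
}

// memStorage is a toy backend that happens to implement both interfaces.
type memStorage struct{ versions map[string][]string }

func (m memStorage) List(mod string) ([]string, error) { return m.versions[mod], nil }
func (m memStorage) CachedList(mod string) ([]string, bool) {
	v, ok := m.versions[mod]
	return v, ok
}

// listVersions prefers the cache when the backend supports it,
// and falls back to the plain List call otherwise.
func listVersions(s Storage, mod string) ([]string, error) {
	if lc, ok := s.(ListCacher); ok {
		if v, hit := lc.CachedList(mod); hit {
			return v, nil
		}
	}
	return s.List(mod)
}

func main() {
	s := memStorage{versions: map[string][]string{"example.com/m": {"v1.0.0", "v1.1.0"}}}
	v, _ := listVersions(s, "example.com/m")
	fmt.Println(v) // prints "[v1.0.0 v1.1.0]"
}
```

A backend that never implements ListCacher still satisfies Storage unchanged, which is what makes this pattern attractive for an opt-in feature.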

@avivdolev

Another consideration: completely offline environments.
We described this in #1506.

> This way, we can check if a storage does implement a ListCacher and if so, we use it. If not, and the user has opted-in, we can fail or warn.

@marwan-at-work I think that even without implementing the ListCacher interface-to-be, these endpoints should return the local catalog of versions instead of an error when the VCS is unavailable, or at least have an optional configuration to do so.
That would enable pre-loading the storage manually.
Could this be implemented by calling the storage.Cataloger interface (if it exists) and manipulating its output?

@marwan-at-work
Contributor

marwan-at-work commented Feb 7, 2020

Sorry everyone, work has been busy. I'll get back to reading and answering as soon as I can.

But I just thought of an idea that would also unblock anyone here who needs "offline mode" or "partial responses":

You can create a side-car GOPROXY that you set "GoBinaryEnvVars" to point to.

The GOPROXY would do one of two things:

  1. For offline mode, never run "go list"; just always return an empty response. For "go mod download", return an error, because if something is not in storage and you're offline, there's no way to fetch it upstream.

  2. For partial answers, always run "go list", but return an empty array if "go list" fails.

Now the question becomes: is that solution too much to implement for users?

@linzhp this should definitely unblock you without needing an Athens fork, but would you find it too much to maintain and prefer a config option instead?

The implementation for the side-car GOPROXY should be fairly trivial as far as I can tell.

@linzhp
Contributor Author

linzhp commented Feb 7, 2020

After upgrading to Go 1.13 and properly handling packages like github.com/apache/mesos, we no longer have this performance issue, so we removed our custom logic related to the list API. Closing this one.

@linzhp linzhp closed this as completed Feb 7, 2020
@arschles
Member

@linzhp are you able now to get onto the mainline Athens branch?

@linzhp
Contributor Author

linzhp commented Feb 11, 2020

Not yet. We still have to fork in order to:

@arschles
Member

Got it. For #1450, I'll have another look and see if we can fix it faster. For metrics, if we can help, let me know. For your internal storage API, in the past we've tried to implement a generic storage driver: one attempt was based on gRPC and the other was an HTTP-based implementation. Would either of those help?

@linzhp
Contributor Author

linzhp commented Feb 11, 2020

Our internal storage API comes with its own Go client, so we implemented a storage.Backend that calls that client and added a case here. I am not sure if there is a way to plug in a new storage.Backend without changing Athens' codebase.

I will work with @xytan0056 to identify the gaps in metrics.

@xytan0056
Contributor

We currently use tally to emit metrics data directly to an M3DB server. However, the latter doesn't seem to have a compatible exporter. We can probably do some tweaks around our infra, though; we need some time to research.

@arschles
Member

arschles commented Feb 15, 2020

@linzhp we have had some folks try to build new & more generic backends so they could also run Athens with their specific backend without recompiling Athens. Tons of reading on this if you're interested: #1110, #1131, #1459, #1130, #1353. Somewhere in one of those PRs, we discussed a gRPC based API that Athens can use to performantly talk to a storage server.

If that would work for you, we can try to make something like that happen. I know that one of those PRs would also welcome it.

@linzhp
Contributor Author

linzhp commented Feb 17, 2020

I think it's a good idea to have a plugin architecture to support custom storage backends without forking Athens. Keep me posted on the progress.

@arschles
Member

@linzhp will do. We would like some details on what specific functionality you would need to be able to use it at your company. For example, if we did a straightforward HTTP API that matches the Go proxy download API, would that be enough for you?

@xytan0056
Contributor

xytan0056 commented Feb 20, 2020

@arschles #1131 is actually what we considered previously. However, when calling our internal storage over raw HTTP, we must set some specific parameters in headers or query params, which involves modifying Go code with pieces specific to our company. It's possible to make the headers Athens sends configurable, but I'm not sure how to do that right (at the least, make them customizable in the TOML config).
Another approach would be to use the S3 client directly, but then we'd need to hardcode the credentials in config.toml, which raises security concerns.

@arschles
Member

@xytan0056 what if you used environment variables to configure the S3 credentials?
