Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpack TF plugins in a more atomic way. #33479

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mplzik
Copy link

@mplzik mplzik commented Jul 6, 2023

The original implementation of plugin cache handling unpacks plugin
archives in a way that can result in race conditions when multiple
terraform processes are accessing the same plugin cache. Since the
archives are being decompressed in chunks and output files are written
directly to the cache, we observed following manifestations of the issue:

  • text file busy errors if other terraform processes are already
    running the plugin and another process tries to replace it.
  • various plugin checksum errors triggered likely by simultaneous checksumming
    and rewriting the file.

This PR changes the zip archives with plugins are handled:

  1. Instead of writing directly to the target directory,
    installFromLocalArchive is now writing to a temporary staging
    directory prefixed with.temp that resides in the plugin cache dir
    to ensure this is on the same filesystem.
  2. After unpacking, the directory structure of the staging directory is
    replicated in the targetDir. The directories are created as needed
    and any files are moved using os.Rename. After this, the staging
    directory is removed.
  3. Since the temporary staging directories can be picked up by
    SearchLocalDirectory and make it fail in the situation when they're
    removed during directory traversal, the function has been modified to
    skip any entry that starts with dot.

Signed-off-by: Milan Plzik [email protected]

Fixes #31964

Target Release

1.5.x

Draft CHANGELOG entry

BUG FIXES

  • Concurrent write access to plugin cache directory has been fixed.

@hashicorp-cla
Copy link

hashicorp-cla commented Jul 6, 2023

CLA assistant check
All committers have signed the CLA.

@mplzik mplzik force-pushed the mplzik/tf-more-atomic-plugin-cache branch from 4c546ba to 6f7e411 Compare July 6, 2023 10:41
@mplzik mplzik marked this pull request as draft July 6, 2023 12:09
@mplzik mplzik force-pushed the mplzik/tf-more-atomic-plugin-cache branch from 6f7e411 to dabf85a Compare July 6, 2023 13:25
@mplzik mplzik marked this pull request as ready for review July 6, 2023 14:25
@crw crw added the enhancement label Jul 6, 2023
@crw
Copy link
Contributor

crw commented Jul 6, 2023

Thanks for this submission, I will raise it in triage.

@brandon-fryslie
Copy link

Oh please this would make me so happy, thanks

@crw
Copy link
Contributor

crw commented Jul 31, 2023

Thanks for the reminder, and thanks @mplzik again for the submission.

The review from triage is that this change would require more due diligence. One issue with this implementation is that it relies on os.rename being atomic, which it is not. This may make the concurrency issue worse on different platforms.

We also noticed that the documentation should call out more clearly that the plugin cache is not concurrency-safe.

I'll leave this PR open for discussion for the moment, but will likely close it after a time as the feeling was it would be better to redesign a concurrency-safe plugin cache, which would require more discussion and prioritization.

Thanks again for this submission!

@crw crw added the waiting-response An issue/pull request is waiting for a response from the community label Jul 31, 2023
@mplzik
Copy link
Author

mplzik commented Aug 4, 2023

@crw I definitely agree that this is not solution to all the problems -- hence my comment // On supported platforms, this should perform atomic replacement of the file., with supported meaning platforms where this is supported for os.Rename. Speaking of this, would hiding this change behind a (default off) flag would make this PR more acceptable? I understand that implementing a really atomic cache would be preferred, but likely also entails much bigger engineering time investment (and it hasn't been implemented for a longer time, suggesting it might not be trivial).

Let me also do a bit of an argument in favor of merging this PR from the risk assessment side.

The golang documentation is not going too far on describing os.Rename non-atomicity, but https://stackoverflow.com/questions/30385225/is-there-an-os-independent-way-to-atomically-overwrite-a-file goes a bit farther in describing the platform-specific behavior. Specifically (assuming files on the same volume):

I see two main differences:

  • The renaming approach uses more disk space temporarily, so some real-world setups that are on almost full disks might be tipped to exceeding available disk space. This might, however, happen also with increasing size of the providers.
  • In the windows case, the file might be deleted (and thus stop existing) for a short period of time if atomic rename is not supported. I believe this will trigger mostly the same issues that are manifesting now (although I can't provide more data to prove that -- our infrastructure is not using any Windows machines to run Terraform).

@joaocc
Copy link
Contributor

joaocc commented Aug 24, 2023

Hi. Just some 0.02€ from userland. We (and many others using terraform directly or via wrappers - in my case terragrunt) are having a terrible and very painfull time due to the concurrency issues, stopping us from moving ahead, and causing uncertainty in processes which should be quite trivial.
I would tend to agree with @mplzik in that, even if this is not 100% bulletprof, the fact that it is supposed to work much better on linux and (from what I understood) on Windows, means that a huge number of users should have their experience improved, and get back to using terraform in a productive and streamlined manner.
If in the future there is an "even better way" (TM) to improve the cache in more complex scenarios (like network file systems), then it would be also very welcomed, of course. But in the meantime, the benefits would already be rolled out.
Thanks

@crw crw removed the waiting-response An issue/pull request is waiting for a response from the community label Aug 24, 2023
@brandon-fryslie
Copy link

I'm failing to understand the rationale behind blocking a small, simple change that would fix the problem for the vast majority of users in favor of holding out for an unknown, unplanned bulletproof solution that (likely) no one is working on. I'm finding it difficult to believe this could have been fixed for months already and yet it just sits here getting stale.

You can always revert the PR when you implement the really badass perfect plugin cache that you're now committed to implementing. And if you're not committed to it, why block this? Can we at least get a timeline of when you expect to release the real fix?

@crw
Copy link
Contributor

crw commented Dec 12, 2023

First, apologies to @mplzik for the late response on this, the team appreciates your thoughtful response on this PR. The team is still concerned about edge cases with this solution. Speaking based on my own observations of how the maintainer team considers PRs, I think this is a difficult section of code to change from outside due to plugin cache management being in the critical path for every user.

@brandon-fryslie, to answer your question, the rationale is that the primary path to run Terraform is in sequence, not in parallel. The plugin cache is not concurrency safe. You are correct to assume no one is working on a more robust solution, as making the plugin cache safe for concurrent execution of Terraform is not currently a priority for the team. I'll re-raise this PR with our product manager to see if concurrency-safety for the plugin cache is something we can prioritize. Given that Terraform, as a whole, is not meant to be run in parallel, this likely has considerations beyond the plugin cache. However, it is always possible we will need to prioritize this as we add new features going forward.

Thanks for your feedback and continued interest in this feature.

@expnch
Copy link

expnch commented Dec 12, 2023

the primary path to run Terraform is in sequence, not in parallel.

Just to chime in - running Terraform in parallel (for different projects that use the same modules) is a huge part of our CI/CD workflow.

return authResult, fmt.Errorf("failed to create new directory: %w", err)
}

stagingDir, err := os.MkdirTemp(path.Dir(targetDir), ".temp")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we could get rid of the need to refactor the SearchLocalDirectory function if we wrote the files into a directory created by os.TempDir() instead of inside our cache? Any reason not to do that?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reason not to do that is that the temporary filesystem could be on a different device; therefore, os.Rename could fail. golang/go#13766

@crw crw added the waiting-response An issue/pull request is waiting for a response from the community label Jul 18, 2024
@wosiu
Copy link

wosiu commented Nov 19, 2024

@crw I can see "waiting-response" label added. What kind of response from the community is expected for this one to be merged? Is it only about this comment: #33479 (comment)?

Asking because if we're close to merge, then I guess we can find community support to finish this.

@crw
Copy link
Contributor

crw commented Nov 19, 2024

@wosiu: @liamcervante, who is a maintainer of Terraform, left a comment on the code. As far as I can tell, the author has not returned to comment on or change that section of code.

@mplzik
Copy link
Author

mplzik commented Nov 20, 2024

Hello there. The TF use-case I was dealing with when proposing this PR was tied to Atlantis. Meanwhile runatlantis/atlantis#3720 got merged, allowing us to work around this issue from different angle (and empirically, we have not noticed this issue since deploying the Atlantis change). As for the Atlantis use-case, I think this PR is no longer necessary.

If there is any interest in having this merged, I can try to give this PR a shot, but to be totally frank, my interest to contribute decreased significantly after the TF licensing changes -- hence the long delay and PR not updated.

@crw
Copy link
Contributor

crw commented Nov 20, 2024

Thanks for the update, @mplzik, and I understand your concerns.

@crw crw removed the waiting-response An issue/pull request is waiting for a response from the community label Nov 20, 2024
@crw
Copy link
Contributor

crw commented Nov 20, 2024

Given that update, I discussed with @liamcervante. If someone else in the community wants to pick this up and address Liam's comments (or if you can convince @mplzik to do it :), that submission would be considered for review. Thanks!

@mplzik
Copy link
Author

mplzik commented Nov 20, 2024

My primary concern -- once merged, would this/similar patch be applicable to OpenTofu without any legal consequences? I'm not a lawyer and thus, don't feel keen enough to dive into details of license compatibility.

@crw
Copy link
Contributor

crw commented Nov 21, 2024

Valid concern. I am working to get an answer, however, the most meaningful answer would likely come from an outside review of the Contributor License Agreement and the Business Source License.

@mplzik
Copy link
Author

mplzik commented Nov 21, 2024

Note that the PR was created before any TF licensing changes happened, which complicates the matters a bit farther. Also, thanks a lot for getting the answers. :)

pietrodn added a commit to pietrodn/terraform that referenced this pull request Nov 23, 2024
The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time.
This is a serious problem in CI workflows, as it forces developers to choose between initializing modules serially and using the cache, or initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth.

This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location.

This change is inspired by this PR: hashicorp#33479
@pietrodn
Copy link

Hi! I created an updated and simplified PR here: #36085

I did not include the suggested change by @liamcervante because the filesystem of the temporary directory may differ from the one of the provider cache. In that case, the system call to move the files will fail.

Instead, I'm creating a temporary directory inside the plugin cache directory.

pietrodn added a commit to pietrodn/terraform that referenced this pull request Nov 23, 2024
The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time.
This is a serious problem in CI workflows, as it forces developers to choose between initializing modules serially and using the cache, or initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth.

This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location.

This change is inspired by this PR: hashicorp#33479
@mplzik
Copy link
Author

mplzik commented Nov 23, 2024

@pietrodn looking at your PR, it looks it's fairly similar to mine -- if it's based on it, the licensing concerns might apply to it as well. Another issue is that the original comments from @crw (#33479 (comment)) also stated that os.Rename is behaving diffently on different platforms and thus refactor might introduce some unexpected behavior. My proposal there was to hide this behind a flag (which I don't know whether is still acceptable). There might be some discussion left before this PR is mergeable (at least that is my interpretation of the discussion).

Also, I'm not able to sufficiently test PR anymore, so we're left with anything that CI/CD might detect.

The original implementation of plugin cache handling unpacks plugin
archives in a way that can result in race conditions when multiple
terraform processes are accessing the same plugin cache. Since the
archives are being decompressed in chunks and output files are written
directly to the cache, we observed following manifestations of the issue:

- `text file busy` errors if other terraform processes are already
  running the plugin and another process tries to replace it.
- various plugin checksum errors triggered likely by simultaneous checksumming
  and rewriting the file.

This PR changes the zip archives with plugins are handled:

1. Instead of writing directly to the target directory,
   `installFromLocalArchive` is now writing to a temporary staging
   directory prefixed with`.temp` that resides in the plugin cache dir
   to ensure this is on the same filesystem.
2. After unpacking, the directory structure of the staging directory is
   replicated in the `targetDir`. The directories are created as needed
   and any files are moved using `os.Rename`. After this, the staging
   directory is removed.
3. Since the temporary staging directories can be picked up by
   `SearchLocalDirectory` and make it fail in the situation when they're
   removed during directory traversal, the function has been modified to
   skip any entry that starts with dot.

Signed-off-by: Milan Plzik <[email protected]>
@mplzik mplzik force-pushed the mplzik/tf-more-atomic-plugin-cache branch from dabf85a to bd05f55 Compare November 23, 2024 19:33
@mplzik mplzik requested a review from a team as a code owner November 23, 2024 19:33
@pietrodn
Copy link

I'm looking at how uv (a modern Python package manager) handles its global cache of Python packages.

Apparently, they use a temporary file and then move it into place with a rename with retry function, that is a regular rename syscall for POSIX, and a rename with exponential backoff for Windows (ref).

Here is the higher-level function to download a wheel into a temporary directory and call the persist() function with the exponential backoff file rename.

@pietrodn
Copy link

I've implemented the rename-with-retry method in a second commit, let me know if you like that as a solution to the Windows issue.

@mplzik
Copy link
Author

mplzik commented Nov 24, 2024

@pietrodn reading closely Microsoft documentation, this approach might reduce the chances of race conditions happening, but won't fix the issue -- IIUC the move operation on Windows might not be atomic under the hood and any terraform invocations running in parallel might still catch an incomplete file (e.g. especially during the exponential backoff phase) -- I believe this is why #33479 (comment) suggested refactoring the cache handling instead of using this code directly (and was my second reason why I stopped updating this PR -- @crw, was there any change on the opinion of refactoring the plugin cache handling code?).

Ideally, the cache should have a lockfile (or lockfiles) to avoid any concurrent updates, but that would be a bit bigger undertaking.

@pietrodn
Copy link

I agree, but I don't see how merging your PR or mine would make Windows users worse off. Currrently, they are already forced to either not use the plugin cache or init their modules sequentially. The change would, at least, fix the issue for the other platforms.

@mplzik
Copy link
Author

mplzik commented Nov 24, 2024

Neither of the PRs fixes the problem 100% as it is; your PR statistically improves the chances of problems not happening on Windows as well. Not long ago this PR was in a state where (to my best understanding) this was a root of the technical hurdles to accept this PR.

To be totally frank -- I'm not sure why are you opening another PR with code that is mostly based on this one. I wouldn't object if I completely abandoned the PR and wouldn't be answering, but I am answering the PR and discussing things and trying to find out reasonable technical and legal compromises. Non-coordinated parallel efforts waste human time.

And, lastly -- totally non-technical note: Having invested some time to dig through the Terraform codebase and fix this (despite not moving forward with the PR, thinking it's not acceptable and got license-wise complicated), from my point of view this looks to me as having this PR forked and submitted as your own -- which doesn't sound like a fair play. Chances are this might be just an accident, you starting your work on this before noticing my comments, and that's totally understandable, but let's figure a proper way to fix this (and give proper credits to all people involved along the way).

@pietrodn
Copy link

This feature is something I need for my work as it would greatly speed up some CI processes.
I saw this PR being not worked on, apparently only blocked by this comment by @liamcervante.

@crw then wrote:

If someone else in the community wants to pick this up and address Liam's comments (or if you can convince @mplzik to do it :), that submission would be considered for review. Thanks!

So I picked it up and I did it. Since I can't push to your repo, I had no other choice that doing my own. While working on it I realized that using the temporary file system would not work (as it would break Os.Rename() on different file systems). Since I had already done the work and tested it, I thought it would be useful to push a rebased and updated version of your change so that it could be merged.

I never intended to breach your intellectual property or to take credit. In all honesty, the contribution I did is trivial and I don't even care about attribution (or licensing of this change in general). I just care about this problem being fixed for Linux.

In your opinion, what's the best path forward to have a simple fix merged on Terraform mainline? I can commit a few hours of my time on trying to solve the problem. If you prefer that I close my PR and we work only on this, so be it—it's not a problem. 😄

@mplzik
Copy link
Author

mplzik commented Nov 24, 2024

Hey, first of all, we're all here to fix the annoying TF bug :-) . Apologies if I sounded too grumpy (getting rid of bad cold); that wasn't something I wanted to see in the first place. I don't have a best way forward to have this merged due to objections expressed by hashicorp folks earlier in the comments.

Again, I'd like to wait for @crw's answer, but I did a small research to make the call about what the requirements are:

  • as for concerns from Unpack TF plugins in a more atomic way. #33479 (comment), IIUC either of our PRs can satisfy that by having an explicit opt-in flag or envvar. That way, this PR wouldn't change the critical path unless the end users opt in explicitly. Ideally, a couple of if os.GetEnv("OPT_IN_ENV") { /* fixed code */ else { /* original code */ } might do the trick, although it's not nice.
  • a more reasonable would probably be to take a step back and use filesystem-level locking and use lockfiles or similar mechanism to ensure that there's only one process writing the file and any process only reads the file once it's been written completely. This would avoid any race conditions inherently. I'm afraid that despite of this being a more (academically) valid solution, there would still need to be some kind of opt-in.
  • get more clear idea from @crw on any other preferred course action from Hashicorp's side (w.r.t. Unpack TF plugins in a more atomic way. #33479 (comment)).

Since this is a PR older than the license switch, I'd love to be able to contribute this to opentofu as well, if possible -- that's I mentioned my license concerns. It's not a showstopper (I don't want to block PR that's publicly visible either way), but I think users of both projects deserve to have this fixed.

@pietrodn
Copy link

I also thought about file locks. The Terraform codebase has already an implementation of file locks (for both UNIX and Windows!), so it could be feasible to extract it and use that.

@mplzik
Copy link
Author

mplzik commented Nov 24, 2024

Nice! One of my main concerns here was that introducing an additional library (I found https://github.com/danjacques/gofslock) might again get problematic license-wise and a DIY solution would sidetrack the fix quite a lot.

@pietrodn
Copy link

pietrodn commented Nov 24, 2024

In #36085 I wrote a proof of concept implementation for the file locks. I ran it on a Terraform module monorepo that gave trouble when initializing with a shared cache, and the file lock seems to solve the problem (the file rename alone didn't).

Perhaps the tryFileLock mechanism needs to be made more robust but it's there to take some early feedback. :)

@crw
Copy link
Contributor

crw commented Dec 2, 2024

Thanks for the continued thought on this issue. Unfortunately I do not have an update to the license question, the person (lawyer) who can offer an assessment on the IP implications is on a long vacation, so we are waiting his return.

In the meantime I will bring this PR and the associated new PRs up once again in triage to see if there is any change on willingness to review. Thanks!

@mplzik
Copy link
Author

mplzik commented Dec 3, 2024

Hello @crw; I believe #36085 is in a good way to make this PR obsolete and has drifted from this PR significantly enough for @pietrodn to make the licensing call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow multiple Terraform instances to write to plugin_cache_dir concurrently
10 participants