-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unpack TF plugins in a more atomic way. #33479
base: main
Are you sure you want to change the base?
Conversation
4c546ba
to
6f7e411
Compare
6f7e411
to
dabf85a
Compare
Thanks for this submission, I will raise it in triage. |
Oh please this would make me so happy, thanks |
Thanks for the reminder, and thanks @mplzik again for the submission. The review from triage is that this change would require more due diligence. One issue with this implementation is that it relies on We also noticed that the documentation should call out more clearly that the plugin cache is not concurrency-safe. I'll leave this PR open for discussion for the moment, but will likely close it after a time as the feeling was it would be better to redesign a concurrency-safe plugin cache, which would require more discussion and prioritization. Thanks again for this submission! |
@crw I definitely agree that this is not solution to all the problems -- hence my comment Let me also do a bit of an argument in favor of merging this PR from the risk assessment side. The golang documentation is not going too far on describing
I see two main differences:
|
Hi. Just some 0.02€ from userland. We (and many others using terraform directly or via wrappers - in my case terragrunt) are having a terrible and very painfull time due to the concurrency issues, stopping us from moving ahead, and causing uncertainty in processes which should be quite trivial. |
I'm failing to understand the rationale behind blocking a small, simple change that would fix the problem for the vast majority of users in favor of holding out for an unknown, unplanned bulletproof solution that (likely) no one is working on. I'm finding it difficult to believe this could have been fixed for months already and yet it just sits here getting stale. You can always revert the PR when you implement the really badass perfect plugin cache that you're now committed to implementing. And if you're not committed to it, why block this? Can we at least get a timeline of when you expect to release the real fix? |
First, apologies to @mplzik for the late response on this, the team appreciates your thoughtful response on this PR. The team is still concerned about edge cases with this solution. Speaking based on my own observations of how the maintainer team considers PRs, I think this is a difficult section of code to change from outside due to plugin cache management being in the critical path for every user. @brandon-fryslie, to answer your question, the rationale is that the primary path to run Terraform is in sequence, not in parallel. The plugin cache is not concurrency safe. You are correct to assume no one is working on a more robust solution, as making the plugin cache safe for concurrent execution of Terraform is not currently a priority for the team. I'll re-raise this PR with our product manager to see if concurrency-safety for the plugin cache is something we can prioritize. Given that Terraform, as a whole, is not meant to be run in parallel, this likely has considerations beyond the plugin cache. However, it is always possible we will need to prioritize this as we add new features going forward. Thanks for your feedback and continued interest in this feature. |
Just to chime in - running Terraform in parallel (for different projects that use the same modules) is a huge part of our CI/CD workflow. |
return authResult, fmt.Errorf("failed to create new directory: %w", err) | ||
} | ||
|
||
stagingDir, err := os.MkdirTemp(path.Dir(targetDir), ".temp") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we could get rid of the need to refactor the SearchLocalDirectory
function if we wrote the files into a directory created by os.TempDir()
instead of inside our cache? Any reason not to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A reason not to do that is that the temporary filesystem could be on a different device; therefore, os.Rename
could fail. golang/go#13766
@crw I can see "waiting-response" label added. What kind of response from the community is expected for this one to be merged? Is it only about this comment: #33479 (comment)? Asking because if we're close to merge, then I guess we can find community support to finish this. |
@wosiu: @liamcervante, who is a maintainer of Terraform, left a comment on the code. As far as I can tell, the author has not returned to comment on or change that section of code. |
Hello there. The TF use-case I was dealing with when proposing this PR was tied to Atlantis. Meanwhile runatlantis/atlantis#3720 got merged, allowing us to work around this issue from different angle (and empirically, we have not noticed this issue since deploying the Atlantis change). As for the Atlantis use-case, I think this PR is no longer necessary. If there is any interest in having this merged, I can try to give this PR a shot, but to be totally frank, my interest to contribute decreased significantly after the TF licensing changes -- hence the long delay and PR not updated. |
Thanks for the update, @mplzik, and I understand your concerns. |
Given that update, I discussed with @liamcervante. If someone else in the community wants to pick this up and address Liam's comments (or if you can convince @mplzik to do it :), that submission would be considered for review. Thanks! |
My primary concern -- once merged, would this/similar patch be applicable to OpenTofu without any legal consequences? I'm not a lawyer and thus, don't feel keen enough to dive into details of license compatibility. |
Valid concern. I am working to get an answer, however, the most meaningful answer would likely come from an outside review of the Contributor License Agreement and the Business Source License. |
Note that the PR was created before any TF licensing changes happened, which complicates the matters a bit farther. Also, thanks a lot for getting the answers. :) |
The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time. This is a serious problem in CI workflows, as it forces developers to choose between initializing modules serially and using the cache, or initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth. This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location. This change is inspired by this PR: hashicorp#33479
Hi! I created an updated and simplified PR here: #36085 I did not include the suggested change by @liamcervante because the filesystem of the temporary directory may differ from the one of the provider cache. In that case, the system call to move the files will fail. Instead, I'm creating a temporary directory inside the plugin cache directory. |
The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time. This is a serious problem in CI workflows, as it forces developers to choose between initializing modules serially and using the cache, or initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth. This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location. This change is inspired by this PR: hashicorp#33479
@pietrodn looking at your PR, it looks it's fairly similar to mine -- if it's based on it, the licensing concerns might apply to it as well. Another issue is that the original comments from @crw (#33479 (comment)) also stated that os.Rename is behaving diffently on different platforms and thus refactor might introduce some unexpected behavior. My proposal there was to hide this behind a flag (which I don't know whether is still acceptable). There might be some discussion left before this PR is mergeable (at least that is my interpretation of the discussion). Also, I'm not able to sufficiently test PR anymore, so we're left with anything that CI/CD might detect. |
The original implementation of plugin cache handling unpacks plugin archives in a way that can result in race conditions when multiple terraform processes are accessing the same plugin cache. Since the archives are being decompressed in chunks and output files are written directly to the cache, we observed following manifestations of the issue: - `text file busy` errors if other terraform processes are already running the plugin and another process tries to replace it. - various plugin checksum errors triggered likely by simultaneous checksumming and rewriting the file. This PR changes the zip archives with plugins are handled: 1. Instead of writing directly to the target directory, `installFromLocalArchive` is now writing to a temporary staging directory prefixed with`.temp` that resides in the plugin cache dir to ensure this is on the same filesystem. 2. After unpacking, the directory structure of the staging directory is replicated in the `targetDir`. The directories are created as needed and any files are moved using `os.Rename`. After this, the staging directory is removed. 3. Since the temporary staging directories can be picked up by `SearchLocalDirectory` and make it fail in the situation when they're removed during directory traversal, the function has been modified to skip any entry that starts with dot. Signed-off-by: Milan Plzik <[email protected]>
dabf85a
to
bd05f55
Compare
I'm looking at how Apparently, they use a temporary file and then move it into place with a rename with retry function, that is a regular rename syscall for POSIX, and a rename with exponential backoff for Windows (ref). Here is the higher-level function to download a wheel into a temporary directory and call the |
I've implemented the rename-with-retry method in a second commit, let me know if you like that as a solution to the Windows issue. |
@pietrodn reading closely Microsoft documentation, this approach might reduce the chances of race conditions happening, but won't fix the issue -- IIUC the move operation on Windows might not be atomic under the hood and any terraform invocations running in parallel might still catch an incomplete file (e.g. especially during the exponential backoff phase) -- I believe this is why #33479 (comment) suggested refactoring the cache handling instead of using this code directly (and was my second reason why I stopped updating this PR -- @crw, was there any change on the opinion of refactoring the plugin cache handling code?). Ideally, the cache should have a lockfile (or lockfiles) to avoid any concurrent updates, but that would be a bit bigger undertaking. |
I agree, but I don't see how merging your PR or mine would make Windows users worse off. Currrently, they are already forced to either not use the plugin cache or init their modules sequentially. The change would, at least, fix the issue for the other platforms. |
Neither of the PRs fixes the problem 100% as it is; your PR statistically improves the chances of problems not happening on Windows as well. Not long ago this PR was in a state where (to my best understanding) this was a root of the technical hurdles to accept this PR. To be totally frank -- I'm not sure why are you opening another PR with code that is mostly based on this one. I wouldn't object if I completely abandoned the PR and wouldn't be answering, but I am answering the PR and discussing things and trying to find out reasonable technical and legal compromises. Non-coordinated parallel efforts waste human time. And, lastly -- totally non-technical note: Having invested some time to dig through the Terraform codebase and fix this (despite not moving forward with the PR, thinking it's not acceptable and got license-wise complicated), from my point of view this looks to me as having this PR forked and submitted as your own -- which doesn't sound like a fair play. Chances are this might be just an accident, you starting your work on this before noticing my comments, and that's totally understandable, but let's figure a proper way to fix this (and give proper credits to all people involved along the way). |
This feature is something I need for my work as it would greatly speed up some CI processes. @crw then wrote:
So I picked it up and I did it. Since I can't push to your repo, I had no other choice that doing my own. While working on it I realized that using the temporary file system would not work (as it would break I never intended to breach your intellectual property or to take credit. In all honesty, the contribution I did is trivial and I don't even care about attribution (or licensing of this change in general). I just care about this problem being fixed for Linux. In your opinion, what's the best path forward to have a simple fix merged on Terraform mainline? I can commit a few hours of my time on trying to solve the problem. If you prefer that I close my PR and we work only on this, so be it—it's not a problem. 😄 |
Hey, first of all, we're all here to fix the annoying TF bug :-) . Apologies if I sounded too grumpy (getting rid of bad cold); that wasn't something I wanted to see in the first place. I don't have a best way forward to have this merged due to objections expressed by hashicorp folks earlier in the comments. Again, I'd like to wait for @crw's answer, but I did a small research to make the call about what the requirements are:
Since this is a PR older than the license switch, I'd love to be able to contribute this to opentofu as well, if possible -- that's I mentioned my license concerns. It's not a showstopper (I don't want to block PR that's publicly visible either way), but I think users of both projects deserve to have this fixed. |
I also thought about file locks. The Terraform codebase has already an implementation of file locks (for both UNIX and Windows!), so it could be feasible to extract it and use that. |
Nice! One of my main concerns here was that introducing an additional library (I found https://github.com/danjacques/gofslock) might again get problematic license-wise and a DIY solution would sidetrack the fix quite a lot. |
In #36085 I wrote a proof of concept implementation for the file locks. I ran it on a Terraform module monorepo that gave trouble when initializing with a shared cache, and the file lock seems to solve the problem (the file rename alone didn't). Perhaps the |
Thanks for the continued thought on this issue. Unfortunately I do not have an update to the license question, the person (lawyer) who can offer an assessment on the IP implications is on a long vacation, so we are waiting his return. In the meantime I will bring this PR and the associated new PRs up once again in triage to see if there is any change on willingness to review. Thanks! |
The original implementation of plugin cache handling unpacks plugin
archives in a way that can result in race conditions when multiple
terraform processes are accessing the same plugin cache. Since the
archives are being decompressed in chunks and output files are written
directly to the cache, we observed following manifestations of the issue:
text file busy
errors if other terraform processes are alreadyrunning the plugin and another process tries to replace it.
and rewriting the file.
This PR changes the zip archives with plugins are handled:
installFromLocalArchive
is now writing to a temporary stagingdirectory prefixed with
.temp
that resides in the plugin cache dirto ensure this is on the same filesystem.
replicated in the
targetDir
. The directories are created as neededand any files are moved using
os.Rename
. After this, the stagingdirectory is removed.
SearchLocalDirectory
and make it fail in the situation when they'reremoved during directory traversal, the function has been modified to
skip any entry that starts with dot.
Signed-off-by: Milan Plzik [email protected]
Fixes #31964
Target Release
1.5.x
Draft CHANGELOG entry
BUG FIXES