
aws-eks: Constructing an AwsAuth can completely break a cluster #28333

Open
jfly opened this issue Dec 11, 2023 · 8 comments
Labels: @aws-cdk/aws-eks, bug, p2

Comments

jfly commented Dec 11, 2023

Describe the bug

We recently added a new AwsAuth construct to our cdk app. We were shocked to find that this broke all our node groups (as the new AwsAuth construct overwrote the aws-auth ConfigMap from cdk's internally managed AwsAuth construct).

Expected Behavior

I expect CDK to enforce that there is exactly one AwsAuth construct per cluster.

Current Behavior

As mentioned above, the new AwsAuth construct clobbered the aws-auth ConfigMap managed by the cluster's internal AwsAuth construct.

Reproduction Steps

Here's a simple cdk app that declares a cluster, a nodegroup for that cluster, and an explicit AwsAuth for that cluster:

$ cat demo.py
import pprint

import aws_cdk as cdk
from aws_cdk import aws_ec2
from aws_cdk import aws_eks
from aws_cdk.assertions import Template


class DemoStack(cdk.Stack):
    def __init__(self, scope):
        super().__init__(scope, "DemoStack")

        # Creating a cluster implicitly creates an internal AwsAuth that
        # manages the aws-auth ConfigMap.
        cluster = aws_eks.FargateCluster(
            scope=self,
            id="demo-cluster",
            cluster_name="demo-cluster",
            version=aws_eks.KubernetesVersion.V1_24,
        )

        # The nodegroup's role gets registered in the aws-auth ConfigMap
        # via the cluster's internal AwsAuth.
        cluster.add_nodegroup_capacity(
            'demo-nodegroup',
            nodegroup_name='demo-nodegroup',
            instance_types=[aws_ec2.InstanceType('c6a.xlarge')],
        )

        # This second, explicit AwsAuth renders its own (empty) aws-auth
        # ConfigMap, clobbering the one above.
        aws_eks.AwsAuth(self, "problematic-aws-auth", cluster=cluster)


app = cdk.App()
demo_stack = DemoStack(app)

template = Template.from_stack(demo_stack)
custom_resources = template.find_resources("Custom::AWSCDK-EKS-KubernetesResource")
pprint.pprint(custom_resources)

Note how this produces 2 Custom::AWSCDK-EKS-KubernetesResource resources. These two resources will happily step on each other's toes, because each one applies its manifest with the (unfortunately necessary) Overwrite: true.

$ python demo.py
{'democlusterAwsAuthmanifest59F406F8': {'DeletionPolicy': 'Delete',
                                        'DependsOn': ['democlusterKubectlReadyBarrier409E0356'],
                                        'Properties': {'ClusterName': {'Ref': 'democlusterE73BD733'},
                                                       'Manifest': {'Fn::Join': ['',
                                                                                 ['[{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"aws-auth","namespace":"kube-system","labels":{"aws.cdk.eks/prune-c8e20f52ee1502bfdc03c85de68fda3a9cbb05b58d":""}},"data":{"mapRoles":"[{\\"rolearn\\":\\"',
                                                                                  {'Fn::GetAtt': ['democlusterfargateprofiledefaultPodExecutionRoleB482677B',
                                                                                                  'Arn']},
                                                                                  '\\",\\"username\\":\\"system:node:{{SessionName}}\\",\\"groups\\":[\\"system:bootstrappers\\",\\"system:nodes\\",\\"system:node-proxier\\"]},{\\"rolearn\\":\\"',
                                                                                  {'Fn::GetAtt': ['democlusterNodegroupdemonodegroupNodeGroupRole6A5CDA1C',
                                                                                                  'Arn']},
                                                                                  '\\",\\"username\\":\\"system:node:{{EC2PrivateDNSName}}\\",\\"groups\\":[\\"system:bootstrappers\\",\\"system:nodes\\"]}]","mapUsers":"[]","mapAccounts":"[]"}}]']]},
                                                       'Overwrite': True,
                                                       'PruneLabel': 'aws.cdk.eks/prune-c8e20f52ee1502bfdc03c85de68fda3a9cbb05b58d',
                                                       'RoleArn': {'Fn::GetAtt': ['democlusterCreationRole733E3675',
                                                                                  'Arn']},
                                                       'ServiceToken': {'Fn::GetAtt': ['awscdkawseksKubectlProviderNestedStackawscdkawseksKubectlProviderNestedStackResourceA7AEBA6B',
                                                                                       'Outputs.DemoStackawscdkawseksKubectlProviderframeworkonEvent705EBCF2Arn']}},
                                        'Type': 'Custom::AWSCDK-EKS-KubernetesResource',
                                        'UpdateReplacePolicy': 'Delete'},
 'problematicawsauthmanifestEB215835': {'DeletionPolicy': 'Delete',
                                        'DependsOn': ['democlusterKubectlReadyBarrier409E0356'],
                                        'Properties': {'ClusterName': {'Ref': 'democlusterE73BD733'},
                                                       'Manifest': '[{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"aws-auth","namespace":"kube-system","labels":{"aws.cdk.eks/prune-c8993b9de23d4ff03256c14b261e46b123d26fe727":""}},"data":{"mapRoles":"[]","mapUsers":"[]","mapAccounts":"[]"}}]',
                                                       'Overwrite': True,
                                                       'PruneLabel': 'aws.cdk.eks/prune-c8993b9de23d4ff03256c14b261e46b123d26fe727',
                                                       'RoleArn': {'Fn::GetAtt': ['democlusterCreationRole733E3675',
                                                                                  'Arn']},
                                                       'ServiceToken': {'Fn::GetAtt': ['awscdkawseksKubectlProviderNestedStackawscdkawseksKubectlProviderNestedStackResourceA7AEBA6B',
                                                                                       'Outputs.DemoStackawscdkawseksKubectlProviderframeworkonEvent705EBCF2Arn']}},
                                        'Type': 'Custom::AWSCDK-EKS-KubernetesResource',
                                        'UpdateReplacePolicy': 'Delete'}}

Possible Solution

I have read through #19218, which talks about a similar issue (the aws-auth ConfigMap getting unexpectedly clobbered), but that's all about using multiple tools to manage a cluster. In the scenario described here, we're using exactly 1 CDK app to manage our cluster, and IMO, CDK should protect us against this.

Some ideas for how to fix this:

  1. When constructing an AwsAuth for a cluster, check if there's already an AwsAuth instantiated for that cluster. If there is one already, crash.
  2. Don't expose the AwsAuth construct at all; just keep it as an internal implementation detail of Cluster. As I understand things, it's never something you'd want to use directly when you've declared your cluster in cdk (instead you'd want to interact with the underlying cluster.awsAuth attribute, as documented).
  3. (what we're going to do until this is addressed in cdk itself) Add a post-synthesis step that loops over all AwsAuth constructs in the tree and asserts that they're for different clusters (see the sketch after this list).
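
For the curious, here's a minimal sketch of what idea 3 could look like, built on the same assertions API used in the repro above (the helper name and the string-matching heuristic are just illustrative):

import aws_cdk as cdk
from aws_cdk.assertions import Template


def assert_single_aws_auth(stack: cdk.Stack) -> None:
    # Find every kubectl custom resource in the synthesized template and
    # record which cluster each aws-auth manifest targets.
    template = Template.from_stack(stack)
    resources = template.find_resources("Custom::AWSCDK-EKS-KubernetesResource")
    clusters = []
    for resource in resources.values():
        props = resource["Properties"]
        # The manifest may be a plain JSON string or an Fn::Join structure,
        # so stringify it before searching for the ConfigMap name.
        if '"name":"aws-auth"' in str(props.get("Manifest", "")):
            clusters.append(str(props["ClusterName"]))
    duplicated = {c for c in clusters if clusters.count(c) > 1}
    assert not duplicated, (
        f"More than one aws-auth manifest targets the same cluster: {duplicated}"
    )

Running assert_single_aws_auth(demo_stack) after synthesizing the demo app above would trip the assertion.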

Additional Information/Context

No response

CDK CLI Version

2.109.0 (build 941dc16)

Framework Version

No response

Node.js Version

v18.15.0

OS

N/A (this happens both on macOS and Linux)

Language

Python

Language Version

Python (3.10.6)

Other information

No response

jfly added the bug and needs-triage labels on Dec 11, 2023
The github-actions bot added the @aws-cdk/aws-eks label on Dec 11, 2023

pahud commented Dec 12, 2023

Thanks for the report.

Actually, every Cluster comes with a built-in AwsAuth, so you don't need to create a new one. Instead, you should update the existing one as described in the doc:

cluster.awsAuth.addMastersRole(role);
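
In Python (the language of the repro above), the equivalent would be along these lines (a sketch; the role here is a hypothetical one you want to map as a cluster admin):

from aws_cdk import aws_iam

# A hypothetical role to grant cluster-admin access to.
role = aws_iam.Role(self, "cluster-admin-role",
                    assumed_by=aws_iam.AccountRootPrincipal())
cluster.aws_auth.add_masters_role(role)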

Is there any reason you need to create another one for the existing cluster?

pahud added the p2 and response-requested labels and removed the needs-triage label on Dec 12, 2023

jfly commented Dec 12, 2023

Is there any reason you need to create another one for the existing cluster?

@pahud, no. For more context: we recently had a complete outage of one of our clusters because we deployed a diff adding a new AwsAuth construct to our cdk app. That definitely wasn't what we meant to do (as you said, we should have used cluster.awsAuth.addMastersRole), but mistakes happen, and IMO this is a very dangerous footgun for CDK to have.

I totally understand that it's always going to be possible for people to do dangerous things with an IaC tool (we could delete our cluster, for instance), but we are diligent about reviewing all code changes to our CDK app, and nobody at our company recognized that this diff would cause a cluster outage:

--- before
+++ after
@@ -14,3 +14,5 @@
             nodegroup_name='demo-nodegroup',
             instance_types=[aws_ec2.InstanceType('c6a.xlarge')],
         )
+
+        aws_eks.AwsAuth(self, "problematic-aws-auth", cluster=cluster)

Running cdk diff doesn't help either: you just see the addition of a custom resource, with no clue that it's going to do something very dangerous:

$ cdk diff
Stack DemoStack
Resources
[+] Custom::AWSCDK-EKS-KubernetesResource problematic-aws-auth/manifest/Resource problematicawsauthmanifestEB215835

In other words: it's totally ok for CDK to enable us to do dangerous things, but (IMO) those dangerous things need to at least look dangerous. For example, deleting a cluster is very obviously a dangerous thing, both when you look at the code diff and the cdk diff. As an example, if the AwsAuth class were instead renamed to something like DangerouslyClobberAwsAuthInYourClusterYouProbablyDoNotWantToConstructMeYourself, then we probably wouldn't have had our outage =)

That all said, I still think it would be better for CDK to just ensure that there is at most one AwsAuth construct per cluster. I think this would be a pretty straightforward change that would remove this footgun for other people. I'd be happy to send in such a PR if that sounds acceptable to you.

The github-actions bot removed the response-requested label on Dec 13, 2023

jfly commented Dec 26, 2023

@pahud, any update here?


pahud commented Dec 27, 2023

@jfly Totally agree with you.

Yes, it sounds good to have a PR as the guardrail. Feel free to submit your PR for that. Thanks.


jfly commented Jan 8, 2024

I just learned that this aws-auth ConfigMap is getting replaced with a new access entry and access policy mechanism: https://aws.amazon.com/blogs/containers/a-deep-dive-into-simplified-amazon-eks-access-management-controls/. We're going to put our effort into that, which should avoid the issue described here.
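
For anyone heading the same way, a rough sketch of what that might look like with the L1 CfnAccessEntry resource (this assumes a version of aws-cdk-lib recent enough to ship AWS::EKS::AccessEntry; the role and the cluster-admin policy choice here are illustrative):

from aws_cdk import aws_eks

# Grant an (illustrative) role cluster-admin via an access entry instead of
# touching the aws-auth ConfigMap.
aws_eks.CfnAccessEntry(
    self,
    "admin-access-entry",
    cluster_name=cluster.cluster_name,
    principal_arn=role.role_arn,
    access_policies=[
        aws_eks.CfnAccessEntry.AccessPolicyProperty(
            policy_arn="arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy",
            access_scope=aws_eks.CfnAccessEntry.AccessScopeProperty(type="cluster"),
        ),
    ],
)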


pahud commented Jan 9, 2024

Yes, it's related to #28588

jsamuel1 commented

I just got hit by this issue and lost a full day of engineering for an entire team: we had to repair all of the test clusters this deployed to before we realised it had broken everything.

Minimum viable fix: please add a warning in the documentation saying "Do not instantiate this class directly. Use the ICluster.awsAuth property" until you can prioritize deprecating the public constructor.

jsamuel1 commented

Of note: I was using CDK Blueprints for EKS to build the cluster in one stack, and my Google search for how to add aws-auth permissions for my Lambda from a related stack got me to the AwsAuth construct, rather than to the instance of the class on ICluster.
