Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(custom-resources): After failed update of AwsCustomResource, log stream ID is passed to onDelete, causing it to get stuck #30012

Open
Booligoosh opened this issue Apr 30, 2024 · 3 comments
Labels
@aws-cdk/custom-resources Related to AWS CDK Custom Resources bug This issue is a bug. effort/medium Medium work item – several days of effort p2

Comments

@Booligoosh
Copy link
Contributor

Describe the bug

I am creating an AWS custom resource that requires replacement every time, meaning the onCreate & onUpdate both return a physicalResourceId and don't require one as input, however the onDelete does require one as input using new cr.PhysicalResourceIdReference():

new cr.AwsCustomResource(this, 'aws-custom', {
  onCreate: {
    service: '...',
    action: '...',
    parameters: {
      text: '...',
    },
    physicalResourceId: cr.PhysicalResourceId.fromResponse('...'),
  },
  onUpdate: {
    service: '...',
    action: '...',
    parameters: {
      text: '...',
    },
    physicalResourceId: cr.PhysicalResourceId.fromResponse('...'),
  },
  onDelete: {
    service: '...',
    action: '...',
    parameters: {
      text: '...',
      resourceId: new cr.PhysicalResourceIdReference(),
    },
  },
  policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
    resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE,
  }),
})

When an update fails, the physical ID is changed to a log stream ID, eg. 2024/04/30/[$LATEST]xxxxxxxx. CFN then tries to call onDelete with this log stream ID as the resource ID, which obviously fails, meaning onDelete gets stuck.

Expected Behavior

When an update fails, the physical resource ID is not changed, and onDelete is called with the existing physical resource ID.

Current Behavior

When an update fails, the physical resource ID is changed to be the ID of a log stream, and onDelete is called with that log stream ID as the physical resource ID, causing it to fail.

Reproduction Steps

  1. Create an AWS Custom Resource like the example above, with new cr.PhysicalResourceIdReference() passed to onDelete.
  2. Deploy the resource
  3. Trigger an update of the resource that fails for some reason (eg. invalid format of parameter)
  4. Observe that onDelete is called with a log stream ID rather than the physical ID of the resource

Possible Solution

The bug comes from this line in the SDK v2 handler:

await respond(event, 'FAILED', e.message || 'Internal Error', context.logStreamName, {}, true);

And this line in the SDK v3 handler:

await respond(event, 'FAILED', e.message || 'Internal Error', context.logStreamName, {}, true);

When sending a FAILED event, context.logStreamName is passed as the physical resource ID to respond(), rather than passing physicalResourceId. I've been trying to work out the reasoning behind why it's like this and nothing has come to mind, I can't see any scenario where you want a failure log stream as the physical ID, so it could just be an accidental error.

The fix would be to pass physicalResourceId as the 4th parameter to respond() rather than passing context.logStreamName.

Additional Information/Context

No response

CDK CLI Version

2.115.0

Framework Version

No response

Node.js Version

20.11.1

OS

Windows 10 Enterprise

Language

TypeScript

Language Version

No response

Other information

No response

@Booligoosh Booligoosh added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 30, 2024
@github-actions github-actions bot added the @aws-cdk/custom-resources Related to AWS CDK Custom Resources label Apr 30, 2024
@pahud
Copy link
Contributor

pahud commented Apr 30, 2024

Thank you. We'll check this with the team.

@pahud pahud added investigating This issue is being investigated and/or work is in progress to resolve the issue. p2 effort/medium Medium work item – several days of effort p1 and removed needs-triage This issue or PR still needs to be triaged. p2 labels Apr 30, 2024
@pahud
Copy link
Contributor

pahud commented Apr 30, 2024

I will investigate this issue but I would appreciate if you have any working sample for me that I can simply run and verify in my account. I actually doubt that onDelete would be invoked when onUpdate fails.

In the case where the onUpdate operation for a CloudFormation custom resource fails, the
onDelete operation should not be invoked.

When a CloudFormation stack update fails, CloudFormation will attempt to roll back the stack to the previous, working state. This means that any changes made by the failed onUpdate operation will be reverted, and the custom resource will remain in its previous state.

Since the custom resource was not successfully updated, there is no need to delete the resource. The
onDelete operation will only be invoked if the entire custom resource is being deleted as part of the stack deletion process.

Can you tell me what made you believe that onDelete was invoked? Did you see any logs?

And this was already in the code base in 2.137.0 so I don't think this is the root cause:

await respond(event, 'FAILED', e.message || 'Internal Error', context.logStreamName, {});

Please share some minimal required code snippets with me so I can reproduce it in my account. Thanks.

@pahud pahud added p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed p1 investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Apr 30, 2024
@pahud
Copy link
Contributor

pahud commented Apr 30, 2024

By the way

I would highly recommend not assigning physicalResourceId like this for both onCreate and onUpdate because you can't guarantee they are always the same.

physicalResourceId: cr.PhysicalResourceId.fromResponse('...'),

Let's consider this scenario:

  1. Your onCreate was successful and you got the API response hence physicalResourceId was assigned.
  2. Now you tried to update that custom resource but failed. As it was failed, no expected response field returned, you ended up having different physicalResourceId assigned.
  3. Because now you have different physicalResourceId between onUpdate and onCreate, CFN would consider you are trying to replace the old resource with a new resource but failed. Now onDelete would be invoked trying to delete that failed onUpdate resource.
  4. This ended up always trying to create a new one to replace the created one and invoking the onDelete again.

With that being said, I would suggest to make the physicalResourceId decoupled from the API response like

physicalResourceId: cr.PhysicalResourceId.of('some-static-id');

I guess this would not trigger the unexpected behavior as I presume above.

Let me know if it works for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@aws-cdk/custom-resources Related to AWS CDK Custom Resources bug This issue is a bug. effort/medium Medium work item – several days of effort p2
Projects
None yet
Development

No branches or pull requests

2 participants