Azure CLI task fails with AADSTS700024 after 60 minutes #28708

Open
jiasli opened this issue Apr 8, 2024 · 35 comments · May be fixed by #28778
Labels: Account (az login/account), Auto-Assign, Azure CLI Team, feature-request
Milestone: Backlog

Comments

@jiasli
Member

jiasli commented Apr 8, 2024

Acquiring access token with expired OIDC token fails with:

ERROR: AADSTS700024: Client assertion is not within its valid time range. Current time: 2024-04-05T23:01:54.2089203Z, assertion valid from 2024-04-05T22:40:41.0000000Z, expiry time of assertion 2024-04-05T22:50:41.0000000Z. Review the documentation at https://docs.microsoft.com/azure/active-directory/develop/active-directory-certificate-credentials

As the error indicates, the OIDC token is only valid for 10 minutes. Once it has been passed to az login via --federated-token, Azure CLI has no way to get a new OIDC token after it expires.

This is the designed v1 behavior of OIDC token support (#19853).

However, now that the Azure DevOps task AzureCLI@2 (microsoft/azure-pipelines-tasks#17633) and the GitHub Action azure/login@v2 (Azure/login#147) support OIDC token authentication, and workload identity federation is the recommended approach, this limitation is becoming more prevalent.

Possible solutions

  1. OIDC token provider such as Azure DevOps or GitHub should provide an option to control the expiry time of the OIDC token to make it at least as long as the task duration.
  2. Design and implement a v2 solution that uses a managed-identity-like interface which allows MSAL/Azure CLI to refresh OIDC token.


@yonzhan
Collaborator

yonzhan commented Apr 8, 2024

Refreshing the OIDC token is a feature request.

@microsoft-github-policy-service bot added the Azure CLI Team and question labels Apr 8, 2024
@yonzhan yonzhan added the feature-request label and removed the question label Apr 8, 2024
@yonzhan yonzhan added this to the Backlog milestone Apr 8, 2024
@jiasli
Member Author

jiasli commented Apr 10, 2024

Callback interface proposals

Different external identity providers (IdP) have different ways of retrieving the ID token:

I had a discussion with the MSAL team today and proposed two possible callback interfaces:

  1. Let each external IdP expose a callback command, such as getidtoken, that prints an ID token to stdout. Instead of passing --federated-token <ID token> to az login, users would pass --federated-token-callback getidtoken, so that Azure CLI and MSAL can actively retrieve a new ID token with getidtoken whenever the current one expires. This is very similar to how Azure Identity's AzureCliCredential retrieves access tokens from Azure CLI by running az account get-access-token in a subprocess.
  2. Like the GitHub Actions solution, define a managed-identity-like URL, such as ID_TOKEN_REQUEST_URL, that can be used to get an ID token.
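To make proposal 1 concrete, on GitHub Actions the callback could be a small command that prints a fresh ID token on stdout. This is a hypothetical sketch: --federated-token-callback is only proposed here and is not an existing az login option, and the function name get_id_token is made up. It relies on the ACTIONS_ID_TOKEN_REQUEST_URL and ACTIONS_ID_TOKEN_REQUEST_TOKEN variables and the api://AzureADTokenExchange audience used elsewhere in this thread, and assumes the job has the id-token: write permission.

```shell
#!/usr/bin/env bash
# Hypothetical "getidtoken" callback for GitHub Actions: prints a fresh ID token
# (a JWT) on stdout. Assumes the job has `permissions: id-token: write`, which is
# what makes GitHub set the two ACTIONS_ID_TOKEN_REQUEST_* variables.
get_id_token() {
  local audience="${1:-api://AzureADTokenExchange}"
  : "${ACTIONS_ID_TOKEN_REQUEST_URL:?ACTIONS_ID_TOKEN_REQUEST_URL is not set}"
  : "${ACTIONS_ID_TOKEN_REQUEST_TOKEN:?ACTIONS_ID_TOKEN_REQUEST_TOKEN is not set}"
  # The request URL already carries a query string, so the audience is appended with '&'.
  # The response is JSON containing "value":"<jwt>"; sed extracts it without needing jq.
  curl -fsS -H "Authorization: bearer ${ACTIONS_ID_TOKEN_REQUEST_TOKEN}" \
    "${ACTIONS_ID_TOKEN_REQUEST_URL}&audience=${audience}" \
    | sed -n 's/.*"value":"\([^"]*\)".*/\1/p'
}
```

With today's CLI the same function can only feed a one-off login, for example az login --service-principal -u <client-id> -t <tenant-id> --federated-token "$(get_id_token)" --output none. Under the proposal, Azure CLI/MSAL would invoke such a command by itself whenever the ID token expires.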

@jiasli
Member Author

jiasli commented Apr 10, 2024

Mitigation: Extend task duration to 60 minutes

Warning

This mitigation doesn't work with Azure CLI 2.59.0. See #28708 (comment).

ID token:       |----| 10 min
Access token 1: |------------------------| 60 min
Access token 2:          | 20 min: ERROR: ID token expired

An ID token lasts for 5 minutes on GitHub Actions and 10 minutes on Azure DevOps, but an access token lasts for 60 minutes.

When you run az login, Azure CLI only acquires access tokens for ARM, using https://management.core.windows.net//.default as the scope.

After the ID token expires, suppose you acquire an access token for another scope, for example:

az account get-access-token --scope https://kusto.kusto.windows.net//.default

Because the token cache doesn't contain an access token for that scope yet, Azure CLI/MSAL tries to get one using the ID token. As the ID token has expired, the command fails with AADSTS700024.

So, the mitigation is pretty straightforward: Acquire all access tokens before the ID token expires.

You have to know which scopes are used in your pipeline task and call az account get-access-token --scope ... immediately after az login. This makes Azure CLI/MSAL acquire access tokens for the specified scopes while the ID token is still valid and save them in the token cache.

For example:

  • Storage: az account get-access-token --scope https://storage.azure.com/.default --output none
  • Key Vault: az account get-access-token --scope https://vault.azure.net/.default --output none
  • Microsoft Graph: az account get-access-token --scope https://graph.microsoft.com//.default --output none
  • Kusto: az account get-access-token --scope https://kusto.kusto.windows.net//.default --output none
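The mitigation boils down to one az account get-access-token call per scope, run in the same job right after az login. A minimal sketch, assuming you replace the example scopes with the ones your task actually uses; prefetch_tokens is a hypothetical helper name. As the warning at the top of this comment says, this only helps on Azure CLI versions that actually reuse cached access tokens (not 2.59.0).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run right after `az login --federated-token ...` so that
# MSAL caches one access token per scope while the ID token is still valid.
prefetch_tokens() {
  local scope
  for scope in "$@"; do
    # --output none keeps the access token out of the job log.
    az account get-access-token --scope "$scope" --output none || return 1
  done
}

# Example with the Storage, Key Vault, Microsoft Graph and Kusto scopes listed above:
# prefetch_tokens \
#   https://storage.azure.com/.default \
#   https://vault.azure.net/.default \
#   https://graph.microsoft.com//.default \
#   https://kusto.kusto.windows.net//.default
```

Returning on the first failure makes the step fail while the ID token is still fresh, instead of surfacing the problem much later in the pipeline.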

Warning

Even though GitHub Actions can mask the access token as *** in az account get-access-token's output:

+ az account get-access-token
***
  "accessToken": "***",
  "expiresOn": "2024-04-10 14:11:25.000000",
  "expires_on": 1712758285,
  "subscription": "...",
  "tenant": "...",
  "tokenType": "Bearer"
***

You MUST specify --output none to make sure no access token is printed to any of your logs.

Subsequent commands using these scopes will then use the access tokens saved in the token cache, so they won't fail after the ID token expires. They will still fail after the access tokens expire (60 minutes).

@Kapsztajn

Kapsztajn commented Apr 10, 2024

I tried the provided mitigation, but the issue persists. Maybe I'm doing something wrong?
My workflow runs Node.js tests that verify connections to Service Bus. Since OIDC is used, I log in to Azure with the azure/login@v2 action:

    - name: Azure login
      uses: azure/login@v2
      with:
        client-id: ${{ env.AZURE_CLIENT_ID }}
        tenant-id: ${{ env.AZURE_TENANT_ID }}
        subscription-id: ${{ env.AZURE_SUBSCRIPTION_ID }}
        enable-AzPSSession: false

After that I added step to mitigate the issue:

    - name: Azure get token
      uses: azure/cli@v2
      with:
        inlineScript: |
          az account get-access-token --scope https://storage.azure.com/.default --output none
          az account get-access-token --scope https://servicebus.azure.net/.default --output none

But after ~10 minutes I'm still getting:

    AggregateAuthenticationError: ChainedTokenCredential authentication failed.
    CredentialUnavailableError: Please run 'az login' from a command prompt to authenticate before using this credential.
    CredentialUnavailableError: WorkloadIdentityCredential: is unavailable. tenantId, clientId, and federatedTokenFilePath are required parameters. 
          In DefaultAzureCredential and ManagedIdentityCredential, these can be provided as environment variables - 
          "AZURE_TENANT_ID",
          "AZURE_CLIENT_ID",
          "AZURE_FEDERATED_TOKEN_FILE". See the troubleshooting guide for more information: https://aka.ms/azsdk/js/identity/workloadidentitycredential/troubleshoot

Did I miss something? I use https://www.npmjs.com/package/@azure/service-bus

@mderriey

mderriey commented Apr 10, 2024

Thanks for the mitigation @jiasli.

However, I don't think I'm hitting the case where Azure CLI tries to acquire an access token for a different audience after the ID token has expired.

I'm fairly confident that the az commands I use only use the access token for ARM:

  • az account set
  • az deployment sub create
  • az deployment sub show
  • az webapp deployment slot swap
  • az webapp deployment source config-zip
  • az webapp start
  • az webapp stop

The general flow is:

  • Deploy an ARM template
  • Deploy binaries to an App Service staging slot
  • Swap slots
  • Stop App Service staging slot

The time it takes to swap slots varies greatly, however more than 5 minutes have always elapsed by the time it's done.

Now, what is strange is that stopping the slot sometimes works and sometimes doesn't, depending on how much time has passed since we ran azure/login.

To me, it sounds like the access token expires "quicker" than before.
Could that be?

Edit: I checked across many workflow runs, and to me it looks like the access token expires after 10 minutes.

@jiasli
Member Author

jiasli commented Apr 11, 2024

@Kapsztajn, I can successfully get an access token for https://servicebus.azure.net/.default locally which lasts for 4600s.

> az account get-access-token --scope https://servicebus.azure.net/.default
{
  "accessToken": "...",
  "expiresOn": "2024-04-11 13:57:35.000000",
  "expires_on": 1712815055,
  "subscription": "0b1f6471-1bf0-4dda-aec3-cb9272f09590",
  "tenant": "54826b22-38d6-4fb2-bad9-b7b93a3e9c5a",
  "tokenType": "Bearer"
}

Decoded claims:

  "iat": 1712810455,
  "nbf": 1712810455,
  "exp": 1712815055,
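To check these claims yourself, for an access token or for the ID token passed to --federated-token, you can base64url-decode the middle segment of the JWT. A minimal sketch, assuming GNU coreutils base64 and a compact (whitespace-free) JSON payload; jwt_claim and jwt_lifetime are hypothetical helpers, and the claim lookup is a simple sed match rather than a real JSON parser. Avoid printing real tokens in CI logs.

```shell
#!/usr/bin/env bash
# Print a numeric claim (e.g. iat, nbf, exp) from a JWT's payload.
# Assumes GNU base64 and a compact JSON payload; this is not a full JSON parser.
jwt_claim() {
  local token="$1" claim="$2" payload
  payload="${token#*.}"                               # drop the header segment
  payload="${payload%%.*}"                            # drop the signature segment
  payload="$(printf '%s' "$payload" | tr '_-' '/+')"  # base64url -> base64 alphabet
  case $(( ${#payload} % 4 )) in                      # restore stripped '=' padding
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d \
    | sed -n "s/.*\"${claim}\":\([0-9][0-9]*\).*/\1/p"
}

# Token lifetime in seconds: exp - iat.
jwt_lifetime() {
  echo $(( $(jwt_claim "$1" exp) - $(jwt_claim "$1" iat) ))
}
```

For the claims above, exp - iat = 1712815055 - 1712810455 = 4600 seconds, i.e. about 77 minutes.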

I am not entirely sure why this line is printed:

CredentialUnavailableError: Please run 'az login' from a command prompt to authenticate before using this credential.

Also, the Azure Service Bus client library for JavaScript didn't fail with AADSTS700024. I am not an expert in that SDK. Could you collect more details on which scope the SDK requests and why it fails with that error?

@jiasli
Member Author

jiasli commented Apr 11, 2024

@mderriey, this seems odd as all these operations are indeed ARM operations. Could you check the actual expiration time of the access token issued for ARM?

> az account get-access-token --scope https://management.core.windows.net//.default --query expiresOn --output tsv
2024-04-11 13:47:47.000000
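A small variation of this check reports how many minutes the access token has left, which makes an unexpectedly short lifetime easy to spot in a job log. A sketch assuming a CLI version whose output includes the epoch expires_on field (as in the output shown earlier in this thread); token_minutes_left is a hypothetical helper, and its optional second argument (the current time in epoch seconds) only exists to make the arithmetic easy to check.

```shell
#!/usr/bin/env bash
# Print whole minutes until the access token for a scope expires.
# $1: scope (defaults to ARM); $2: "now" in epoch seconds (defaults to the current time).
token_minutes_left() {
  local scope="${1:-https://management.core.windows.net//.default}"
  local now="${2:-$(date +%s)}"
  local expires_on
  expires_on="$(az account get-access-token --scope "$scope" --query expires_on --output tsv)" || return 1
  echo $(( (expires_on - now) / 60 ))
}
```

Right after az login, a value close to 60 is expected for ARM; a value near 10 would match what @mderriey observed.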

@iamrk04

iamrk04 commented Apr 11, 2024

Hi @Kapsztajn, the suggested mitigation didn't work for me either. It fetched a token with a reasonable expiry, but I saw the same error once the OIDC token expired after 5 minutes.

I propose a workaround by fetching the OID token every 4 mins to avoid the expiry. I was able to get this working and here is what I did: I inserted the following step in my workflow just before the step where this token expiry issue was popping:

      - name: Fetch OID token every 4 mins
        run: |
          while true; do
            token_request=$ACTIONS_ID_TOKEN_REQUEST_TOKEN
            token_uri=$ACTIONS_ID_TOKEN_REQUEST_URL
            token=$(curl -H "Authorization: bearer $token_request" "${token_uri}&audience=api://AzureADTokenExchange" | jq .value -r)
            az login --service-principal -u ${{ secrets.CLIENT_ID }} -t ${{ secrets.TENANT_ID }} --federated-token $token --output none
            # Sleep for 4 minutes
            sleep 240
          done &

Could you try this out and see if this works for you as well?

@mderriey

@mderriey, this seems odd as all these operations are indeed ARM operations. Could you check the actual expiration time of the access token issued for ARM?

> az account get-access-token --scope https://management.core.windows.net//.default --query expiresOn --output tsv
2024-04-11 13:47:47.000000

Good suggestion @jiasli , thanks.

Here's what I ran:

steps:
- name: Login to Azure
  uses: azure/login@v2
  with:
    client-id: ${{ env.oidcAppRegistrationClientId }}
    tenant-id: ${{ env.azureTenantId }}
    allow-no-subscriptions: true
    enable-AzPSSession: true

- name: Check token expiry
  shell: bash
  run: |
    echo "Current date: $(date '+%Y-%m-%dT%H:%M:%S')"
    echo "Token expiration: $(az account get-access-token --resource-type arm --query expiresOn --output tsv --debug)"
    echo "Token #2 expiration: $(az account get-access-token --resource-type arm --query expiresOn --output tsv --debug)"

And the output (debug output omitted):

Current date: 2024-04-11T06:57:14
Token expiration: 2024-04-11 07:57:14.000000
Token #2 expiration: 2024-04-11 07:57:14.000000

So the token is valid for 1 hour.

And both calls to az account get-access-token show this in the debug output, which I think confirms that the ARM token is cached and was originally acquired during az login:

DEBUG: msal.token_cache: event={
    "client_id": "***",
    "data": {
        "claims": "{\"access_token\": {\"xms_cc\": {\"values\": [\"CP1\"]}}}",
        "scope": [
            "https://management.core.windows.net//.default"
        ]
    },
    "environment": "login.microsoftonline.com",
    "grant_type": "client_credentials",
    "params": null,
    "response": {
        "access_token": "********",
        "expires_in": 3599,
        "ext_expires_in": 3599,
        "token_type": "Bearer"
    },
    "scope": [
        "https://management.core.windows.net//.default"
    ],
    "token_endpoint": "https://login.microsoftonline.com/<redacted>/oauth2/v2.0/token"
}

I'm not sure what happens, then...
I'll try removing the extra azure/login steps when I get some more time to see if the issue disappears.

Thanks again, let me know if I can perform some more testing if anything comes to mind.
If you'd be interested in the debug output, I could send that privately.

@jiasli
Member Author

jiasli commented Apr 11, 2024

Apologies for the confusion.

As I tested today, the mitigation I provided in #28708 (comment) stopped working in Azure CLI 2.59.0 because of an MSAL regression introduced in 1.27.0 (AzureAD/microsoft-authentication-extensions-for-python#127, AzureAD/microsoft-authentication-library-for-python#644), which Azure CLI 2.59.0 adopted (#28556).

This regression makes MSAL's ConfidentialClientApplication bypass msal_extensions.token_cache.PersistedTokenCache, so access tokens are no longer retrieved from the token cache. Instead, every command now retrieves a new access token from the AAD Security Token Service (STS), and each such request needs a valid ID token. So not only does the mitigation stop working, but even ARM commands fail with AADSTS700024 after the ID token expires.

I will work with MSAL on this issue with high priority.

Workaround

For now, please keep using service principal secret for authentication to get unblocked: https://github.com/marketplace/actions/azure-login#login-with-a-service-principal-secret

@smokedlinq

My question is why this has popped up as an issue recently. We've had pipelines run for well over 20 minutes before and never seen this. But within the last week, it seems any workflow using Azure CLI with OIDC federated auth is experiencing this issue.

@Kapsztajn

@iamrk04
It looks like your solution works, and I managed to run the tests normally (the pipeline ran for over 16 minutes). I added the code you provided between the Azure login and component test steps:

    - name: Azure login
      uses: azure/login@v2
      with:
        client-id: ${{ env.AZURE_CLIENT_ID }}
        tenant-id: ${{ env.AZURE_TENANT_ID }}
        subscription-id: ${{ env.AZURE_SUBSCRIPTION_ID }}
        enable-AzPSSession: false

    - name: Fetch OID token every 4 mins
      shell: bash
      run: |
        while true; do
          token_request=$ACTIONS_ID_TOKEN_REQUEST_TOKEN
          token_uri=$ACTIONS_ID_TOKEN_REQUEST_URL
          token=$(curl -H "Authorization: bearer $token_request" "${token_uri}&audience=api://AzureADTokenExchange" | jq .value -r)
          az login --service-principal -u ${{ env.AZURE_CLIENT_ID }} -t ${{ env.AZURE_TENANT_ID }} --federated-token $token --output none
          # Sleep for 4 minutes
          sleep 240
        done &

    - name: 'Run tests'
      shell: bash
      ...

I had to add shell: bash because without it I got errors with missing shell.

@mderriey

My question is why this has popped up as an issue recently. We've had pipelines run for well over 20 minutes before and never seen this. But within the last week, it seems any workflow using Azure CLI with OIDC federated auth is experiencing this issue.

@smokedlinq, in my case it's due to a new version of the GitHub-hosted runner image for ubuntu-latest, which ships Azure CLI 2.59.0 instead of the 2.58.0 in the previous image.

The image went from 20240324.2.0 to 20240407.1.0.

You can see which image your run uses in the "Set up job" step at the very top.
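If you would rather have a job flag this up front than fail 10 minutes in, a version check can run before the login step. A sketch under stated assumptions: az version --query '"azure-cli"' reads the CLI version from az version's JSON output, check_az_cli_version is a hypothetical helper, and treating exactly 2.59.0 as affected reflects this thread's findings so far, so revisit it once a fixed release ships.

```shell
#!/usr/bin/env bash
# Hypothetical guard step: fail (with a GitHub Actions warning annotation) when the
# runner's Azure CLI is 2.59.0, the version reported here to bypass the token cache.
check_az_cli_version() {
  local affected="2.59.0" version
  version="$(az version --query '"azure-cli"' --output tsv)" || return 1
  if [ "$version" = "$affected" ]; then
    echo "::warning::Azure CLI $version does not reuse cached tokens; OIDC-based jobs may fail with AADSTS700024 once the ID token expires."
    return 1
  fi
  echo "Azure CLI $version"
}
```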


@smokedlinq

@mderriey I assumed something like that, I was more referring to how that broke inside of az.

@dghubble

We started having problems with the v2.59.0 az cli and rolled back as a workaround. I'm not sure what in the CLI release makes this more or less likely to happen.

@jiasli
Member Author

jiasli commented Apr 12, 2024

My question is why this has popped up as an issue recently. We've had pipelines run for well over 20 minutes before and never seen this. But within the last week, it seems any workflow using Azure CLI with OIDC federated auth is experiencing this issue.

@smokedlinq, please refer to my comment #28708 (comment).

@jiasli
Member Author

jiasli commented Apr 12, 2024

I propose a workaround by fetching the OID token every 4 mins to avoid the expiry.

This workaround #28708 (comment) proposed by @iamrk04 of periodically calling az login is not recommended: Azure CLI doesn't support concurrent execution, so you will very likely run into race conditions (#9427, #20273).

We started having problems with the v2.59.0 az cli and rolled back as a workaround.

This workaround #28708 (comment) proposed by @dghubble of using an older version is a valid one.

As I suggested in #28708 (comment), using a service principal secret for authentication is another acceptable workaround.

@andre-qumulo

@jiasli Service principals are unacceptable for some of us as our security certification would require we rotate them on a regular basis. OIDC does not add that additional burden given that they are clearly short lived.

@jiasli jiasli changed the title from "Acquiring access token with expired OIDC token fails with AADSTS700024" to "Azure CLI task fails with AADSTS700024 after 60 minutes" Apr 12, 2024
@jiasli jiasli unpinned this issue Apr 12, 2024
@jiasli
Member Author

jiasli commented Apr 12, 2024

Service principals are unacceptable for some of us as our security certification would require we rotate them on a regular basis. OIDC does not add that additional burden given that they are clearly short lived.

@andre-qumulo, we plan to fix the 5-minute expiration issue in the next version of Azure CLI, 2.60.0, which will be released on 2024-04-30. Using a service principal is only a temporary workaround. Secret rotation usually happens monthly, which is well beyond the time we need to fix this.

I have created a separate issue to track it:

@TomWildenhain

I'm running into the same issue in Azure DevOps for a pipeline that runs a long Python script (2h40m) in an AzureCLI@2 task. It was working fine on Friday (April 5th) but has been failing since then with this error:

AzureCliCredential: ERROR: AADSTS700024: Client assertion is not within its valid time range. ...

Any ideas on whether an equivalent workaround is possible in Azure DevOps to refresh the token every 9 minutes?

Thanks @jiasli! The mitigation steps for Azure DevOps provided here of using a service principal secret were effective.

(I ran into some trouble finding the organization id while following the instructions but was able to find the organization id with these steps: https://medium.com/@shivapatel1102001/get-list-of-organization-from-azure-devops-microsoft-account-861ea29dae93)

@jiasli
Member Author

jiasli commented Apr 13, 2024

@TomWildenhain, based on my understanding, the steps provided by https://learn.microsoft.com/en-us/azure/devops/pipelines/library/connect-to-azure?view=azure-devops don't require organization ID when creating a service connection using service principal secret. Could you let me know which article you are following?

@geekzter
Member

@jiasli Org id is a 1P policy.

sk593 added a commit to radius-project/radius that referenced this issue Apr 15, 2024
# Description

Long-running tests have started to fail due to the az login
authentication expiring. Based on the error message (see linked issue),
the authentication is only valid for 5 minutes. This seems to be a known
issue [based on discussion
here](Azure/azure-cli#28708 (comment))
but there's no fix so adding this temporary workaround until there's a
way to extend the time the auth is valid.


## Type of change


- This pull request fixes a bug in Radius and has an approved issue
(issue link required).
- This pull request adds or changes features of Radius and has an
approved issue (issue link required).
- This pull request is a minor refactor, code cleanup, test improvement,
or other maintenance task and doesn't change the functionality of Radius
#7490


Fixes: #issue_number

Signed-off-by: sk593 <[email protected]>
@TomWildenhain

@jiasli Thanks for your help. I was following the instructions in a banner at the top of ADO after creating the manual service connection. The banner states:

Manually created service connections use an App Registration that was created by the user. Please add a federated credential to the App Registration with the following details: Issuer: https://vstoken.dev.azure.com/<org id>, Subject identifier: sc://<org>/<project>/<sc name>. Learn more

With a link to: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/configure-workload-identity?view=azure-devops

I used the instructions to call the API here to get the org id: https://medium.com/@shivapatel1102001/get-list-of-organization-from-azure-devops-microsoft-account-861ea29dae93

@jiasli
Member Author

jiasli commented Apr 16, 2024

@TomWildenhain, thanks for the information. If you used service principal secret to create the service connection, I don't think the federated identity credential added to the app is actually used.

@nlighten

nlighten commented Apr 16, 2024

@jiasli Is it possible to give any realistic timeline for a fix? I am wondering if it makes sense to ask for a rollback of the cli version contained in actions/runner-images that is used by both Github Actions and Azure DevOps.

@pkoushik

We are seeing the same issue related to moving away from service principal secrets.

We are looking into wrapping all Az CLI calls that use the ARM token with logic that refreshes it (in the foreground, not as a background process): take the OIDC token from idToken and reuse it to log in again via az account clear && az login ...
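A foreground refresh like this could look roughly as follows: wrap each az call and, if the last login is older than a threshold, clear the account and log in again with a fresh ID token before running the command. Everything here is an assumption for illustration: az_fresh, AZ_LAST_LOGIN, REFRESH_AFTER and ID_TOKEN_CMD are made-up names, ID_TOKEN_CMD must name a command that prints a fresh ID token (for example the curl request against ACTIONS_ID_TOKEN_REQUEST_URL on GitHub Actions), and the 240-second default assumes GitHub's 5-minute ID token lifetime. Since nothing runs in the background, it sidesteps the concurrency problem raised about the background az login loop, but it cannot help a single long-running command (such as a script using AzureCliCredential internally), because it only acts between commands.

```shell
#!/usr/bin/env bash
# Hypothetical foreground refresh wrapper: before running an az command, log in
# again if the current federated login is older than REFRESH_AFTER seconds.
# Assumed inputs: AZURE_CLIENT_ID, AZURE_TENANT_ID, optional AZURE_SUBSCRIPTION_ID,
# and ID_TOKEN_CMD, the name of a command that prints a fresh ID token on stdout.
AZ_LAST_LOGIN=0
REFRESH_AFTER="${REFRESH_AFTER:-240}"  # stay below a 5-minute ID token lifetime

az_fresh() {
  local now token
  now="$(date +%s)"
  if [ $(( now - AZ_LAST_LOGIN )) -ge "$REFRESH_AFTER" ]; then
    token="$("$ID_TOKEN_CMD")" || return 1
    az account clear
    az login --service-principal -u "$AZURE_CLIENT_ID" -t "$AZURE_TENANT_ID" \
      --federated-token "$token" --output none || return 1
    if [ -n "${AZURE_SUBSCRIPTION_ID:-}" ]; then
      az account set --subscription "$AZURE_SUBSCRIPTION_ID" || return 1
    fi
    AZ_LAST_LOGIN="$now"
  fi
  # No background job is involved, so this wrapper never runs az concurrently.
  az "$@"
}
```

Call it directly, e.g. az_fresh webapp stop ..., rather than inside $( ), because a subshell would not remember the updated AZ_LAST_LOGIN and would log in on every call.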

@MoazzemHossain-bot

If you can help resolve this, it would be much appreciated.

@panpanwa

I have exactly the same use case as @TomWildenhain. Is there a way to make the token validity period customizable? We can't use a service principal, as that's discouraged by the credential-free best practices.

Even a workaround would be much appreciated.

@jhwj9617

jhwj9617 commented Jun 1, 2024

Have the same issue for our long-running tasks:

[01:50:31 INF]  ---> (Inner Exception #3) Azure.Identity.CredentialUnavailableException: Azure CLI authentication failed due to an unknown error. See the troubleshooting guide for more information. https://aka.ms/azsdk/net/identity/azclicredential/troubleshoot ERROR: AADSTS700024: Client assertion is not within its valid time range. Current time: 2024-06-01T01:50:31.1765304Z, assertion valid from 2024-06-01T00:49:55.0000000Z, expiry time of assertion 2024-06-01T00:59:55.0000000Z. Review the documentation at https://docs.microsoft.com/azure/active-directory/develop/active-directory-certificate-credentials . Trace ID: 48af9e38-7793-458d-94af-c2962d617700 Correlation ID: 0f495332-706e-4dba-a18e-1f844f5d7a7d Timestamp: 2024-06-01 01:50:31Z
[01:50:31 INF] Interactive authentication is needed. Please run:
[01:50:31 INF] az login

@panpanwa

panpanwa commented Jun 1, 2024

@jhwj9617 refer to the solution provided by Kapsztajn and @iamrk04, it works for me too.

Still, I believe users shouldn't have to apply this workaround themselves!

@jhwj9617

jhwj9617 commented Jun 2, 2024

@panpanwa, we are not using GitHub Actions. We're using Azure DevOps YAML pipelines, e.g.:

- task: AzureCLI@2
  displayName: Run load profile
  inputs:
    azureSubscription: $(federatedCredConnection)
    scriptType: ps
    scriptLocation: scriptPath
    scriptPath: $(Pipeline.Workspace)/test.ps1
