Added capability in azure-cosmos-spark to allow the spark environment to support access tokens via AccountDataResolver #40079

Merged

@FabianMeiswinkel FabianMeiswinkel commented May 8, 2024

Description

The Cosmos DB Spark connector has always allowed Spark environments to transform the user's configuration in order to add custom authentication, for example for linked services in Synapse. The service used for this is com.azure.cosmos.spark.AccountDataResolver. This PR extends the possible custom authentication methods to include access tokens (for example for AAD auth via Managed Identity or Service Principal).

Authentication options in the Cosmos DB Spark connector

The Cosmos DB Spark connector comes with the following built-in authentication mechanisms. The one to use is selected via the spark.cosmos.auth.type config.

  • MasterKey - symmetric key based authentication. This can be the read-only primary/secondary key as long as only reads are done, or one of the read-write keys. This is the default authentication mechanism when the spark.cosmos.auth.type config is not specified.
  • ServicePrincipal - an option to provide a service principal (via its AppId) and the symmetric client secret. The following configuration entries are required:
    • spark.cosmos.auth.aad.clientId - the service principal's AppId
    • spark.cosmos.auth.aad.clientSecret - the symmetric client secret for authenticating as the service principal
    • spark.cosmos.account.tenantId - the tenantId for both the Cosmos DB account and the service principal
    • spark.cosmos.account.subscriptionId - the subscriptionId of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
    • spark.cosmos.account.resourceGroupName - the resource group name of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
    • spark.cosmos.account.azureEnvironment - Optional - the name of the Azure environment (cloud). The default value is Azure (public cloud)
  • ManagedIdentity - an option to use a provided managed identity for authentication. The managed identity needs to be available on the VMs running the Spark driver/executors, and the instance metadata access endpoint needs to be enabled so that the managed identity can be retrieved. Azure Databricks allows this authentication type in recent workspaces (if the user provided managed identity is missing, you might need to create a new workspace). You can identify the OID of the Databricks managed identity by checking the dbmanagedidentity resource in the "managed resource group" for your Azure Databricks workspace. The following configuration entries apply:
    • spark.cosmos.account.tenantId - the tenantId for both the Cosmos DB account and the managed identity
    • spark.cosmos.account.subscriptionId - the subscriptionId of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
    • spark.cosmos.account.resourceGroupName - the resource group name of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
    • spark.cosmos.auth.aad.clientId - Optional - the managed identity's clientId - this helps when you have more than one managed identity to pick the right one.
    • spark.cosmos.auth.aad.resourceId - Optional - the managed identity's resourceId - this helps when you have more than one managed identity to pick the right one.
    • spark.cosmos.account.azureEnvironment - Optional - the name of the Azure environment (cloud). The default value is Azure (public cloud)
  • AccessToken - this is the hint that delegates providing access tokens to an implementation of the AccountDataResolver trait. AAD auth will be used in the Cosmos DB connector with the bearer token provided by that implementation.
    • The configurations for spark.cosmos.account.tenantId, spark.cosmos.account.subscriptionId and spark.cosmos.account.resourceGroupName are still required but will often be provided via config transformation by the implementation of AccountDataResolver.getAccountDataConfig
    • See the next section for a few hints in case you want to create your own AccountDataResolver implementation.
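
To make the per-type requirements above easier to check, here is a minimal Python sketch that transcribes the required keys from the list and reports which ones a given configuration is missing. The helper name `missing_keys`, the `REQUIRED_KEYS` table and all placeholder values are illustrative assumptions, not part of the connector. The sketch only covers the auth-related entries named in this description; it does not model optional entries or any config transformation done by an AccountDataResolver.

```python
# Required keys per auth type, transcribed from the list in this description.
# REQUIRED_KEYS and missing_keys are hypothetical helpers, not connector APIs.
ACCOUNT_KEYS = [
    "spark.cosmos.account.tenantId",
    "spark.cosmos.account.subscriptionId",
    "spark.cosmos.account.resourceGroupName",
]

REQUIRED_KEYS = {
    # The description lists no additional entries for MasterKey.
    "MasterKey": [],
    "ServicePrincipal": [
        "spark.cosmos.auth.aad.clientId",
        "spark.cosmos.auth.aad.clientSecret",
    ] + ACCOUNT_KEYS,
    "ManagedIdentity": list(ACCOUNT_KEYS),
    # For AccessToken these are often injected later by the resolver's
    # config transformation, so missing keys here may be acceptable.
    "AccessToken": list(ACCOUNT_KEYS),
}


def missing_keys(config):
    """Return the required keys that are absent or empty in `config`."""
    # MasterKey is the default when spark.cosmos.auth.type is not specified.
    auth_type = config.get("spark.cosmos.auth.type", "MasterKey")
    if auth_type not in REQUIRED_KEYS:
        raise ValueError(f"Unknown spark.cosmos.auth.type: {auth_type}")
    return [key for key in REQUIRED_KEYS[auth_type] if not config.get(key)]


# Placeholder values only; replace them with real identifiers.
service_principal_config = {
    "spark.cosmos.auth.type": "ServicePrincipal",
    "spark.cosmos.auth.aad.clientId": "<service-principal-app-id>",
    "spark.cosmos.auth.aad.clientSecret": "<client-secret>",
    "spark.cosmos.account.tenantId": "<tenant-id>",
    "spark.cosmos.account.subscriptionId": "<subscription-id>",
    "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
}

if __name__ == "__main__":
    print(missing_keys(service_principal_config))
    print(missing_keys({"spark.cosmos.auth.type": "ManagedIdentity"}))
```

A dictionary like `service_principal_config` would typically be merged with the account endpoint, database and container settings (not covered in this description) before being passed to the Spark reader or writer as options.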

Necessary steps when building a custom AccountDataResolver implementation

  • An implementation of the com.azure.cosmos.spark.AccountDataResolver trait needs to be available on the class path.
  • A file at "src/main/resources/META-INF/services/com.azure.cosmos.spark.AccountDataResolver" needs to exist so that the JVM can identify the class implementing the service. This file needs to be part of the jar that you use to deploy the custom AccountDataResolver.
  • The module azure-cosmos-spark-account-data-resolver-sample (see sample/AccountTokenResolverSample.ipynb) provides a sample implementation of authentication via master key, ServicePrincipal (password and client cert) as well as managed identity via a custom AccountDataResolver.
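
The service registration file from the second step follows the standard Java ServiceLoader convention: the file is named after the service interface and contains the fully qualified name of each implementing class, one per line. As a rough sketch, the Python helper below generates such a file. The function name `write_service_registration` and the class name `com.example.MyAccountDataResolver` are hypothetical; substitute the class from your own jar.

```python
import tempfile
from pathlib import Path

# Name of the service interface, taken from the description above.
SERVICE_INTERFACE = "com.azure.cosmos.spark.AccountDataResolver"


def write_service_registration(resources_dir, implementation_class):
    """Write META-INF/services/<interface> listing the implementing class.

    `resources_dir` would normally be src/main/resources of the module
    that gets packaged into the jar.
    """
    services_dir = Path(resources_dir) / "META-INF" / "services"
    services_dir.mkdir(parents=True, exist_ok=True)
    service_file = services_dir / SERVICE_INTERFACE
    # ServiceLoader expects one fully qualified class name per line.
    service_file.write_text(implementation_class + "\n", encoding="utf-8")
    return service_file


if __name__ == "__main__":
    # Demo in a temporary directory; the class name is a made-up example.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_service_registration(tmp, "com.example.MyAccountDataResolver")
        print(path.relative_to(tmp))
        print(path.read_text(encoding="utf-8"), end="")
```

In a Maven or sbt project the file is usually created by hand and checked in; the helper only makes the expected name and content explicit.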

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes via the azure-cosmos-spark-account-data-resolver-sample module.

… to support access tokens via AccountDataResolver

@xinlian12 xinlian12 left a comment

LGTM, thanks

@FabianMeiswinkel FabianMeiswinkel merged commit 05f7511 into Azure:main May 14, 2024
40 checks passed