Added capability in azure-cosmos-spark to allow the spark environment to support access tokens via AccountDataResolver #40079
Description
The Cosmos DB Spark connector has always had an option for Spark environments to transform the user's configuration to allow adding custom authentication, for example for linked services in Synapse. The service used for this is `com.azure.cosmos.spark.AccountDataResolver`. This PR extends the possible custom authentication methods to include access tokens (for example for AAD auth via Managed Identity or Service Principal).

Authentication options in the Cosmos DB Spark connector
The Cosmos DB Spark connector comes with the following built-in authentication mechanisms (the one to use is selected via the `spark.cosmos.auth.type` config):

- `MasterKey` - symmetric-key-based authentication. The key can be the read-only primary/secondary key as long as only reads are done, or one of the read-write keys. This is the default authentication mechanism if the `spark.cosmos.auth.type` config is not specified.
- `ServicePrincipal` - an option to provide a service principal (via its AppId) and the symmetric client secret. The following configuration entries are required:
  - `spark.cosmos.auth.aad.clientId` - the service principal's AppId
  - `spark.cosmos.auth.aad.clientSecret` - the symmetric client secret for authenticating as the service principal
  - `spark.cosmos.account.tenantId` - the tenantId for both the Cosmos DB account and the service principal
  - `spark.cosmos.account.subscriptionId` - the subscriptionId of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
  - `spark.cosmos.account.resourceGroupName` - the resource group name of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
  - `spark.cosmos.account.azureEnvironment` - Optional - the name of the Azure environment (cloud). The default value is `Azure` (public cloud)
- `ManagedIdentity` - an option to use a provided managed identity for authentication. The managed identity needs to be available on the VMs running the Spark driver/executors, and the instance metadata access endpoint needs to be enabled to allow retrieving the managed identity. Azure Databricks allows this authentication type in recent workspaces (if you are missing the user-provided managed identity, you might need to create a new workspace). You can identify the OID of the Databricks managed identity by checking the `dbmanagedidentity` resource in the "managed resource group" for your Azure Databricks workspace. Configuration entries:
  - `spark.cosmos.account.tenantId` - the tenantId for both the Cosmos DB account and the service principal
  - `spark.cosmos.account.subscriptionId` - the subscriptionId of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
  - `spark.cosmos.account.resourceGroupName` - the resource group name of the Cosmos DB account - NOTE - this might not be required anymore in a few weeks.
  - `spark.cosmos.auth.aad.clientId` - Optional - the managed identity's clientId - this helps to pick the right one when you have more than one managed identity.
  - `spark.cosmos.auth.aad.resourceId` - Optional - the managed identity's resourceId - this helps to pick the right one when you have more than one managed identity.
  - `spark.cosmos.account.azureEnvironment` - Optional - the name of the Azure environment (cloud). The default value is `Azure` (public cloud)
- `AccessToken` - this is the hint that delegates providing access tokens to an implementation of the `AccountDataResolver` trait - AAD auth will be used with the bearer token provided in the Cosmos DB connector
  - `spark.cosmos.account.tenantId`, `spark.cosmos.account.subscriptionId` and `spark.cosmos.account.resourceGroupName` are still required but will often be provided via config transformation by `AccountDataResolver.getAccountDataConfig` in the `AccountDataResolver` implementation.

Necessary steps when building a custom `AccountDataResolver` implementation

- An implementation of the `com.azure.cosmos.spark.AccountDataResolver` trait needs to be available on the class path

The `azure-cosmos-spark-account-data-resolver-sample` module (see sample/AccountTokenResolverSample.ipynb) provides a sample implementation of authentication via master key, ServicePrincipal (password and client cert) as well as managed identity via a custom `AccountDataResolver`.
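As a rough illustration of the configuration entries described above, the following Python sketch builds a `ServicePrincipal` configuration and checks it for missing required entries. It is not part of this PR: `ACCOUNT_KEYS`, `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced only for this example, the angle-bracket values are placeholders, and `spark.cosmos.accountEndpoint` is an assumed connector setting that the description does not mention.

```python
# Entries the description lists as required for the AAD-based auth types.
ACCOUNT_KEYS = [
    "spark.cosmos.account.tenantId",
    "spark.cosmos.account.subscriptionId",
    "spark.cosmos.account.resourceGroupName",
]

# Hypothetical lookup table: auth type -> entries the description marks as required.
# MasterKey entries are not listed in the description, so none are checked here.
REQUIRED_KEYS = {
    "ServicePrincipal": ACCOUNT_KEYS + [
        "spark.cosmos.auth.aad.clientId",
        "spark.cosmos.auth.aad.clientSecret",
    ],
    "ManagedIdentity": ACCOUNT_KEYS,
    # For AccessToken these entries may be supplied later by the
    # AccountDataResolver's config transformation instead of by the user.
    "AccessToken": ACCOUNT_KEYS,
}


def missing_keys(config):
    """Return the required entries that are absent from `config` (hypothetical helper)."""
    # MasterKey is the default when spark.cosmos.auth.type is not specified.
    auth_type = config.get("spark.cosmos.auth.type", "MasterKey")
    return [key for key in REQUIRED_KEYS.get(auth_type, []) if key not in config]


service_principal_config = {
    # Assumed endpoint setting; not named in the description above.
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.auth.type": "ServicePrincipal",
    "spark.cosmos.auth.aad.clientId": "<app-id>",
    "spark.cosmos.auth.aad.clientSecret": "<client-secret>",
    "spark.cosmos.account.tenantId": "<tenant-id>",
    "spark.cosmos.account.subscriptionId": "<subscription-id>",
    "spark.cosmos.account.resourceGroupName": "<resource-group>",
}

print(missing_keys(service_principal_config))  # prints []
```

In a notebook, a dictionary like this would typically be passed to a reader or writer via `.options(**service_principal_config)`, together with the usual database and container settings.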