Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REP: Kubernetes Native Ray Authentication #51

Merged
merged 1 commit into from
Jul 22, 2024

Conversation

andrewsykim
Copy link
Contributor

Add REP for supporting authentication for Ray clusters created by KubeRay using the Kubernetes TokenReview API

@kevin85421 kevin85421 self-assigned this May 22, 2024

The sidecar authenticator will use the TokenReview API to authenticate tokens passed into the authorization headers of each request.
Tokens can be created from Kubernetes ServiceAccounts or an external identity provider as long as the Kubernetes API Server is configured with an external authorizer.
Each RayCluster will be provisioned a default ServiceAccount that KubeRay Operator will use when authenticating to the Ray head (specifically for RayJob and RayService).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful for the implementation phase, to explicitly mention here the case of RayJobs with submission mode set to K8sJobMode, for which the Ray CLI command executed in the submission Job will have to be authenticated as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will add that!

3. A ServiceAccount is created per RayCluster resource

The sidecar ensures only requests with the correct authorization tokens can access the Ray head. More details on how tokens are authenticated below.
For the MVP, the sidecar will be implemented using Envoy configured with an [External Authorizer](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also chatted with some users about this project. They have their own internal authentication solution. We should make the interface extensible and not be too opinionated about Envoy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I don't feel too strongly about using Envoy.

I also think implementing from scratch wouldn't be that difficult, we can consider a new binary / image for the auth server.

@kevin85421
Copy link
Member

This PR looks good to me. Maybe we can organize a community design review meeting for this REP.

@kevin85421
Copy link
Member

cc @thomasdesr would you mind reviewing this PR? Thanks!

...
spec:
enableAuthentication: true
authenticationOptions:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some organisations, RayCluster resources will be created by the "data scientist" / end-user persona, while the security aspect will be owned by the platform administrator. The current proposal makes it difficult to enforce that model with Kubernetes RBAC.

The API surface of that new feature should ideally follow usual API graduation. Including it into the RayCluster API, that's v1, will make it difficult to iterate on it and make it evolve, by complying to v1 guarantees.

The RayCluster API has already a large surface. Could the proposal explore the solution space, and research a more modular approach for the configurability aspect, that'd be more symmetrical to the sidecar implementation proposal, i.e. non intrusive, and generally try to keep the single-responsibility principle for the RayCluster API as much as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some organisations, RayCluster resources will be created by the "data scientist" / end-user persona, while the security aspect will be owned by the platform administrator. The current proposal makes it difficult to enforce that model with Kubernetes RBAC.

My assumption is that a platform admin can grant role / rolebindings that give data scientists namespace-wide access to RayClusters (both creating them and accessing them) so they don't need to manage security on a per-cluster basis. You can also make this a ClusterRole to give access to all RayClusters in the cluster. In organizations where data scientists also create RayClusters, they also have the option to by-pass the RBAC check with the fixed list of allowed users (allowedPrincipals).

The API surface of that new feature should ideally follow usual API graduation. Including it into the RayCluster API, that's v1, will make it difficult to iterate on it and make it evolve, by complying to v1 guarantees.

I think this is more of a concern around feature gates and gating changes to v1 APIs that might not be mature yet. I will try to explore adding Kubernetes-like feature gates to KubeRay as we implement this. FYI @kevin85421

The RayCluster API has already a large surface. Could the proposal explore the solution space, and research a more modular approach for the configurability aspect, that'd be more symmetrical to the sidecar implementation proposal, i.e. non intrusive, and generally try to keep the single-responsibility principle for the RayCluster API as much as possible.

Let me give this more thought and let's revisit during code review? In theory the addition of a sidecar is already modular because RayCluster lets you override the entire pod template. So my initial thinking is that enabling authentication in KubeRay should default to a solution that will work for most users out of the box (with something like kube-rbac-proxy)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss more during code review about the API.

@andrewsykim andrewsykim changed the title REP: KubeRay authentication with TokenReview API REP: Kubernetes Native Ray Authentication Jun 7, 2024
## Access Control with TokenReview API

The sidecar authenticator will use the TokenReview API to authenticate tokens passed into the authorization headers of each request.
Tokens can be created from Kubernetes ServiceAccounts or an external identity provider as long as the Kubernetes API Server is configured with an external authorizer.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For KSA tokens, the sidecar should require them to have audiences that do not intersect with the audiences accepted by kube-apiserver. We don't want KSA tokens that grant API access included in requests destined for anywhere other than kube-apiserver. Note that audiences can be specified for mounted tokens using projected volumes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#serviceaccount-token-volume-projection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, added a note about KSA requiring audiences

apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
token: <your_token_here>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to requiring a ray-specific audience here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, added a note about KSA requiring audiences

apiVersion: authentication.k8s.io/v1
kind: TokenReview
status:
authenticated: true
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tokenreview response status must contain the audience the token is good for, and should be verified to intersect with the audience requested in the tokenreview spec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated the status example to include an audience that will be verified with the audience in the spec


The sidecar authenticator will use the TokenReview API to authenticate tokens passed into the authorization headers of each request.
Tokens can be created from Kubernetes ServiceAccounts or an external identity provider as long as the Kubernetes API Server is configured with an external authorizer.
Tokens from Kubernetes Service Accounts are required to specify an audience. For now the placeholder audience is "ray.io".
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, audiences are globally unique; a token meant for interacting with ray cluster A should not be accepted by Ray cluster B

Perhaps the audience should be something like ray.io/cluster-name/foo, where foo is somehow derived from the RayCluster resource?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, updated to require unique auidences


When submitting jobs using the Ray job CLI:
```
export RAY_JOB_HEADERS="Authorization: Bearer $(kubectl create token ray-cluster)"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user will need to be given impersonation permissions for the service account default/ray-cluster. If multiple users are using the cluster, then all their access will be laundered through this service account.

The recommended usage should probably be the one from line 69, where the user authenticates to the ray cluster using the same credential they use to authenticate to the Kubernetes cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recommended usage should probably be the one from line 69, where the user authenticates to the ray cluster using the same credential they use to authenticate to the Kubernetes cluster.

Yes, we will likely recommend this for most cases, the KSA case is only supported for two cases:

  1. For KubeRay operator itself that needs to connect to Ray clusters
  2. For clusters that don't have external identity providers

```
export RAY_JOB_HEADERS="Authorization: Bearer $(kubectl create token ray-cluster)"
OR
export RAY_JOB_HEADERS="Authorization: Bearer $(gcloud auth print-access-token)" # example external identity provider
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised that this would work, but it does seem like TokenReview will understand external tokens by calling the Webhook Token Authenticator, if one is configured.

We'll need to think about how to make this mesh with the audiences from above. GCP (and presumably AWS, Azure, etc) access tokens don't carry scopes, so they won't pass the scope check by default.

Copy link
Contributor Author

@andrewsykim andrewsykim Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised that this would work, but it does seem like TokenReview will understand external tokens by calling the Webhook Token Authenticator, if one is configured.

I was also pleasantly surprised as well :)

We'll need to think about how to make this mesh with the audiences from above. GCP (and presumably AWS, Azure, etc) access tokens don't carry scopes, so they won't pass the scope check by default.

From my testing w/ GKE, the returned audience is something like:

audiences:
- https://container.googleapis.com/v1/projects/<project>/locations/us-east4-c/clusters/<cluster>

Perhaps we need to specify a list of allowed audiences in the RayCluster if we are required to verify the audience?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yes, that seems like it would work.

SubjectAccessReview API

Signed-off-by: Andrew Sy Kim <[email protected]>
@zhe-thoughts
Copy link
Collaborator

We had an in-depth review meeting on this feature on June 4. The vote on ray-committers has also just passed. I'm going to merge this REP

Thanks @andrewsykim for the great contribution and @kevin85421 for shepherding

Screenshot 2024-07-22 at 2 09 56 PM

@zhe-thoughts zhe-thoughts merged commit e39668b into ray-project:main Jul 22, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants