Local XFS devices in Kubernetes with Redpanda Operator #681
This is amazing, very well researched 👏 I don't really have comments here; I like the proposed approach, and the change in controller code is actually reasonably small. I just wonder what our test plan will be, but I assume we don't need to test that in particular since we would effectively be testing Kubernetes. Maybe we should also move this RFC into the docs/rfc folder?
This is great, thanks @dimitriscruz and sorry for the delay getting to look at this!
Executive Summary
Use local NVMe devices formatted with XFS to host Redpanda's data directory when deploying on Kubernetes through the Operator.
What is being proposed
Provide a local NVMe SSD on each host and expose it as a PV for consumption by the local Redpanda Pod.
Why (short reason)
Using an XFS partition on a local NVMe SSD is recommended for maximum throughput.
How (short plan)
A PersistentVolume is created for each NVMe SSD (one per node) and is associated with a new storage class, e.g., "local-storage-nvme". Once the redpanda Cluster resource is created, each generated PVC points to the storage class. Each pod is scheduled to a node at which point its PVC is bound to the local PV and the volume is formatted with XFS (if not already).
Impact
The Kubernetes operator currently uses the "default" storage class, which varies across Kubernetes distributions and deployments and often implies using the root filesystem to store Redpanda's data directory. This proposal gives each Redpanda process an XFS filesystem backed by a local NVMe SSD, which is a must for performance.
Guide-level explanation
Background Concepts
A StorageClass includes a `parameters` field, a `map[string]string` that is passed to the volume plugin.
Example
A user creates a Redpanda cluster CR. The Redpanda Operator in response creates a StatefulSet that includes a PersistentVolumeClaim for each Pod. Each PVC points to our "local-storage-nvme" storage class. Kubernetes verifies that each node has a PV with our storage class (and requested capacity). Each device is then formatted with XFS (if not already), the Pods are scheduled across the nodes, and the filesystem is mounted by the Redpanda container under its data folder.
Reference-level explanation
Detailed design - What needs to change to get there
The proposed changes are minimal. The Cluster CRD is expanded to include a storage section that will contain a storage class name and the requested storage capacity. The operator simply includes these values in the StatefulSet spec it creates, specifically under the PVC template.
The more important changes are in operating the cluster, i.e., ensuring devices are available and creating or deleting the PVs. Note that this is only needed when local devices are used; an existing storage class could be used without changes.
Detailed design - How it works
As is already the case, the Operator creates a StatefulSet with a PersistentVolumeClaimTemplate, which generates PVC resources. Currently the storage class is not set, so the default one is used. We introduce this missing `storageClassName` field along with a storage capacity field.
The Cluster CR would therefore look as follows (although the added section is optional):
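A minimal sketch of such a CR is shown below. The API group/version and the exact field names under `storage` are assumptions for illustration and may differ from the actual CRD schema.

```yaml
# Sketch of a Cluster CR with the proposed storage section.
# Field names are illustrative; check the actual CRD schema.
apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: redpanda-cluster
spec:
  replicas: 3
  storage:
    capacity: 100Gi                       # requested capacity per Pod's PVC
    storageClassName: local-storage-nvme  # the admin-created storage class
```

The operator would copy these two values verbatim into the PVC template of the StatefulSet it creates.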
Prior to creating a CR, an administrator must ensure an NVMe SSD is available on each cluster node (to be used). The admin creates a storage class with no provisioner. Its `volumeBindingMode` is set to `WaitForFirstConsumer`, so that a PVC is bound to a PV only at the time the Pod consuming it is scheduled; binding then takes into account the scheduling possibilities of the Pod. For each device, the admin creates a PV that points to the above storage class.
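These administrator steps can be sketched with the manifests below. The node name, device path, and capacity are placeholders; the PV capacity must cover what the PVCs request.

```yaml
# Storage class with no provisioner; binding is delayed until Pod scheduling.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# One PV per NVMe device, pinned to its node via nodeAffinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme-node-1          # placeholder name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage-nvme
  local:
    path: /dev/nvme0n1       # placeholder device path
    fsType: xfs              # Kubernetes formats the device if needed
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1     # placeholder node name
```

The `nodeAffinity` section is what makes the PV "local": the scheduler can only bind a PVC to this PV for a Pod placed on that node.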
On Cluster deletion, the Redpanda operator deletes the StatefulSet. However, deleting and/or scaling a StatefulSet down does not delete the volumes associated with the StatefulSet and neither does the Redpanda operator. This is done to ensure data safety.
Scenarios
In addition to creating a cluster from scratch, there are a few more scenarios to consider.
Single Pod deletion (or restart)
`kubectl delete pod cluster-i` deletes the ith Pod in the cluster directly. This is not advised but may nevertheless happen, e.g., due to a fault in the container process. The configuration, being on a pod volume (emptyDir), does not survive deletion. However, the required configuration is also stored under the data folder, which is backed by the PVC. Once the StatefulSet controller restarts the deleted Pod, the Redpanda process should rejoin the cluster as before. The Pod is scheduled on the same Node.
Cluster update
Cluster recreation (delete CR then create)
`get config` but in principle we could solve this problem.
Drawbacks
Use of local devices comes with the risk of data loss. For example, GCE considers its local SSDs ephemeral: they are deleted when a VM is stopped (or deleted). Network-attached devices may be applicable for low-performance use cases. Redpanda has topic replicas, so this may not be as problematic; nevertheless, backing up data to a cloud object store like S3 can be useful. Finally, other solutions involving multi-region clusters may be applicable.
The PV requires a manual cleanup and deletion (unless an external static provisioner is used).
The proposed feature is optional, so UX is affected only if one decides to use local devices. In that case, users must ensure local devices are available and create the corresponding PVs. Moreover, PVs must be recreated once released (upon StatefulSet or Cluster CR deletion). This process must be part of a guide.
Rationale and Alternatives
We have a number of alternatives depending on the level of automation and reliance on k8s functionality.
Approach 1
The operator mounts `/dev`, finds the device, formats it, and mounts it under the data directory. This could happen using an init container. No PV or PVC is created. Cons: no k8s awareness of the device, additional responsibility for the operator.
Approach 2
As in approach (1), but create a block PV pointing to the device and mount it as a PV to the Pod. The rest is the same: format it and mount it under the data directory, using an init container. No PVC is created. Cons: similar to approach 1.
Approach 3 (tried and proposed)
Create a PV of type `local` with `fsType: xfs` and set the `path` to the block device. Create a PVC to claim the PV and use it as normal. The device is formatted by Kubernetes (by a local process). Cons: afaik, XFS formatting options are not exposed. Pros: clean, uses built-in features.
Approach 4 (proposed for next iteration)
As in approach (3), but the PVs are automatically created by a static provisioner that runs on each node and monitors for eligible devices. (It performs some of the tasks that approaches 1 and 2 would.) The provisioner also monitors for PVs that are "released" and recreates them. Pros: as in approach (3), plus it eliminates part of the manual steps. Cons: requires deployment of a DaemonSet and adds a dependency. Provisioner: https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner. Bonus: it is used by Uber's M3DB (https://github.com/m3db/m3db-operator), as described here: https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/
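As a sketch, the static provisioner is configured through a ConfigMap that maps a storage class to a discovery directory on each node. The values below are illustrative assumptions and should be checked against the provisioner's documentation.

```yaml
# Illustrative config for sig-storage-local-static-provisioner.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage-nvme:
      hostDir: /mnt/disks    # directory on the host scanned for devices
      mountDir: /mnt/disks   # same path as seen inside the provisioner Pod
      volumeMode: Filesystem
      fsType: xfs
```

A DaemonSet running the provisioner on every node would discover devices under `hostDir`, create a matching PV for each, and recreate PVs after they are released, replacing the manual steps of approach (3).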
Unresolved questions