Local XFS devices in Kubernetes with Redpanda Operator #681
This is amazing, very well researched 👏 I don't really have comments here; I like the proposed approach, and the change in controller code is actually reasonably small. I just wonder what our test plan will be, but I assume we don't need to test that in particular since we would effectively be testing Kubernetes. Maybe we should also move this RFC into the docs/rfc folder?
This is great, thanks @dimitriscruz and sorry for the delay getting to look at this!
Executive Summary
Use local NVMe devices formatted with XFS to host Redpanda's data directory when deploying on Kubernetes through the Operator.
What is being proposed
Provide a local NVMe SSD on each host and expose it as a PV for consumption by the local Redpanda Pod.
Why (short reason)
Using an XFS partition on a local NVMe SSD is recommended for maximum throughput.
How (short plan)
A PersistentVolume is created for each NVMe SSD (one per node) and is associated with a new storage class, e.g., "local-storage-nvme". Once the redpanda Cluster resource is created, each generated PVC points to the storage class. Each pod is scheduled to a node at which point its PVC is bound to the local PV and the volume is formatted with XFS (if not already).
Impact
The Kubernetes operator currently uses the "default" storage class, which varies across Kubernetes distributions and deployments and often implies using the root filesystem to store Redpanda's data directory. This proposal gives each Redpanda process an XFS filesystem backed by a local NVMe SSD, which is a must for performance.
Guide-level explanation
Background Concepts
A StorageClass includes a `parameters` field, a `map[string]string` that is passed to the volume plugin.
Example
A user creates a Redpanda cluster CR. The Redpanda Operator in response creates a StatefulSet that includes a PersistentVolumeClaim for each Pod. Each PVC points to our "local-storage-nvme" storage class. Kubernetes verifies that each node has a PV with our storage class (and requested capacity). Each device is then formatted with XFS (if not already), the Pods are scheduled across the nodes, and the filesystem is mounted by the Redpanda container under its data folder.
Reference-level explanation
Detailed design - What needs to change to get there
The proposed changes are minimal. The Cluster CRD is expanded to include a storage section that will contain a storage class name and the requested storage capacity. The operator simply includes these values in the StatefulSet spec it creates, specifically under the PVC template.
The more important changes are in operating the cluster, i.e., ensuring devices are available and creating or deleting the PVs. Note that this is only needed when local devices are used; an existing storage class could be used without changes.
Detailed design - How it works
As is already the case, the Operator creates a StatefulSet with a PersistentVolumeClaimTemplate, which generates PVC resources. Currently the storage class is not set, so the default one is used. We introduce this missing `storageClassName` field along with a storage capacity field.
The Cluster CR would therefore look as follows (although the added section is optional):
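A minimal sketch of such a CR is shown below. The API group/version and the exact field names under `storage` are assumptions for illustration and may differ from the actual CRD schema.

```yaml
# Sketch of a Cluster CR with the proposed storage section.
# Field names are illustrative; check the actual CRD schema.
apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: redpanda-cluster
spec:
  replicas: 3
  storage:
    capacity: 100Gi                       # requested capacity per Pod's PVC
    storageClassName: local-storage-nvme  # the admin-created storage class
```

The operator would copy these two values verbatim into the PVC template of the StatefulSet it creates.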
Prior to creating a CR, an administrator must ensure an NVMe SSD is available on each cluster node (to be used). The admin creates a storage class with no provisioner. Its `volumeBindingMode` is set to `WaitForFirstConsumer`, so that a PVC is bound to a PV only at the time the Pod consuming it is scheduled; binding then takes into account the scheduling possibilities of the Pod. For each device, the admin creates a PV that points to the above storage class.
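These administrator steps can be sketched with the manifests below. The node name, device path, and capacity are placeholders; the PV capacity must cover what the PVCs request.

```yaml
# Storage class with no provisioner; binding is delayed until Pod scheduling.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# One PV per NVMe device, pinned to its node via nodeAffinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme-node-1          # placeholder name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage-nvme
  local:
    path: /dev/nvme0n1       # placeholder device path
    fsType: xfs              # Kubernetes formats the device if needed
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1     # placeholder node name
```

The `nodeAffinity` section is what makes the PV "local": the scheduler can only bind a PVC to this PV for a Pod placed on that node.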
On Cluster deletion, the Redpanda operator deletes the StatefulSet. However, deleting and/or scaling a StatefulSet down does not delete the volumes associated with the StatefulSet and neither does the Redpanda operator. This is done to ensure data safety.
Scenarios
In addition to creating a cluster from scratch, there are a few more scenarios to consider.
Single Pod deletion (or restart)
`kubectl delete pod cluster-i` deletes the ith Pod in the cluster directly. This is not advised but may nevertheless happen, e.g., due to a fault in the container process. The configuration, being on a pod volume (emptyDir), does not survive deletion. However, the required configuration is also stored under the data folder, which is backed by the PVC. Once the StatefulSet controller restarts the deleted Pod, the Redpanda process should rejoin the cluster as before. The Pod is scheduled on the same Node.
Cluster update
Cluster recreation (delete CR then create)
`get config` but in principle we could solve this problem.
Drawbacks
Use of local devices comes with the risk of data loss. For example, GCE considers its local SSDs ephemeral: they are deleted when a VM is stopped (or deleted). Network-attached devices may be applicable for low-performance use cases. Redpanda has topic replicas, so this may not be as problematic; nevertheless, backing up data to a cloud object store like S3 can be useful. Finally, other solutions involving multi-region clusters may be applicable.
The PV requires a manual cleanup and deletion (unless an external static provisioner is used).
The proposed feature is optional, so UX is affected only if one decides to use local devices. In that case, users must ensure local devices are available and create the corresponding PVs. Moreover, PVs must be recreated once released (upon StatefulSet or Cluster CR deletion). This process must be part of a guide.
Rationale and Alternatives
We have a number of alternatives depending on the level of automation and reliance on k8s functionality.
Approach 1
The operator mounts `/dev`, finds the device, formats it, and mounts it under the data directory. This could happen using an init container. No PV or PVC is created. Cons: no k8s awareness of the device, additional responsibility for the operator.
Approach 2
As in approach (1), but create a block PV pointing to the device and mount it as a PV to the Pod. The rest is the same: format it and mount it under the data directory, using an init container. No PVC is created. Cons: similar to approach 1.
Approach 3 (tried and proposed)
Create a PV of type `local` with `fsType: xfs` and set the `path` to the block device. Create a PVC to claim the PV and use it as normal. The device is formatted by Kubernetes (by a local process). Cons: afaik, XFS formatting options are not exposed. Pros: clean, uses built-in features.
Approach 4 (proposed for next iteration)
As in approach (3), but the PVs are automatically created by a static provisioner that runs on each node and monitors for eligible devices. (It performs some of the tasks that approaches 1 and 2 would.) The provisioner also monitors for PVs that are "released" and recreates them. Pros: as in approach (3), plus it eliminates part of the manual steps. Cons: requires deployment of a DaemonSet and adds a dependency. Provisioner: https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner. Bonus: it is used by Uber's M3DB (https://github.com/m3db/m3db-operator), as described here: https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/
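As a sketch, the static provisioner is configured through a ConfigMap that maps a storage class to a discovery directory on each node. The values below are illustrative assumptions and should be checked against the provisioner's documentation.

```yaml
# Illustrative config for sig-storage-local-static-provisioner.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage-nvme:
      hostDir: /mnt/disks    # directory on the host scanned for devices
      mountDir: /mnt/disks   # same path as seen inside the provisioner Pod
      volumeMode: Filesystem
      fsType: xfs
```

A DaemonSet running the provisioner on every node would discover devices under `hostDir`, create a matching PV for each, and recreate PVs after they are released, replacing the manual steps of approach (3).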
Unresolved questions