Ephemeral volumes with storage capacity tracking: EmptyDir on steroids


Some applications need to store data, but are perfectly comfortable with the fact that the data will not survive a restart.

For example, caching services are limited by the amount of RAM available, but can also move rarely used data to storage that is slower than RAM with little impact on overall performance. Other applications expect some read-only input data to be present in files, such as configuration settings or secret keys.

Kubernetes already supports several types of ephemeral volumes, but their functionality is limited to what is implemented inside Kubernetes itself.

CSI ephemeral volumes made it possible to extend Kubernetes with CSI drivers that provide lightweight local volumes. These can inject arbitrary state into a pod: configuration, secrets, identity data, variables and so on. CSI drivers must be modified specifically for this Kubernetes feature, so regular, standards-compliant drivers will not work, but by design such volumes are supposed to be usable on whatever node is chosen for a pod.
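For reference, a CSI ephemeral volume is declared inline in the pod spec, roughly as in the sketch below; the driver name and volume attributes are purely illustrative and in practice driver-specific.

volumes:
- name: config-volume
  csi:
    driver: inline.example.com   # hypothetical CSI driver that supports the Ephemeral lifecycle mode
    volumeAttributes:            # arbitrary, driver-specific parameters
      source: app-config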

Using such volumes on arbitrary nodes can be a problem when they consume significant node resources, or when the storage is available only on some nodes. Therefore Kubernetes 1.19 introduces two new alpha features for volumes that are conceptually closer to EmptyDir volumes:

  • general-purpose ephemeral volumes;

  • CSI storage capacity tracking.

Benefits of the new approach:

  • storage can be local or network-attached;

  • volumes can have a specified size that cannot be exceeded by the application;

  • works with any CSI driver that supports provisioning persistent volumes and (for capacity tracking) implements the GetCapacity call;

  • volumes may have some initial data depending on the driver and parameters;

  • all typical volume operations (creating a snapshot, resizing, etc.) are supported;

  • volumes can be used with any application controller that accepts a pod or volume specification;

  • the Kubernetes scheduler chooses suitable nodes itself, so there is no longer any need to install and configure scheduler extenders and mutating webhooks.

Use cases

General-purpose ephemeral volumes are therefore suitable for the following use cases:

Persistent memory as a DRAM replacement for memcached

Recent releases of memcached added support for using persistent memory (e.g. Intel Optane – translator's note) instead of regular DRAM. When deploying memcached through an application controller, general-purpose ephemeral volumes let it request a volume of a given size from PMEM using a CSI driver such as PMEM-CSI.

Local LVM storage as scratch space

Applications that work with data sets larger than RAM can request local storage with size or performance characteristics that regular Kubernetes EmptyDir volumes cannot provide. TopoLVM, for example, was written for exactly this purpose.

Read-only access to volumes with data

Provisioning a volume may result in a volume that already contains data, for example when it is restored from a snapshot or cloned from another volume. Such volumes can be mounted in read-only mode.

How it works

General Purpose Ephemeral Volumes

A key feature of general-purpose ephemeral volumes is the new volume source, EphemeralVolumeSource, which contains all the fields needed to create a volume claim (historically called a persistent volume claim, PVC). A new controller in kube-controller-manager watches for Pods with such a volume source and creates a PVC for each of them. To the CSI driver this claim looks like any other, so no special support is needed there.
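In a pod spec the new volume source sits alongside the familiar volume types. A minimal sketch is shown below (the complete PMEM-CSI example appears later in this article); the volume name and size are illustrative:

volumes:
- name: scratch                      # illustrative name; the PVC will be called <pod name>-scratch
  ephemeral:
    volumeClaimTemplate:             # becomes the spec of the automatically created PVC
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi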

As long as such PVCs exist, they can be used like any other volume claim. In particular, they can be referenced as a data source when cloning a volume or creating a snapshot. The PVC object also reflects the current state of the volume.

The names of the automatically created PVCs are deterministic: the pod name and the volume name joined with a hyphen. Deterministic names make it easier to interact with the PVC, because it does not have to be looked up if the pod name and volume name are known. The downside is that the name may already be in use; Kubernetes detects this and blocks the pod from starting.
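For example, with the pod and volume names used in the demo later in this article, the automatically created claim can be fetched directly by its deterministic name:

# PVC name = <pod name>-<volume name>
kubectl get pvc my-csi-app-inline-volume-my-csi-volume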

To make sure that the volume is deleted along with the pod, the controller sets the pod as the owner of the volume claim. When the pod is deleted, the regular garbage collection mechanism removes both the claim and the volume.

Claims are matched to a storage driver through the normal storage class mechanism. Although storage classes with both immediate and late binding (WaitForFirstConsumer) are supported, late binding makes more sense for ephemeral volumes, because the scheduler can then take both node utilization and storage availability into account when choosing a node.
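A storage class with late binding might look roughly like the sketch below; the names are reused from the example later in this article, but the manifest itself is only an illustration, not the actual PMEM-CSI deployment:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-late-binding
provisioner: pmem-csi.intel.com          # name of the CSI driver
volumeBindingMode: WaitForFirstConsumer  # late binding: wait until a pod actually uses the claim

With late binding the scheduler decides where the volume will end up, which brings us to the second new feature.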

Storage capacity tracking

The scheduler usually has no idea where the CSI driver will create a volume, and it has no way to contact the driver directly to request that information. It therefore tries nodes until it finds one on which the volume can be made available (late binding), or leaves the placement decision entirely up to the driver (immediate binding).

The new CSIStorageCapacity API, currently in alpha, makes it possible to store the necessary data in etcd so that it is available to the scheduler. Unlike support for general-purpose ephemeral volumes, storage capacity tracking must be enabled when the driver is deployed: external-provisioner has to publish the capacity information it obtains from the driver through the normal GetCapacity call.

If the scheduler needs to select a node for a pod with an unbound volume that uses late binding, and the driver deployment has opted in by setting the CSIDriver.storageCapacity flag, nodes without enough storage capacity are automatically filtered out. This works for both general-purpose ephemeral and persistent volumes, but not for CSI ephemeral volumes, because Kubernetes cannot interpret their parameters.
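The opt-in flag lives in the driver's CSIDriver object. A minimal sketch, reusing the driver name from the example in this article (the actual PMEM-CSI deployment manifest may differ):

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: pmem-csi.intel.com   # name of the CSI driver
spec:
  storageCapacity: true      # alpha in Kubernetes 1.19, requires the CSIStorageCapacity feature gate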

As usual, volumes with immediate binding are created before pod scheduling and their placement is chosen by the storage driver, so by default external-provisioner skips storage classes with immediate binding, since that capacity information would not be used anyway.

Because the Kubernetes scheduler has to work with potentially stale information, there is no guarantee that the capacity will still be available when a volume is actually created, but the chance that it can be created without retries is higher.


Security

CSIStorageCapacity

CSIStorageCapacity objects are namespaced. When each CSI driver is deployed in its own namespace and, as recommended, RBAC permissions for CSIStorageCapacity are limited to that namespace, it is always clear where the data came from. Kubernetes does not check this, however, and drivers are usually deployed in the same namespace anyway, so in the end drivers are simply expected to behave and not publish incorrect data.

General Purpose Ephemeral Volumes

If users have permission to create a pod (directly or indirectly), they can also create general-purpose ephemeral volumes even if they do not have permission to create a volume claim. This is because the RBAC permission checks are applied to the controller that creates the PVC, not to the user. It needs to be taken into account before enabling this feature on clusters where untrusted users are not supposed to be able to create volumes.

Example

A separate branch of PMEM-CSI contains all the changes needed to bring up a Kubernetes 1.19 cluster inside QEMU virtual machines with both alpha features enabled. The driver code is unchanged; only the deployment differs.

On a suitable machine (Linux, with a non-root user able to use Docker; see the PMEM-CSI documentation for details), the following commands bring up the cluster and install the PMEM-CSI driver:

git clone --branch=kubernetes-1-19-blog-post https://github.com/intel/pmem-csi.git
cd pmem-csi
export TEST_KUBERNETES_VERSION=1.19 TEST_FEATURE_GATES=CSIStorageCapacity=true,GenericEphemeralVolume=true TEST_PMEM_REGISTRY=intel
make start && echo && test/setup-deployment.sh

Once everything is up and running, the output ends with usage instructions:

The test cluster is ready. Log in with [...]/pmem-csi/_work/pmem-govm/ssh.0, run
kubectl once logged in.  Alternatively, use kubectl directly with the
following env variable:
   KUBECONFIG=[...]/pmem-csi/_work/pmem-govm/kube.config

secret/pmem-csi-registry-secrets created
secret/pmem-csi-node-secrets created
serviceaccount/pmem-csi-controller created
...
To try out the pmem-csi driver ephemeral volumes:
   cat deploy/kubernetes-1.19/pmem-app-ephemeral.yaml |
   [...]/pmem-csi/_work/pmem-govm/ssh.0 kubectl create -f -

CSIStorageCapacity objects are not meant to be read by humans, so some post-processing is needed. The Go template below filters by storage class and prints the name, topology and capacity:

$ kubectl get \
        -o go-template='{{range .items}}{{if eq .storageClassName "pmem-csi-sc-late-binding"}}{{.metadata.name}} {{.nodeTopology.matchLabels}} {{.capacity}}
{{end}}{{end}}' \
        csistoragecapacities
csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 30716Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi

A single object has the following content:

$ kubectl describe csistoragecapacities/csisc-6cw8j
Name:         csisc-sqdnt
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  storage.k8s.io/v1alpha1
Capacity:     30716Mi
Kind:         CSIStorageCapacity
Metadata:
  Creation Timestamp:  2020-08-11T15:41:03Z
  Generate Name:       csisc-
  Managed Fields:
    ...
  Owner References:
    API Version:     apps/v1
    Controller:      true
    Kind:            StatefulSet
    Name:            pmem-csi-controller
    UID:             590237f9-1eb4-4208-b37b-5f7eab4597d1
  Resource Version:  2994
  Self Link:         /apis/storage.k8s.io/v1alpha1/namespaces/default/csistoragecapacities/csisc-sqdnt
  UID:               da36215b-3b9d-404a-a4c7-3f1c3502ab13
Node Topology:
  Match Labels:
    pmem-csi.intel.com/node:  pmem-csi-pmem-govm-worker1
Storage Class Name:           pmem-csi-sc-late-binding
Events:                       <none>

Let's try to create a demo application with a single general-purpose ephemeral volume. The contents of pmem-app-ephemeral.yaml:

# This example Pod definition demonstrates
# how to use generic ephemeral inline volumes
# with a PMEM-CSI storage class.
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app-inline-volume
spec:
  containers:
    - name: my-frontend
      image: intel/pmem-csi-driver-test:v0.7.14
      command: [ "sleep", "100000" ]
      volumeMounts:
      - mountPath: "/data"
        name: my-csi-volume
  volumes:
  - name: my-csi-volume
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 4Gi
          storageClassName: pmem-csi-sc-late-binding

After creating it as shown in the instructions above, we have an additional pod and a PVC:

$ kubectl get pods/my-csi-app-inline-volume -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP          NODE                         NOMINATED NODE   READINESS GATES
my-csi-app-inline-volume   1/1     Running   0          6m58s   10.36.0.2   pmem-csi-pmem-govm-worker1   <none>           <none>
$ kubectl get pvc/my-csi-app-inline-volume-my-csi-volume
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
my-csi-app-inline-volume-my-csi-volume   Bound    pvc-c11eb7ab-a4fa-46fe-b515-b366be908823   4Gi        RWO            pmem-csi-sc-late-binding   9m21s

The owner of the PVC is the pod:

$ kubectl get -o yaml pvc/my-csi-app-inline-volume-my-csi-volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: pmem-csi.intel.com
    volume.kubernetes.io/selected-node: pmem-csi-pmem-govm-worker1
  creationTimestamp: "2020-08-11T15:44:57Z"
  finalizers:
  - kubernetes.io/pvc-protection
  managedFields:
    ...
  name: my-csi-app-inline-volume-my-csi-volume
  namespace: default
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: my-csi-app-inline-volume
    uid: 75c925bf-ca8e-441a-ac67-f190b7a2265f
...

As expected, the capacity information for pmem-csi-pmem-govm-worker1 has been updated:

csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 26620Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi

If another application needs more than 26620Mi (the original 30716Mi minus the 4Gi = 4096Mi that was just claimed), the scheduler will no longer consider pmem-csi-pmem-govm-worker1 for placement.

What's next?

Both features are still under development. Several issues were already filed during the alpha review. The enhancement proposals document the work that needs to be done to graduate to beta, as well as which alternatives were considered and rejected:

Source: habr.com
