Ephemeral volumes with storage capacity tracking: EmptyDir on steroids
Some applications need to store data, but they are perfectly comfortable with the fact that the data will not survive a restart.
For example, caching services are limited by RAM size, but can move rarely used data to storage that is slower than RAM with little impact on overall performance. Other applications expect some read-only input data to be available in files, such as configuration data or secret keys.
Kubernetes already has several types of ephemeral volumes, but their functionality is limited to what is implemented in Kubernetes itself.
Ephemeral CSI volumes already made it possible to extend Kubernetes with CSI drivers that provide lightweight local volumes. They can inject arbitrary local data into a pod: settings, secrets, identification data, variables and so on. CSI drivers have to be modified specifically to support this Kubernetes feature, since regular, standardized drivers will not work, but such volumes are expected to be usable on whatever node is chosen for a pod.
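For reference, a CSI ephemeral inline volume is declared directly in the pod spec. The sketch below is illustrative; the driver name and volume attributes are hypothetical placeholders, not from this article:

```yaml
# Illustrative sketch of a CSI ephemeral inline volume.
# "example.csi.driver" and the attributes are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: csi-inline-demo
spec:
  containers:
  - name: app
    image: busybox
    command: [ "sleep", "100000" ]
    volumeMounts:
    - mountPath: /data
      name: inline-vol
  volumes:
  - name: inline-vol
    csi:
      driver: example.csi.driver   # hypothetical driver name
      volumeAttributes:
        size: 1Gi
```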
This can be a problem for volumes that consume significant node resources, or for storage that is only available on some nodes. Therefore, Kubernetes 1.19 introduces two new alpha features for volumes that are conceptually similar to EmptyDir volumes:
general-purpose ephemeral volumes;
CSI storage capacity monitoring.
Benefits of the new approach:
storage can be local or network connected;
volumes can have a specified size that cannot be exceeded by the application;
works with any CSI driver that supports provisioning persistent volumes and (for capacity tracking) implements the GetCapacity call;
volumes may have some initial data depending on the driver and parameters;
all typical volume operations (creating a snapshot, resizing, etc.) are supported;
volumes can be used with any application controller that accepts a pod specification with an embedded volume specification;
the Kubernetes scheduler chooses suitable nodes itself, so there is no longer any need to install and configure scheduler extenders and mutating webhooks.
Applications
General-purpose ephemeral volumes are therefore suitable for the following use cases:
Persistent memory as a replacement for RAM for memcached
Recent releases of memcached added support for using persistent memory (Intel Optane and the like — translator's note) instead of regular RAM. When deploying memcached through an application controller, you can use general-purpose ephemeral volumes to request a volume of a given size from PMEM via a CSI driver such as PMEM-CSI.
Local LVM storage as workspace
Applications that work with data sets larger than RAM can request local storage with a size or performance that regular Kubernetes EmptyDir volumes cannot provide. TopoLVM, for example, was written for exactly this purpose.
Read-only access for data volumes
Provisioning a volume can result in a volume that already contains data, for example when it is restored from a snapshot or cloned from another volume. Such volumes can then be mounted read-only.
A key element of general-purpose ephemeral volumes is the new volume source, EphemeralVolumeSource, which contains all the fields needed to create a volume claim (historically called a persistent volume claim, PVC). A new controller in kube-controller-manager watches for pods with such a volume source and creates a PVC for each of them. To the CSI driver this claim looks like any other, so no special support is needed there.
As long as such PVCs exist, they can be used like any other volume claim. In particular, they can be referenced as the data source when cloning a volume or creating a snapshot of it. The PVC object also reflects the current state of the volume.
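Since the generated PVC is a regular claim, it can, for instance, be referenced as the source of a volume snapshot. A hedged sketch, using the PVC name from the example later in this article; the snapshot class name is a hypothetical placeholder:

```yaml
# Illustrative sketch: snapshotting the automatically created PVC.
# "example-snapshot-class" is a hypothetical placeholder.
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: my-csi-volume-snapshot
spec:
  volumeSnapshotClassName: example-snapshot-class
  source:
    persistentVolumeClaimName: my-csi-app-inline-volume-my-csi-volume
```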
The names of the automatically created PVCs are deterministic: a combination of the pod name and the volume name, separated by a hyphen. Deterministic names make it easier to interact with the PVC, because it does not have to be looked up if the pod name and volume name are known. The downside is that the name may already be in use; Kubernetes detects this conflict and blocks the pod from starting.
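The naming rule can be sketched in shell, using the pod and volume names from the example later in this article:

```shell
# The controller names the generated PVC "<pod name>-<volume name>".
pod="my-csi-app-inline-volume"
volume="my-csi-volume"
pvc="${pod}-${volume}"
echo "$pvc"   # my-csi-app-inline-volume-my-csi-volume
```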
To make sure the volume is deleted along with the pod, the controller makes the pod the owner of the claim. When the pod is deleted, the regular garbage collection mechanism removes the claim and with it the volume.
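The ownership can be pictured as an owner reference on the generated claim. A sketch only; the uid value is a placeholder for the pod's actual UID:

```yaml
# Illustrative sketch: the generated PVC carries an ownerReference to the pod,
# so deleting the pod garbage-collects the claim (and with it the volume).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-csi-app-inline-volume-my-csi-volume
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: my-csi-app-inline-volume
    uid: <pod-uid>   # placeholder for the real pod UID
    controller: true
    blockOwnerDeletion: true
```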
Claims are matched to a storage driver through the normal storage class mechanism. Although storage classes with both immediate and late binding (also known as WaitForFirstConsumer) are supported, ephemeral volumes are better served by WaitForFirstConsumer: the scheduler can then take both node utilization and storage availability into account when choosing a node. This is where the second new feature comes in.
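Late binding is declared on the storage class via volumeBindingMode. A minimal sketch, assuming the provisioner name of the PMEM-CSI driver used later in this article; adjust for your own driver:

```yaml
# Minimal sketch of a late-binding storage class.
# The provisioner name is assumed to be PMEM-CSI's; substitute your driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-late-binding
provisioner: pmem-csi.intel.com
volumeBindingMode: WaitForFirstConsumer
```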
Storage capacity tracking
The scheduler usually has no idea where the CSI driver will create a volume, and it has no way of contacting the driver directly to ask. So it either polls nodes until it finds one on which the volume can be made available (late binding), or leaves the placement entirely up to the driver (immediate binding).
The new CSIStorageCapacity API, currently in alpha, stores the necessary data in etcd so that it is available to the scheduler. Unlike general-purpose ephemeral volumes, storage capacity tracking must be enabled when the driver is deployed: external-provisioner has to publish the capacity information it receives from the driver through the normal GetCapacity call.
If the scheduler has to pick a node for a pod with an unbound volume that uses late binding, and the driver deployment has opted in by setting the CSIDriver.storageCapacity flag, nodes without enough storage capacity are automatically filtered out. This works for both general-purpose ephemeral and persistent volumes, but not for CSI ephemeral volumes, because Kubernetes cannot interpret their parameters.
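The filtering itself boils down to a comparison: a node stays in consideration only if a published CSIStorageCapacity object for the relevant storage class and topology reports at least the requested size. A shell sketch of that step, using the numbers from the example below:

```shell
# Hypothetical filter step: compare the requested volume size (4Gi)
# with the capacity a node's CSIStorageCapacity object reports (26620Mi).
requested=$((4 * 1024 * 1024 * 1024))   # 4Gi in bytes
available=$((26620 * 1024 * 1024))      # 26620Mi in bytes
if [ "$available" -ge "$requested" ]; then
  echo "node kept"
else
  echo "node filtered out"
fi
```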
As usual, volumes with immediate binding are created before pods are scheduled, and their placement is chosen by the storage driver; the default external-provisioner configuration therefore skips storage classes with immediate binding, since their capacity data would not be used anyway.
Because the Kubernetes scheduler may be working with outdated information, there is no guarantee that the capacity will still be available by the time the volume is created, but late binding increases the chances that it can be created without retries.
Security
CSIStorageCapacity
CSIStorageCapacity objects are namespaced. When deploying each CSI driver in its own namespace, it is recommended to restrict RBAC permissions for CSIStorageCapacity to that namespace, since it is then obvious where the data comes from. Kubernetes does not verify this in any case, and drivers are usually deployed in the same namespace anyway, so in the end drivers are simply trusted not to publish incorrect data.
General Purpose Ephemeral Volumes
If users have the right to create a pod (directly or indirectly), they will also be able to create general-purpose ephemeral volumes, even if they have no right to create a volume claim. That is because the RBAC permission checks are applied to the controller that creates the PVC, not to the user. This has to be taken into account before enabling the feature on clusters where untrusted users are not supposed to be able to create volumes.
Example
A separate branch of PMEM-CSI contains all the changes necessary to bring up a Kubernetes 1.19 cluster inside QEMU virtual machines with both alpha features enabled. The driver code has not changed; only the deployment has.
On a suitable machine (Linux, with Docker usable by a non-root user; see the project documentation for details), these commands bring up the cluster and install the PMEM-CSI driver:
git clone --branch=kubernetes-1-19-blog-post https://github.com/intel/pmem-csi.git
cd pmem-csi
export TEST_KUBERNETES_VERSION=1.19 TEST_FEATURE_GATES=CSIStorageCapacity=true,GenericEphemeralVolume=true TEST_PMEM_REGISTRY=intel
make start && echo && test/setup-deployment.sh
After everything has worked, the output will contain instructions for use:
The test cluster is ready. Log in with [...]/pmem-csi/_work/pmem-govm/ssh.0, run
kubectl once logged in. Alternatively, use kubectl directly with the
following env variable:
KUBECONFIG=[...]/pmem-csi/_work/pmem-govm/kube.config
secret/pmem-csi-registry-secrets created
secret/pmem-csi-node-secrets created
serviceaccount/pmem-csi-controller created
...
To try out the pmem-csi driver ephemeral volumes:
cat deploy/kubernetes-1.19/pmem-app-ephemeral.yaml |
[...]/pmem-csi/_work/pmem-govm/ssh.0 kubectl create -f -
CSIStorageCapacity objects are not meant to be read by humans, so some processing is needed. Filtering with a Go template can restrict the output to a single storage class and show the name, topology and capacity of each object.
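A hedged example of such a query (the field names follow the alpha CSIStorageCapacity API; the exact output depends on the cluster, so none is shown):

```shell
# List CSIStorageCapacity objects for one storage class, printing
# name, topology and capacity. Run against the test cluster.
kubectl get csistoragecapacities --all-namespaces \
  -o go-template='{{range .items}}{{if eq .storageClassName "pmem-csi-sc-late-binding"}}{{.metadata.name}} {{.nodeTopology.matchLabels}} {{.capacity}}
{{end}}{{end}}'
```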
Let's create a demo application with a single general-purpose ephemeral volume. The contents of pmem-app-ephemeral.yaml:
# This example Pod definition demonstrates
# how to use generic ephemeral inline volumes
# with a PMEM-CSI storage class.
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app-inline-volume
spec:
  containers:
  - name: my-frontend
    image: intel/pmem-csi-driver-test:v0.7.14
    command: [ "sleep", "100000" ]
    volumeMounts:
    - mountPath: "/data"
      name: my-csi-volume
  volumes:
  - name: my-csi-volume
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 4Gi
          storageClassName: pmem-csi-sc-late-binding
After creating, as shown in the instructions above, we have an additional pod and PVC:
$ kubectl get pods/my-csi-app-inline-volume -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-csi-app-inline-volume 1/1 Running 0 6m58s 10.36.0.2 pmem-csi-pmem-govm-worker1 <none> <none>
$ kubectl get pvc/my-csi-app-inline-volume-my-csi-volume
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
my-csi-app-inline-volume-my-csi-volume Bound pvc-c11eb7ab-a4fa-46fe-b515-b366be908823 4Gi RWO pmem-csi-sc-late-binding 9m21s
If another application requests more than 26620Mi, the scheduler will no longer consider pmem-csi-pmem-govm-worker1 at all.
What's next?
Both features are still under development. Several issues were already filed during alpha review. The enhancement proposal documents record the work required to graduate to beta, as well as which alternatives were considered and rejected: