
Today, Wednesday, the next Kubernetes release, 1.16, is out. Following the tradition that has developed on our blog, this is the tenth time we are covering the most significant changes in a new version.
The information used to prepare this material is taken from , as well as related issues, pull requests, and Kubernetes Enhancement Proposals (KEPs). So, let's go!
Nodes
A truly large number of notable innovations (in alpha status) have arrived on the node side of K8s clusters (Kubelet).
First, there are the so-called ephemeral containers (Ephemeral Containers), designed to simplify debugging in pods. The new mechanism lets you run special containers that start in the namespaces of existing pods and live for a short time. Their purpose is to interact with other pods and containers in order to troubleshoot and debug them. A new command, kubectl debug, has been implemented for this feature. It is similar in essence to kubectl exec, except that instead of running a process in a container (as exec does), it starts a container in the pod. For example, this command will attach a new container to a pod:
```
kubectl debug -c debug-shell --image=debian target-pod -- bash
```
Details about ephemeral containers (and examples of how to use them) can be found in . The current implementation (in K8s 1.16) is an alpha version, and among the criteria for its transition to beta is "testing the Ephemeral Containers API for at least 2 releases [of Kubernetes]".
NB: In its essence, and even in its name, the feature resembles an existing plugin that we have already written about. It is assumed that with the advent of ephemeral containers, development of the separate external plugin will stop.
Another innovation is intended to provide a mechanism for accounting for pod overhead, which can vary greatly depending on the runtime used. As an example, the authors cite Kata Containers, which require the guest kernel, the kata agent, an init system, etc. to be running. When the overhead gets this big, it can't be ignored, which means there needs to be a way to take it into account for quota enforcement, scheduling, and so on. To implement this, an Overhead *ResourceList field has been added to PodSpec (it is reconciled with the data in RuntimeClass, if one is used).
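To make the accounting concrete, here is a minimal Go sketch of how a per-pod overhead could be added on top of container requests when checking quota or node capacity. The Resources type and the effectiveRequest function are hypothetical, the numbers are made up, and init containers are ignored for brevity; the real scheduler logic is more involved.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified stand-in for a Kubernetes
// ResourceList: CPU in millicores and memory in bytes.
type Resources struct {
	MilliCPU int64
	Memory   int64
}

// Add returns the element-wise sum of two resource amounts.
func (r Resources) Add(o Resources) Resources {
	return Resources{MilliCPU: r.MilliCPU + o.MilliCPU, Memory: r.Memory + o.Memory}
}

// effectiveRequest sums the requests of the pod's containers and adds the
// runtime overhead (e.g. taken from the pod's RuntimeClass). Init containers
// are deliberately ignored in this sketch.
func effectiveRequest(containers []Resources, overhead Resources) Resources {
	var total Resources
	for _, c := range containers {
		total = total.Add(c)
	}
	return total.Add(overhead)
}

func main() {
	containers := []Resources{
		{MilliCPU: 250, Memory: 128 << 20},
		{MilliCPU: 100, Memory: 64 << 20},
	}
	// Hypothetical overhead for a VM-based runtime such as Kata Containers.
	overhead := Resources{MilliCPU: 250, Memory: 160 << 20}
	fmt.Printf("%+v\n", effectiveRequest(containers, overhead))
}
```

Without the overhead term, a node that looks like it has room for this pod could end up overcommitted once the guest kernel and agent are running.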
Another notable innovation is the node topology manager (Node Topology Manager), designed to unify the approach to fine-grained allocation of hardware resources to various components in Kubernetes. This initiative is driven by the growing need of various modern systems (in telecommunications, machine learning, financial services, etc.) for high-performance parallel computing and minimal latency in executing operations, for which they use advanced CPU and hardware acceleration capabilities. So far, such optimizations in Kubernetes have been achieved by disparate components (CPU manager, Device manager, CNI); now they get a single internal interface that unifies the approach and simplifies plugging in new similar, so-called topology-aware, components on the Kubelet side. Details are in .

Topology Manager Component Diagram
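As an illustration of what a unifying component can do with the information that topology-aware components report, here is a simplified Go sketch that merges NUMA affinity "hints" from several providers: masks are combined with a bitwise AND, preferred placements win, and narrower masks beat wider ones. The Hint type and the mergeHints function are hypothetical and ignore the policies a real topology manager applies on top of such a merge.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Hint is a hypothetical topology hint: a bitmask of NUMA nodes that can
// satisfy a request, plus whether this placement is preferred.
type Hint struct {
	Mask      uint64
	Preferred bool
}

// better reports whether hint a should win over hint b: preferred beats
// non-preferred, and among equals a narrower mask (fewer NUMA nodes) wins.
func better(a, b Hint) bool {
	if a.Preferred != b.Preferred {
		return a.Preferred
	}
	return bits.OnesCount64(a.Mask) < bits.OnesCount64(b.Mask)
}

// mergeHints tries every combination of one hint per provider, ANDs the
// masks together and keeps the best non-empty result.
func mergeHints(providers [][]Hint) (Hint, bool) {
	var best Hint
	found := false
	var walk func(i int, acc Hint)
	walk = func(i int, acc Hint) {
		if i == len(providers) {
			if acc.Mask == 0 {
				return // no NUMA node satisfies every provider
			}
			if !found || better(acc, best) {
				best, found = acc, true
			}
			return
		}
		for _, h := range providers[i] {
			walk(i+1, Hint{Mask: acc.Mask & h.Mask, Preferred: acc.Preferred && h.Preferred})
		}
	}
	walk(0, Hint{Mask: ^uint64(0), Preferred: true})
	return best, found
}

func main() {
	cpu := []Hint{{Mask: 0b01, Preferred: true}, {Mask: 0b10, Preferred: true}, {Mask: 0b11}}
	gpu := []Hint{{Mask: 0b10, Preferred: true}, {Mask: 0b11}}
	best, ok := mergeHints([][]Hint{cpu, gpu})
	fmt.Printf("ok=%v mask=%02b preferred=%v\n", ok, best.Mask, best.Preferred)
}
```

Here the CPU and device providers agree on NUMA node 1, so that single node is chosen as a preferred placement.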
The next feature is a check of containers during their startup (startup probe). As you know, it is difficult to get an up-to-date status for containers that take a long time to launch: they are either "killed" before they actually start functioning, or they end up in a deadlock for a long time. The new check (enabled via a feature gate called StartupProbeEnabled) cancels (or rather, postpones) any other checks until the pod has finished starting. For this reason, the feature was originally called . For pods that take a long time to start, you can now poll the status at relatively short intervals.
In addition, an improvement for RuntimeClass is introduced directly in beta status, adding support for "heterogeneous clusters". Now it is no longer necessary for every node to support every RuntimeClass: you can choose a RuntimeClass for pods without thinking about the cluster topology. Previously, to achieve this (so that pods ended up on nodes supporting everything they needed), you had to set the appropriate rules in NodeSelector and tolerations. The corresponding KEP covers usage examples and, of course, implementation details.
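As we understand the feature, the RuntimeClass itself carries a node selector and tolerations, which get merged into the pod's own scheduling constraints when the pod references that RuntimeClass. The Go sketch below models that merge with hypothetical, simplified types; rejecting a pod whose node selector conflicts with the RuntimeClass is an assumption made for the sketch.

```go
package main

import "fmt"

// Toleration is a simplified, hypothetical toleration.
type Toleration struct {
	Key, Value, Effect string
}

// RuntimeClassScheduling models the scheduling constraints a RuntimeClass
// could carry (hypothetical type).
type RuntimeClassScheduling struct {
	NodeSelector map[string]string
	Tolerations  []Toleration
}

// PodScheduling models the pod's own scheduling constraints.
type PodScheduling struct {
	NodeSelector map[string]string
	Tolerations  []Toleration
}

// applyRuntimeClass merges the RuntimeClass constraints into the pod so the
// scheduler only considers nodes that support that runtime. Conflicting
// node-selector values are rejected instead of silently overwritten.
func applyRuntimeClass(pod PodScheduling, rc RuntimeClassScheduling) (PodScheduling, error) {
	merged := make(map[string]string, len(pod.NodeSelector)+len(rc.NodeSelector))
	for k, v := range pod.NodeSelector {
		merged[k] = v
	}
	for k, v := range rc.NodeSelector {
		if existing, ok := merged[k]; ok && existing != v {
			return pod, fmt.Errorf("conflicting nodeSelector for %q: %q vs %q", k, existing, v)
		}
		merged[k] = v
	}
	out := PodScheduling{NodeSelector: merged}
	out.Tolerations = append(append([]Toleration{}, pod.Tolerations...), rc.Tolerations...)
	return out, nil
}

func main() {
	pod := PodScheduling{NodeSelector: map[string]string{"disk": "ssd"}}
	// Hypothetical label and taint marking nodes that can run Kata Containers.
	rc := RuntimeClassScheduling{
		NodeSelector: map[string]string{"runtime": "kata"},
		Tolerations:  []Toleration{{Key: "runtime", Value: "kata", Effect: "NoSchedule"}},
	}
	out, err := applyRuntimeClass(pod, rc)
	fmt.Println(out.NodeSelector, out.Tolerations, err)
}
```

The user only names the RuntimeClass; the node placement rules travel with it instead of being repeated in every pod spec.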
Network
Two significant networking features that appeared for the first time (in alpha) in Kubernetes 1.16 are:
- dual-stack networking (IPv4/IPv6) and the corresponding "understanding" of it at the level of pods, nodes, and services. This includes IPv4-to-IPv4 and IPv6-to-IPv6 communication between pods and from pods to external services, reference implementations (within the Bridge CNI, PTP CNI and Host-Local IPAM plugins), as well as backward compatibility with Kubernetes clusters running over IPv4 only or IPv6 only. Implementation details are in .
An example of a pod list displaying both types of IP addresses (IPv4 and IPv6):
```
kube-master# kubectl get pods -o wide
NAME               READY     STATUS    RESTARTS   AGE       IP                          NODE
nginx-controller   1/1       Running   0          20m       fd00:db8:1::2,192.168.1.3   kube-minion-1
kube-master#
```
- a new API for endpoints, EndpointSlice. It solves the performance and scalability issues of the existing Endpoints API that affect various control-plane components (apiserver, etcd, endpoints-controller, kube-proxy). The new API will be added to the Discovery API group and will be able to serve tens of thousands of backend endpoints per service in a cluster of a thousand nodes. To achieve this, each Service is mapped to N EndpointSlice objects, each of which holds no more than 100 endpoints by default (the value is configurable). The EndpointSlice API also leaves room for future development: support for multiple IP addresses per pod, new endpoint states (not only Ready and NotReady), and dynamic subsetting of endpoints.
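To illustrate the "one Service, N EndpointSlices" mapping, the following Go sketch splits a service's endpoints into chunks of at most 100. EndpointSlice here is a hypothetical, stripped-down struct rather than the real discovery API type, and a real controller would also try to minimize churn when endpoints change, which this sketch does not attempt.

```go
package main

import "fmt"

// maxEndpointsPerSlice mirrors the default limit of 100 mentioned above;
// in a real cluster this value is configurable.
const maxEndpointsPerSlice = 100

// EndpointSlice is a hypothetical, heavily simplified stand-in for the API
// object: a service name and a list of endpoint addresses.
type EndpointSlice struct {
	Service   string
	Endpoints []string
}

// sliceEndpoints splits a service's endpoints into slices of at most
// maxPerSlice entries, so an update to one endpoint only touches one small
// object instead of a single huge Endpoints resource.
func sliceEndpoints(service string, endpoints []string, maxPerSlice int) []EndpointSlice {
	if maxPerSlice <= 0 {
		maxPerSlice = maxEndpointsPerSlice
	}
	var slices []EndpointSlice
	for start := 0; start < len(endpoints); start += maxPerSlice {
		end := start + maxPerSlice
		if end > len(endpoints) {
			end = len(endpoints)
		}
		slices = append(slices, EndpointSlice{Service: service, Endpoints: endpoints[start:end]})
	}
	return slices
}

func main() {
	var eps []string
	for i := 0; i < 250; i++ {
		eps = append(eps, fmt.Sprintf("10.0.%d.%d", i/256, i%256))
	}
	for i, s := range sliceEndpoints("my-service", eps, maxEndpointsPerSlice) {
		fmt.Printf("slice %d: %d endpoints\n", i, len(s.Endpoints))
	}
}
```

With 250 endpoints this yields three slices (100, 100 and 50 endpoints), so a change to one pod only rewrites one of them.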
The finalizer introduced in the previous release, named service.kubernetes.io/load-balancer-cleanup and attached to every Service of type LoadBalancer, has advanced to beta. When such a service is deleted, the finalizer prevents the actual removal of the resource until the cleanup of all corresponding load balancer resources is complete.
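Finalizers follow a simple rule: a deletion request only marks the object, and it actually disappears once every finalizer has been removed. The Go sketch below models this for the load balancer finalizer; the Service struct, the DeletionRequested flag (standing in for metadata.deletionTimestamp) and the helper functions are hypothetical simplifications.

```go
package main

import "fmt"

const lbCleanupFinalizer = "service.kubernetes.io/load-balancer-cleanup"

// Service is a hypothetical minimal model of an API object with finalizers.
type Service struct {
	Name              string
	Finalizers        []string
	DeletionRequested bool // stands in for metadata.deletionTimestamp
}

// canBeRemoved reports whether the object may actually be deleted: a
// deletion request has been made and no finalizers remain.
func canBeRemoved(s Service) bool {
	return s.DeletionRequested && len(s.Finalizers) == 0
}

// finishCleanup models the controller side: once the cloud load balancer
// has been torn down, it removes its own finalizer and keeps all others.
func finishCleanup(s Service) Service {
	var kept []string
	for _, f := range s.Finalizers {
		if f != lbCleanupFinalizer {
			kept = append(kept, f)
		}
	}
	s.Finalizers = kept
	return s
}

func main() {
	svc := Service{Name: "web", Finalizers: []string{lbCleanupFinalizer}, DeletionRequested: true}
	fmt.Println("removable before cleanup:", canBeRemoved(svc))
	svc = finishCleanup(svc)
	fmt.Println("removable after cleanup:", canBeRemoved(svc))
}
```

This is what prevents cloud load balancers from being leaked when a Service object vanishes before the controller has had a chance to clean up.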
API Machinery
A real "stabilization milestone" has been reached in the area of the Kubernetes API server and interaction with it. This is largely due to the graduation to stable of Custom Resource Definitions (CRD), which need no special introduction and had been in beta since the distant Kubernetes 1.7 (that is, June 2017!). The same stabilization has come to related features:
- subresources with /status and /scale for CustomResources;
- conversion between CRD versions based on an external webhook;
- defaulting and automatic field removal (pruning) for CustomResources, introduced in K8s 1.15;
- using the OpenAPI v3 schema to create and publish OpenAPI documentation used to validate CRD resources on the server side.
Another mechanism long familiar to Kubernetes administrators, which had also been in beta for a long time (since K8s 1.9), is now declared stable.
Two other features have reached beta as well.
And the only significant alpha innovation is the move away from SelfLink, a special URI that represents the given object and is part of ObjectMeta and ListMeta (i.e., part of any object in Kubernetes). Why is it being dropped? The motivation, "in simple terms", is the absence of real (compelling) reasons for this field to still exist. The more formal reasons are to optimize performance (by removing an unnecessary field) and to simplify the work of generic-apiserver, which is forced to handle this field in a special way (it is the only field that is set right before the object is serialized). The actual "deprecation" of SelfLink (reaching beta) will happen by Kubernetes 1.20, and the final removal in 1.21.
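A client that needs such a URI can usually reconstruct it from data it already has. The Go sketch below (the selfLinkFor helper is hypothetical) builds the path: objects of the core group live under /api/<version>, all other groups under /apis/<group>/<version>, and namespaced resources add a namespaces/<namespace> segment.

```go
package main

import (
	"fmt"
	"path"
)

// selfLinkFor rebuilds the URI that SelfLink used to carry from the API
// group, version, resource, namespace and name of an object. Empty
// namespace means a cluster-scoped resource.
func selfLinkFor(group, version, resource, namespace, name string) string {
	var parts []string
	if group == "" {
		parts = append(parts, "/api", version) // core ("legacy") group
	} else {
		parts = append(parts, "/apis", group, version)
	}
	if namespace != "" {
		parts = append(parts, "namespaces", namespace)
	}
	parts = append(parts, resource, name)
	return path.Join(parts...)
}

func main() {
	fmt.Println(selfLinkFor("", "v1", "pods", "default", "nginx"))
	fmt.Println(selfLinkFor("apps", "v1", "deployments", "prod", "web"))
	fmt.Println(selfLinkFor("", "v1", "nodes", "", "node-1"))
}
```

Since the value is fully derivable, storing and populating it on every object buys little.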
Data Storage
As in previous releases, the main work in the storage area is happening around CSI. The main changes here are:
- for the first time (in alpha), support for CSI plugins on Windows worker nodes: the current way of working with storage will replace both the in-tree plugins in the Kubernetes core and Microsoft's PowerShell-based FlexVolume plugins there as well;

Implementation diagram of CSI plugins in Kubernetes for Windows
- another capability, introduced back in K8s 1.12, has grown to beta;
- a similar "promotion" (from alpha to beta) has been achieved by the ability to use CSI to create local ephemeral volumes.
The volume cloning functionality introduced in the previous version of Kubernetes (using existing PVCs as a DataSource to create new PVCs) is now in beta as well.
Scheduler
Two notable scheduling changes (both in alpha):
- the ability to use pods, rather than logical application units (such as Deployment and ReplicaSet), for the "fair distribution" of workloads, and to tune this distribution (as a hard requirement or as a soft condition, i.e., a priority). The feature will extend the existing distribution options for scheduled pods, currently limited to PodAffinity and PodAntiAffinity, giving administrators finer-grained control, which means better high availability and optimized resource consumption. Details are in .
- using the BestFit policy in the RequestedToCapacityRatio priority function during pod scheduling, which will allow bin packing ("packing into containers") for both basic resources (CPU, memory) and extended ones (such as GPUs). For details, see .

Pod scheduling without the best fit policy (directly via the default scheduler) and with it (via a scheduler extender)
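The BestFit idea can be illustrated with a piecewise-linear scoring "shape": the higher a node's utilization after placing the pod, the higher its score, so the scheduler prefers filling nodes up. The Go sketch below uses a hypothetical shape from (0%, 0) to (100%, 10) and a weighted average across resources; the function names, weights and scale are assumptions, not the scheduler's exact code.

```go
package main

import "fmt"

// Point is one vertex of the scoring shape: utilization in percent -> score.
type Point struct{ Utilization, Score float64 }

// bestFit rewards fuller nodes, which leads to bin packing.
var bestFit = []Point{{Utilization: 0, Score: 0}, {Utilization: 100, Score: 10}}

// shapeScore linearly interpolates the score for a utilization value.
// The shape is expected to be sorted by utilization.
func shapeScore(shape []Point, utilization float64) float64 {
	if len(shape) == 0 {
		return 0
	}
	if utilization <= shape[0].Utilization {
		return shape[0].Score
	}
	for i := 1; i < len(shape); i++ {
		if utilization <= shape[i].Utilization {
			a, b := shape[i-1], shape[i]
			return a.Score + (b.Score-a.Score)*(utilization-a.Utilization)/(b.Utilization-a.Utilization)
		}
	}
	return shape[len(shape)-1].Score
}

// ResourceUse holds the requested amount (including the new pod), the
// node's capacity and the resource's weight in the final score.
type ResourceUse struct{ Requested, Capacity, Weight float64 }

// nodeScore is the weighted average of per-resource scores.
func nodeScore(shape []Point, resources []ResourceUse) float64 {
	var sum, weights float64
	for _, r := range resources {
		if r.Capacity <= 0 {
			continue // resource not present on this node
		}
		sum += r.Weight * shapeScore(shape, 100*r.Requested/r.Capacity)
		weights += r.Weight
	}
	if weights == 0 {
		return 0
	}
	return sum / weights
}

func main() {
	// Hypothetical nodes: CPU cores and GPUs requested vs capacity.
	nodeA := []ResourceUse{{Requested: 3, Capacity: 4, Weight: 1}, {Requested: 1, Capacity: 2, Weight: 1}}
	nodeB := []ResourceUse{{Requested: 1, Capacity: 4, Weight: 1}, {Requested: 1, Capacity: 2, Weight: 1}}
	fmt.Printf("node A: %.2f, node B: %.2f\n", nodeScore(bestFit, nodeA), nodeScore(bestFit, nodeB))
}
```

The busier node A scores higher, so new pods are packed onto it and node B stays free for larger workloads such as GPU jobs.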
Additionally, it has become possible to create your own scheduler plugins outside the main Kubernetes development tree (out-of-tree).
Other changes
Also worth noting in Kubernetes 1.16 is the initiative to bring the available metrics into full order, or more precisely, into compliance with the guidelines for K8s instrumentation. These guidelines largely rely on the corresponding Prometheus documentation. Inconsistencies arose for various reasons (for example, some metrics were simply created before the current guidelines appeared), and the developers decided it was time to bring everything to a single standard, "in line with the rest of the Prometheus ecosystem." The current implementation of this initiative is in alpha and will be progressively promoted to beta (1.17) and stable (1.18) in future Kubernetes releases.
In addition, the following changes can be noted:
- further development of Windows support: the kubeadm utility for this OS (alpha), RunAsUserName for Windows containers (alpha), Group Managed Service Account (gMSA) support promoted to beta, and mount/attach support for vSphere volumes;
- a data compression mechanism for API responses. Previously, an HTTP filter was used for this purpose, which imposed a number of restrictions that prevented it from being enabled by default. Now "transparent request compression" works: clients that send Accept-Encoding: gzip in the header receive a GZIP-compressed response if its size exceeds 128 KB. Go clients support compression automatically (they send the right header), so they will immediately notice reduced traffic. (Slight modifications may be required for other languages.)
- HPA scaling from/to zero pods based on external metrics. If scaling is based on object/external metrics, then when workloads are idle, you can automatically scale down to 0 replicas to save resources. This feature should be especially useful when workers request GPU resources and the number of different types of idle workers exceeds the number of available GPUs;
- a new client for "generalized" access to objects. It is designed to easily retrieve metadata (i.e., the metadata subsection) from cluster resources and to perform garbage-collection and quota-related operations with it;
- building Kubernetes without the legacy ("built-in", in-tree) cloud providers (alpha);
- the kubeadm utility has gained an experimental (alpha) ability to apply kustomize patches during init, join and upgrade operations. For more on using the --experimental-kustomize flag, see ;
- a new endpoint for apiserver that lets it export information about its readiness. The API server has also gained a --maximum-startup-sequence-duration flag, which allows you to control its restarts;
- two Azure features have been declared stable: support for Availability Zones and for resource groups (RG). In addition, Azure has gained:
- support for AAD and ADFS;
- the service.beta.kubernetes.io/azure-pip-name annotation to specify the public IP of the load balancer;
- the LoadBalancerName and LoadBalancerResourceGroup settings;
- AWS has gained support for EBS on Windows, and its DescribeInstances EC2 API calls have been optimized;
- kubeadm now migrates the CoreDNS configuration on its own when upgrading the CoreDNS version;
- the etcd binaries in the corresponding Docker image have been made world-executable, which allows you to run this image without root privileges. Also, the etcd migration image no longer supports the etcd2 version;
- another component has switched to distroless as its base image, improved its performance, and added new cloud providers (DigitalOcean, Magnum, Packet);
- updates to bundled/dependent software: Go 1.12.9, etcd 3.3.15, CoreDNS 1.6.2.
Source: habr.com


