Amazon EKS Windows in GA with bugs, but the fastest


Good afternoon. I want to share my experience of setting up and using AWS EKS (Elastic Kubernetes Service) for Windows containers, or rather, of the impossibility of using it, and of the bug I found in an AWS system container. If you are interested in this service for Windows containers, read on.

I know that Windows containers are not a popular topic and few people use them, but I decided to write this article anyway: there have already been a couple of articles about Kubernetes and Windows on Habr, so such people do exist.

Background

It all started when our company decided to migrate its services to Kubernetes; they are roughly 70% Windows and 30% Linux. AWS EKS was considered as one of the possible options. Until October 8, 2019, AWS EKS Windows was in Public Preview. I started with that version; it was still running the old Kubernetes 1.11, but I decided to try it anyway and see what state the service was in at all. As it turned out, there was a bug: when pods were added or deleted, the existing ones stopped responding on their internal IPs from the same subnet as the Windows worker node.

So it was decided to drop AWS EKS in favor of our own Kubernetes cluster on plain EC2, where I would have to describe all the load balancing and HA myself via CloudFormation.

Amazon EKS Windows Container Support now Generally Available

by Martin Beeby | on Oct 08, 2019

Before I had time to write the CloudFormation template for that self-managed cluster, I saw the news: Amazon EKS Windows Container Support now Generally Available.

Of course, I put all my other work aside and started looking at what had changed for GA compared to the Public Preview. Credit where it's due: AWS updated the Windows worker node images to version 1.14, and the cluster itself in EKS now supports version 1.14 with Windows nodes. The Public Preview project on GitHub was retired, and the official documentation is now the place to go: EKS Windows Support.

Integrating the EKS cluster into an existing VPC and subnets

In all the sources, both the announcement linked above and the documentation, it is suggested to deploy the cluster either with the dedicated eksctl utility or with CloudFormation plus kubectl afterwards, using only public subnets in Amazon and creating a separate VPC for the new cluster.

This option does not suit many people. First, a separate VPC means extra cost for the VPC itself plus peering traffic to your current VPC. And what about those who already have a working AWS infrastructure with multiple AWS accounts, VPCs, subnets, route tables, a transit gateway and so on? Of course, you do not want to break or redo all of that; you want to integrate the new EKS cluster into the existing network infrastructure, using the existing VPC and, at most, creating a few new subnets dedicated to the cluster.

That is the path I chose: I used an existing VPC and added only two public and two private subnets for the new cluster, taking into account all the rules from the documentation Create your Amazon EKS Cluster VPC.

There was also one more requirement: no worker nodes in public subnets with Elastic IPs.
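Those rules from the documentation mostly come down to tagging the subnets so that EKS and its load balancers can discover them. A minimal sketch with the AWS CLI, assuming the cluster name yyy used below and placeholder subnet IDs (eksctl can also add the cluster tag itself when you pass it existing subnets, so treat this as a reference rather than a required step):

# Every subnet used by the cluster gets the shared-ownership tag
aws ec2 create-tags --resources subnet-xxxxx subnet-xxxxx \
    --tags Key=kubernetes.io/cluster/yyy,Value=shared

# Public subnets: mark them for external load balancers
aws ec2 create-tags --resources subnet-xxxxx \
    --tags Key=kubernetes.io/role/elb,Value=1

# Private subnets: mark them for internal load balancers
aws ec2 create-tags --resources subnet-xxxxx \
    --tags Key=kubernetes.io/role/internal-elb,Value=1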

eksctl vs CloudFormation

Let me say right away that I tried both methods of deploying the cluster, and the picture was the same in both cases.

I will show only the eksctl example, since the code is shorter. Deploying a cluster with eksctl takes three steps:

1. Create the cluster itself plus a Linux worker nodegroup, which will later host the system containers, including the ill-fated vpc-controller.

eksctl create cluster \
  --name yyy \
  --region www \
  --version 1.14 \
  --vpc-private-subnets=subnet-xxxxx,subnet-xxxxx \
  --vpc-public-subnets=subnet-xxxxx,subnet-xxxxx \
  --asg-access \
  --nodegroup-name linux-workers \
  --node-type t3.small \
  --node-volume-size 20 \
  --ssh-public-key wwwwwwww \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 2 \
  --node-ami auto \
  --node-private-networking

To deploy into an existing VPC, it is enough to specify the IDs of your subnets; eksctl will determine the VPC itself.

To make the worker nodes deploy only into the private subnets, specify --node-private-networking for the nodegroup.

2. Install vpc-controller into the cluster. It will then manage our worker nodes, counting the number of free IP addresses and the number of ENIs on each instance, and adding and removing them as needed.

eksctl utils install-vpc-controllers --name yyy --approve
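A quick way to check that the controllers actually came up before moving on (the pod names below are what eksctl installed in my case and may differ between versions):

kubectl get pods -n kube-system
# Expect coredns, aws-node and kube-proxy plus the newly installed
# vpc-resource-controller and vpc-admission-webhook pods in Running state.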

3. Once the system containers have started successfully on the Linux worker node, including vpc-controller, all that remains is to create another nodegroup with Windows workers.

eksctl create nodegroup \
  --region www \
  --cluster yyy \
  --version 1.14 \
  --name windows-workers \
  --node-type t3.small \
  --ssh-public-key wwwwwwwwww \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 2 \
  --node-ami-family WindowsServer2019CoreContainer \
  --node-ami ami-0573336fc96252d05 \
  --node-private-networking

After the node has successfully joined the cluster, everything seems fine: it is in the Ready status. But no.

Error in vpc-controller
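For context, the workload I was trying to schedule was a simple IIS deployment pinned to Windows nodes via a nodeSelector. A minimal sketch of such a manifest (the names and image tag are illustrative, not my exact manifest; on Kubernetes 1.14 the beta.kubernetes.io/os label is still the one to use):

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-server-iis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: windows-server-iis
  template:
    metadata:
      labels:
        app: windows-server-iis
    spec:
      nodeSelector:
        beta.kubernetes.io/os: windows    # schedule only onto Windows nodes
      containers:
      - name: iis
        image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
        ports:
        - containerPort: 80
EOF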

If we try to run pods on the Windows worker node, we get an error:

NetworkPlugin cni failed to teardown pod "windows-server-iis-7dcfc7c79b-4z4v7_default" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address]
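The same can be seen directly on the pod: the label the CNI plugin is complaining about is simply not there (pod name taken from the error above):

kubectl get pod windows-server-iis-7dcfc7c79b-4z4v7 --show-labels
# The vpc.amazonaws.com/PrivateIPv4Address label that vpc-controller is
# supposed to add to the pod is missing.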

If we look deeper, we see that our instance in AWS looks like this:

[screenshot: the instance's network interface without the secondary private IP addresses]

And it should be like this:

[screenshot: the instance's network interface with secondary private IP addresses assigned]

From this it is clear that vpc-controller did not do its part for some reason and could not add new IP addresses to the instance so that pods could use them.
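You can confirm this from the AWS side as well: a query for the secondary private IPs on the node's network interfaces comes back nearly empty (the instance ID is a placeholder):

aws ec2 describe-network-interfaces \
    --filters Name=attachment.instance-id,Values=i-088xxxxx \
    --query 'NetworkInterfaces[].PrivateIpAddresses[].PrivateIpAddress'
# With a working vpc-controller there should be a pool of secondary
# private IPs here for pods; in this case only the primary address shows up.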

Let's dig into the vpc-controller pod logs, and this is what we see:

kubectl logs -n kube-system <vpc-resource-controller pod>

I1011 06:32:03.910140       1 watcher.go:178] Node watcher processing node ip-10-xxx.ap-xxx.compute.internal.
I1011 06:32:03.910162       1 manager.go:109] Node manager adding node ip-10-xxx.ap-xxx.compute.internal with instanceID i-088xxxxx.
I1011 06:32:03.915238       1 watcher.go:238] Node watcher processing update on node ip-10-xxx.ap-xxx.compute.internal.
E1011 06:32:08.200423       1 manager.go:126] Node manager failed to get resource vpc.amazonaws.com/CIDRBlock  pool on node ip-10-xxx.ap-xxx.compute.internal: failed to find the route table for subnet subnet-0xxxx
E1011 06:32:08.201211       1 watcher.go:183] Node watcher failed to add node ip-10-xxx.ap-xxx.compute.internal: failed to find the route table for subnet subnet-0xxx
I1011 06:32:08.201229       1 watcher.go:259] Node watcher adding key ip-10-xxx.ap-xxx.compute.internal (0): failed to find the route table for subnet subnet-0xxxx
I1011 06:32:08.201302       1 manager.go:173] Node manager updating node ip-10-xxx.ap-xxx.compute.internal.
E1011 06:32:08.201313       1 watcher.go:242] Node watcher failed to update node ip-10-xxx.ap-xxx.compute.internal: node manager: failed to find node ip-10-xxx.ap-xxx.compute.internal.

Google searches turned up nothing: apparently nobody had hit this bug yet, or at least nobody had filed an issue about it, so I first had to come up with options myself. The first thought was that perhaps vpc-controller cannot resolve ip-10-xxx.ap-xxx.compute.internal and reach the node, and that is why the errors are thrown.

Indeed, we use custom DNS servers in the VPC and basically do not use the Amazon ones, so forwarding was not even configured for this ap-xxx.compute.internal domain. I checked this option, but it brought no results; perhaps the test was not clean, which is why, when talking to technical support later, I gave in to their version of this idea.

There were not many other ideas: all the security groups had been created by eksctl itself, so there was no doubt they were fine; the route tables were also correct; and NAT, DNS and Internet access from the worker nodes were all in place.

At the same time, if you deploy a worker node into a public subnet, without --node-private-networking, that node is immediately updated by vpc-controller and everything works like clockwork.

There were two options:

  1. Give up and wait until someone reports this bug to AWS and they fix it, and only then use AWS EKS Windows safely. Since it had only just gone GA (8 days had passed at the time of writing), many people would surely follow the same path as me.
  2. Write to AWS Support, explain the essence of the problem with a whole pile of logs from everywhere, and prove to them that their service does not work with your own VPC and subnets. We were paying for Business support, after all, so we should use it at least once 🙂

Communication with AWS engineers

When I created the ticket on the portal, I mistakenly chose to be answered via Web (email or the support center). With that option they may answer after several days, even though my ticket had Severity: System impaired, which is supposed to mean a response within 12 hours, and the Business support plan includes 24/7 support. I hoped for the best, but it turned out as always.

The ticket sat in Unassigned from Friday until Monday, so I decided to write to them again and chose the Chat option. After a short wait, Harshad Madhav was assigned to me, and then it began...

We debugged online for three hours straight: sending logs back and forth, deploying the same cluster in the AWS lab to emulate the problem, recreating the cluster on my side, and so on. The only thing we arrived at was that the logs showed internal AWS domain names failing to resolve, which I wrote about above, and Harshad Madhav asked me to set up forwarding, since we use custom DNS and that could be the problem.

Forwarding

ap-xxx.compute.internal -> 10.x.x.2 (the Amazon-provided DNS inside the VPC CIDR block)
amazonaws.com -> 10.x.x.2 (the Amazon-provided DNS inside the VPC CIDR block)
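A quick check of that forwarding from one of the worker nodes might look like this (a hedged sketch; 10.x.x.2 is the Amazon-provided resolver at the VPC CIDR base + 2, and the hostname is a placeholder):

# Ask our custom DNS (the node's default resolver) for an internal AWS name
nslookup ip-10-xxx.ap-xxx.compute.internal

# Ask the Amazon-provided VPC resolver directly, bypassing the forwarding
nslookup ip-10-xxx.ap-xxx.compute.internal 10.x.x.2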

The forwarding was set up, the day ended, and Harshad Madhav replied asking me to check, saying it should now work. But no, the DNS resolution was not the problem and it did not help.

Then came communication with two more engineers. One simply dropped out of the chat, apparently scared off by a complex case. The second spent another day of mine on a full debug cycle, sending logs and creating clusters on both sides, and in the end he just said: well, it works for me, here is the official documentation, do everything step by step and you will succeed too.

To which I politely asked him to leave and to assign someone else to my ticket if he did not know where to look for the problem.

The resolution

On the third day a new engineer, Arun B., was assigned to me, and from the very beginning of our communication it was clear this was not the previous three engineers. He read the whole history and immediately asked me to collect logs with his own PowerShell (ps1) script from his GitHub. This was followed again by all the iterations of creating clusters, sending command output and collecting logs, but Arun B. was moving in the right direction, judging by the questions he asked me.

When we got to the point of enabling -stderrthreshold=debug in their vpc-controller, what do you think happened? Of course it does not work: the pod simply does not start with that option, only -stderrthreshold=info works.
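One way to toggle that flag is to edit the controller's container args, roughly like this (the deployment name is an assumption based on what eksctl installed for me; adjust to your setup):

kubectl -n kube-system edit deployment vpc-resource-controller
# ...then change the container arg:
#   -stderrthreshold=info  to  -stderrthreshold=debug
# With debug the pod simply would not start; only info worked.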

We stopped there, and Arun B. said he would try to reproduce my steps to get the same error. The next day I got his reply. He had not abandoned the case but had gone through the vpc-controller code and found the exact place where it fails and why:

[screenshot: the vpc-controller code showing where the route table lookup for the subnet fails]

So if your subnets use the VPC's main route table, by default there are no explicit associations between them and the table, and that explicit association is exactly what vpc-controller needs. The public subnet, on the other hand, had a custom route table with an explicit association, which is why it worked.
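This is easy to verify against the EC2 API: for a subnet that is only implicitly associated with the main route table, a lookup by explicit association returns nothing, which matches the "failed to find the route table for subnet" error in the logs (the subnet ID is a placeholder):

aws ec2 describe-route-tables \
    --filters Name=association.subnet-id,Values=subnet-0xxxx
# Returns an empty RouteTables list when the subnet has no explicit
# route table association and only falls back to the VPC's main table.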

After manually adding associations between the main route table and the needed subnets, and recreating the nodegroup, everything works perfectly.
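A sketch of that manual fix with the AWS CLI (all IDs are placeholders; the same can be done in the VPC console under Route Tables, Subnet Associations):

# Find the VPC's main route table
aws ec2 describe-route-tables \
    --filters Name=vpc-id,Values=vpc-0xxxx Name=association.main,Values=true \
    --query 'RouteTables[0].RouteTableId'

# Explicitly associate it with each private subnet used by the cluster
aws ec2 associate-route-table --route-table-id rtb-0xxxx --subnet-id subnet-0xxxx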

I hope that Arun B. will actually report this bug to the EKS developers and we will see a new version of vpc-controller where everything works out of the box. At the time of writing, the latest version, 602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/eks/vpc-resource-controller:0.2.1, still has this problem.

Thanks to everyone who read to the end; test everything you are going to use in production before rolling it out.

Source: habr.com
