Fine-tuning routing for MetalLB in L2 mode

Not long ago I ran into a rather non-standard task: setting up routing for MetalLB. Normally MetalLB does not require any additional steps at all, but in our case we have a fairly large cluster with a very simple network configuration.

In this article, I will tell you how to configure source-based and policy-based routing for the external network of your cluster.

I won't go into detail on installing and configuring MetalLB, since I assume you already have some experience with it. Let's get straight down to business, namely configuring the routing. We have four cases:

Case 1: When no configuration is required

Let's take a simple case.


Additional routing configuration is not required when the addresses issued by MetalLB are in the same subnet as the addresses of your nodes.

For example, if you have the subnet 192.168.1.0/24 with a router at 192.168.1.1, and your nodes receive addresses from 192.168.1.10-30, then you can configure MetalLB with the range 192.168.1.100-120 and be sure it will work without any additional configuration.
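
For illustration, here is a minimal sketch of such a pool using MetalLB's legacy ConfigMap format (recent MetalLB releases declare pools with IPAddressPool and L2Advertisement resources instead, so adjust this to whatever your version expects):

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.1.100-192.168.1.120
EOF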

Why is that? Because your nodes already have routes configured:

# ip route
default via 192.168.1.1 dev eth0 onlink 
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.10

Addresses from the same range will simply reuse these routes, with no extra steps required.

Case 2: When additional configuration is required


You need to set up additional routes whenever your nodes have neither an IP address in, nor a route to, the subnet from which MetalLB issues addresses.

Let me explain a little more. Whenever MetalLB announces an address, it can be compared to a simple assignment of the form:

ip addr add 10.9.8.7/32 dev lo

Pay attention to:

  • a) The address is assigned with a /32 prefix, which means no route to its subnet is added automatically (it is just an address).
  • b) The address can be assigned to any interface on the node (for example, loopback). This is where a peculiarity of the Linux networking stack comes in: no matter which interface you add an address to, the kernel will process ARP requests for it and send ARP replies from any of them. This behavior is considered correct and is widely exploited in a dynamic environment like Kubernetes; you can see it in action in the sketch right after this list.
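
You can reproduce point (b) by hand, using the address from the assignment above and the arping utility on another machine in the same L2 segment:

# on the node: assign a /32 to the loopback interface
ip addr add 10.9.8.7/32 dev lo

# from another host on the same segment: the node will answer
# the ARP probe even though the address lives on lo, not eth0
arping -I eth0 10.9.8.7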

This behavior can be tuned, for example, by enabling strict ARP:

echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

In this case, ARP replies are sent only if the interface actually holds the requested IP address. This setting is required if you plan to use MetalLB while your kube-proxy runs in IPVS mode.
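
In practice this is usually switched on in kube-proxy's own configuration rather than through sysctl. On kubeadm-style clusters, where kube-proxy reads its settings from a ConfigMap, the MetalLB documentation suggests a one-liner along these lines (the exact ConfigMap layout may differ in your deployment):

kubectl get configmap kube-proxy -n kube-system -o yaml | \
  sed -e "s/strictARP: false/strictARP: true/" | \
  kubectl apply -f - -n kube-system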

However, MetalLB does not rely on the kernel to process ARP requests; it does this itself in user space, so this option does not affect MetalLB.

Let's return to our task. If your nodes have no route to the subnet of the issued addresses, add it in advance on all nodes:

ip route add 10.9.8.0/24 dev eth1
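
To double-check, you can ask the kernel which route it would pick for a pool address (10.9.8.7 is the hypothetical address from the example above; on a node that does not currently hold the address itself, the output should point at eth1, not at the default gateway):

# ip route get 10.9.8.7
10.9.8.7 dev eth1 src <your node's address on eth1>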

Case 3: When you need source-based routing

You need to configure source-based routing when packets come in through a separate gateway rather than the one configured by default, and the reply packets must go back through that same gateway.

For example, your nodes still sit in the same 192.168.1.0/24 subnet, but you want to hand out external addresses using MetalLB. Suppose you have several addresses from the subnet 1.2.3.0/24, located in VLAN 100, and you want to use them to reach Kubernetes services from the outside.


A client contacting 1.2.3.4 makes requests from a subnet other than 1.2.3.0/24 and expects a response. The node that is currently the leader for the MetalLB-assigned address 1.2.3.4 receives the packet from the router 1.2.3.1, and the reply must necessarily leave by the same route, through 1.2.3.1.

But since our node already has the default gateway 192.168.1.1 configured, the reply would by default go to it, and not to 1.2.3.1, through which we received the packet.

How to deal with this situation?

In this case, you need to prepare all your nodes in advance so that any of them is ready to serve external addresses without further configuration. For the example above, that means creating the VLAN interface on each node beforehand:

ip link add link eth0 name eth0.100 type vlan id 100
ip link set eth0.100 up

And then add routes:

ip route add 1.2.3.0/24 dev eth0.100 table 100
ip route add default via 1.2.3.1 table 100

Note that we add these routes to a separate routing table, 100; it contains only the two routes needed to send a reply packet through the gateway 1.2.3.1, which sits behind the interface eth0.100.
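
You can verify the contents of the new table at any time; for the two commands above the listing should look like this:

# ip route show table 100
default via 1.2.3.1 dev eth0.100
1.2.3.0/24 dev eth0.100 scope link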

Now we need to add a simple rule:

ip rule add from 1.2.3.0/24 lookup 100

which explicitly says: if the source address of a packet is in 1.2.3.0/24, use routing table 100, where we have already described the route that sends it out through 1.2.3.1.
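
You can check the result without sending real traffic: ip route get can simulate the routing decision for a reply leaving 1.2.3.4 towards an external client (203.0.113.7 here is just a hypothetical client address; the exact output format varies between iproute2 versions, but it should show a route via 1.2.3.1):

# ip route get 203.0.113.7 from 1.2.3.4 iif eth0.100
203.0.113.7 from 1.2.3.4 via 1.2.3.1 dev eth0.100 table 100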

Case 4: When you need policy-based routing

The network topology is the same as in the previous example, but suppose you also want to reach the external addresses of the 1.2.3.0/24 pool from your pods.


The catch is that when anything accesses an address in 1.2.3.0/24, the reply packet has a source address in the 1.2.3.0/24 range and, under the rule from the previous case, would be dutifully sent out through eth0.100, whereas we want Kubernetes to route it back to the pod that generated the original request.

Solving this problem was not easy, but it became possible thanks to policy-based routing.

For a better understanding of the process, it helps to keep the general netfilter packet flow diagram in mind.

To begin with, as in the previous example, let's create an additional routing table:

ip route add 1.2.3.0/24 dev eth0.100 table 100
ip route add default via 1.2.3.1 table 100

Now let's add some rules to iptables:

# mark connections whose packets arrive via eth0.100
iptables -t mangle -A PREROUTING -i eth0.100 -j CONNMARK --set-mark 0x100
# copy the connection mark onto each packet of the connection
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
# packets that are already marked need no further mangling
iptables -t mangle -A PREROUTING -m mark ! --mark 0 -j RETURN
# persist the packet mark back into the connection on the way out
iptables -t mangle -A POSTROUTING -j CONNMARK --save-mark

These rules mark connections whose packets arrive on the interface eth0.100, tagging all their packets with the mark 0x100; the same mark is then restored onto the reply packets belonging to the same connection.
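
To confirm that the marking actually works, you can list the connection-tracking entries that carry our mark (this requires the conntrack utility from conntrack-tools):

# conntrack -L --mark 0x100

Every entry listed belongs to a connection that entered through eth0.100.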

Now we can add a routing rule:

ip rule add from 1.2.3.0/24 fwmark 0x100 lookup 100

That is, all packets with a source address in 1.2.3.0/24 and the mark 0x100 must be routed using table 100.

Thus, packets received on other interfaces do not fall under this rule and are routed by the standard Kubernetes mechanisms.
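
As a sanity check, the rule table should now contain our entry (the priority numbers are assigned by the kernel and may differ on your system):

# ip rule show
0:      from all lookup local
32765:  from 1.2.3.0/24 fwmark 0x100 lookup 100
32766:  from all lookup main
32767:  from all lookup default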

There is one more catch: Linux has a so-called reverse path filter, which can spoil the whole setup. It performs a simple check: for every incoming packet, it swaps the packet's source and destination addresses and checks whether the packet could leave through the same interface on which it was received; if not, the packet is filtered out.

The problem is that in our case it will not work correctly, so we simply turn it off:

echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth0.100/rp_filter

Note that the first command controls the global behavior of rp_filter; if it is not disabled, the second command has no effect. The remaining interfaces, however, keep rp_filter enabled.
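
The same can be expressed through sysctl; note the quirk that the dot inside the VLAN interface name has to be written as a slash in the sysctl key:

sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.eth0/100.rp_filter=0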

To avoid disabling the filter entirely, we can use the netfilter implementation of rp_filter. Using the rpfilter iptables module, you can build quite flexible rules, for example:

# skip the reverse path check for traffic destined to the MetalLB pool
iptables -t raw -A PREROUTING -i eth0.100 -d 1.2.3.0/24 -j RETURN
# drop everything else arriving on eth0.100 that fails the reverse path check
iptables -t raw -A PREROUTING -i eth0.100 -m rpfilter --invert -j DROP

These rules enable the reverse path check on the interface eth0.100 for all traffic except traffic destined to 1.2.3.0/24.

Source: habr.com
