Fast routing and NAT in Linux

With the exhaustion of IPv4 address space, many telecom operators face the need to provide their customers with Internet access through address translation. In this article, I will show how you can get Carrier Grade NAT performance on commodity servers.

A bit of history

The topic of IPv4 address space exhaustion is not new. At some point waiting lists appeared at RIPE, then exchanges emerged where address blocks were traded and lease deals were made. Gradually, telecom operators began to provide Internet access using address and port translation. Some simply did not manage to obtain enough addresses to give every subscriber a "white" (public) address, while others decided to save money by not buying addresses on the secondary market. Network equipment vendors supported this trend, because this functionality usually requires additional modules or licenses. For example, in Juniper's MX router line (except for the recent MX104 and MX204), NAPT can be performed on a separate MS-MIC service card, Cisco ASR1k requires a CGN license, and Cisco ASR9k needs a separate A9K-ISM-100 module plus an A9K-CGN-LIC license for it. All in all, this pleasure costs a lot of money.

IPTables

The task of performing NAT does not require specialized computing resources; it can be handled by general-purpose processors such as those found in any home router. At the scale of a telecom operator, the problem can be solved with commodity servers running FreeBSD (ipfw/pf) or GNU/Linux (iptables). We will not consider FreeBSD, since I gave up on that OS quite a while ago, so let's stick with GNU/Linux.

Turning on address translation is not difficult at all. First you need to add a rule to the nat table in iptables:

iptables -t nat -A POSTROUTING -s 100.64.0.0/10 -j SNAT --to <pool_start_addr>-<pool_end_addr> --persistent
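
For illustration, here is the same rule with made-up concrete values; eth1 as the outside interface and the 203.0.113.0-203.0.113.31 pool are my own placeholders, not values from a real deployment:

# hypothetical example: CGN clients from 100.64.0.0/10 leave through eth1
iptables -t nat -A POSTROUTING -s 100.64.0.0/10 -o eth1 \
    -j SNAT --to-source 203.0.113.0-203.0.113.31 --persistent

# check the rule and its hit counters
iptables -t nat -L POSTROUTING -n -v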

The operating system will load the nf_conntrack module, which keeps track of all active connections and performs the necessary translations. There are several subtleties here. First, since we are talking about NAT at the scale of a telecom operator, the timeouts need to be tuned; with the default values the translation table quickly grows to a catastrophic size. Below is an example of the settings I used on my servers:

net.ipv4.ip_forward = 1
net.ipv4.ip_local_port_range = 8192 65535

net.netfilter.nf_conntrack_generic_timeout = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 600
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 45
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 60
net.netfilter.nf_conntrack_icmpv6_timeout = 30
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.netfilter.nf_conntrack_checksum = 0
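
One way to apply these settings is to drop them into a file under /etc/sysctl.d/ and reload; the snippet below is only a sketch on my part (the file path is arbitrary) and assumes the nf_conntrack module is already loaded, since the net.netfilter.* keys do not exist before that:

# the module is normally pulled in by the SNAT rule; load it explicitly if needed
modprobe nf_conntrack

# apply the settings saved in a drop-in file (path and name are arbitrary)
sysctl -p /etc/sysctl.d/90-nat-tuning.conf

# spot-check one of the values
sysctl net.netfilter.nf_conntrack_tcp_timeout_established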

Second, since the default size of the translation table is not designed for telecom operator conditions, it needs to be increased:

net.netfilter.nf_conntrack_max = 3145728

You also need to increase the number of buckets for the hash table that stores all translations (this is an option of the nf_conntrack module):

options nf_conntrack hashsize=1572864
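
The options line above goes into a file under /etc/modprobe.d/ so it is picked up on the next module load; on a running system the bucket count can also be changed through the module parameter in sysfs. A quick sketch (the file name is my own choice):

# persist the bucket count for future loads of the module
echo 'options nf_conntrack hashsize=1572864' > /etc/modprobe.d/nf_conntrack.conf

# or change it on the fly without reloading the module
echo 1572864 > /sys/module/nf_conntrack/parameters/hashsize

# keep an eye on table occupancy versus the configured maximum
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max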

After these simple manipulations you get a perfectly working setup that can translate a large number of client addresses into a pool of external ones. However, the performance of this solution leaves much to be desired. In my first attempts at using GNU/Linux for NAT (around 2013) I was able to get about 7Gbit/s at 0.8Mpps per server (Xeon E5-1650v2). Since then many optimizations have been made in the GNU/Linux kernel network stack, and the performance of a single server on the same hardware has grown to almost 18-19 Gbit/s at 1.8-1.9 Mpps (these were the limiting values), but the volume of traffic that a single server needed to process grew much faster. As a result, schemes for balancing the load across different servers were devised, but all of this increased the complexity of setup, maintenance, and keeping up the quality of the service provided.

NFTables

The fashionable direction in software packet processing nowadays is the use of DPDK and XDP. A lot of articles have been written on this topic, many presentations have been given, and commercial products are appearing (for example, SCAT from VasExperts). But given the limited programmer resources of telecom operators, building some kind of in-house solution on top of these frameworks is quite problematic. Operating such a solution later will be much harder; in particular, diagnostic tools would have to be developed, because a regular tcpdump simply does not work with DPDK, and it does not "see" packets sent back to the wire with XDP. Against the background of all the talk about new technologies for moving packet forwarding into user space, the talks and articles by Pablo Neira Ayuso, the iptables maintainer, on developing flow offloading in nftables went somewhat unnoticed. Let's take a closer look at this mechanism.

The main idea is this: if the router has passed packets of a session in both directions of the flow (the TCP session has entered the ESTABLISHED state), then there is no need to push subsequent packets of that session through all the firewall rules, because all those checks will still end with the packet being handed over to routing. In fact, the route lookup does not need to be performed either: we already know the interface and the host to which packets of this session must be sent. All that remains is to store this information and use it for forwarding at an early stage of packet processing. When NAT is performed, the address and port translations made by the nf_conntrack module must be stored as well. Yes, of course, in this case various policers and other informational and statistical rules in iptables stop working, but for a standalone NAT box or, say, a border router this is not so important, because services are distributed across devices.

Configuration

To use this feature we need:

  • Use a fresh kernel. Although the functionality itself appeared back in kernel 4.16, for quite a long time it was very "raw" and regularly caused kernel panics. Everything stabilized around December 2019, when LTS kernels 4.19.90 and 5.4.5 were released.
  • Rewrite the iptables rules in nftables format using a fairly recent version of nftables. It definitely works in version 0.9.0; both prerequisites can be checked as shown below.
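
Both prerequisites are easy to verify; the snippet below is only a sketch and assumes the running kernel's config is exposed under /boot, which is the case on most distributions:

# the flow offload module must be enabled in the kernel config
grep NFT_FLOW_OFFLOAD /boot/config-$(uname -r)

# a sufficiently fresh userspace nft is needed as well
nft --version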

The first point is straightforward; the main thing is not to forget to enable the module when building the kernel (CONFIG_NFT_FLOW_OFFLOAD=m). The second point requires some explanation. nftables rules are described quite differently from iptables. The documentation covers almost everything, and there are also dedicated tools for converting rules from iptables to nftables. Therefore I will only give an example of configuring NAT and flow offload. A small legend for the examples: <i_if>, <o_if> are the network interfaces the traffic passes through (in reality there can be more than two of them); <pool_addr_start>, <pool_addr_end> are the first and last addresses of the range of "white" addresses.

The NAT configuration is very simple:

#! /usr/sbin/nft -f

table nat {
        chain postrouting {
                type nat hook postrouting priority 100;
                oif <o_if> snat to <pool_addr_start>-<pool_addr_end> persistent
        }
}
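
Since the file starts with an nft shebang, it can be made executable and run directly, or fed to nft with -f; the path below is just an example of mine:

# load the ruleset and confirm the table is in place
nft -f /etc/nftables/nat.nft
nft list table ip nat    # "table nat" without a family means the default ip family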

Flow offload is a little more complicated, but quite understandable:

#! /usr/sbin/nft -f

table inet filter {
        flowtable fastnat {
                hook ingress priority 0
                devices = { <i_if>, <o_if> }
        }

        chain forward {
                type filter hook forward priority 0; policy accept;
                ip protocol { tcp , udp } flow offload @fastnat;
        }
}

That is the whole configuration, really. Now all TCP/UDP traffic will be picked up by the fastnat flowtable and processed much faster.
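
To check that offloading actually kicks in, you can look at the conntrack table after pushing some traffic through the box; as far as I know, offloaded connections are marked with an [OFFLOAD] flag there, but treat this check as my assumption rather than part of the original setup:

# confirm the flowtable and the forward rule are loaded
nft list ruleset

# count connections currently handled by the fast path
conntrack -L 2>/dev/null | grep -c OFFLOAD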

The results

To give a sense of how much faster this is, I will attach a screenshot of the load on two real servers with identical hardware (Xeon E5-1650v2), identically configured and running the same Linux kernel, but performing NAT with iptables (NAT4) and with nftables (NAT5).

[Screenshot: CPU load graphs of the NAT4 (iptables) and NAT5 (nftables) servers]

There is no packets-per-second graph in the screenshot, but in the load profile of these servers the average packet size is around 800 bytes, so the values reach up to 1.5Mpps. As you can see, the performance headroom of the server with nftables is huge. Currently this server handles up to 30Gbit/s at 3Mpps and is clearly capable of hitting the physical network limit of 40Gbps while still having spare CPU resources.

I hope this material will be useful to network engineers trying to improve the performance of their servers.

Source: habr.com
