Beware of vulnerabilities that bring workarounds. Part 1: FragmentSmack/SegmentSmack

Hi all! My name is Dmitry Samsonov, I work as a lead system administrator at Odnoklassniki. We have over 7,000 physical servers, 11,000 containers in our cloud, and 200 applications that form 700 different clusters in various configurations. The vast majority of servers are running CentOS 7.
On August 14, 2018, information about the FragmentSmack (CVE-2018-5391) and SegmentSmack (CVE-2018-5390) vulnerabilities was published. Both have a network attack vector and a fairly high score (7.5), threatening denial of service (DoS) through resource (CPU) exhaustion. A kernel fix for FragmentSmack was not available at the time; moreover, it came out much later than the publication of the vulnerability. To eliminate SegmentSmack, it was suggested to update the kernel. The update package itself was released on the same day, so all that remained was to install it.
No, we are not against updating the kernel at all! However, there are nuances…

How we update the kernel in production

In general, nothing complicated:

  1. Download packages;
  2. Install them on a number of servers (including servers hosting our cloud);
  3. Make sure nothing is broken;
  4. Make sure that all standard kernel settings are applied without errors;
  5. Wait a few days;
  6. Check server performance;
  7. Switch the deployment of new servers to the new kernel;
  8. Update all servers by data center (one data center at a time to minimize the impact on users in case of problems);
  9. Restart all servers.

Repeat for all the kernel branches we have. At the moment these are:

  • Stock CentOS 7 3.10 - for most regular servers;
  • Vanilla 4.19 - for our one-cloud cloud, because we need BFQ, BBR, etc.;
  • Elrepo kernel-ml 5.2 - for highly loaded distributors, because 4.19 used to be unstable there, but the same features are needed.

As you might have guessed, rebooting thousands of servers takes the longest. Since not all vulnerabilities are critical for all servers, we reboot only those directly accessible from the Internet. In the cloud, in order not to limit flexibility, we do not pin externally accessible containers to individual servers with a new kernel, but reboot all hosts without exception. Fortunately, the procedure there is simpler than with regular servers. For example, stateless containers can simply move to another server during a reboot.

Nevertheless, it is still a lot of work, and it can take several weeks, and if there are any problems with the new version, up to several months. Attackers are well aware of this, so a plan "B" is needed.

FragmentSmack/SegmentSmack. Workaround

Fortunately, for some vulnerabilities such a plan "B" exists, and it is called a workaround. Most often, this is a change in kernel/application settings that can minimize the possible effect or completely rule out exploitation of the vulnerability.

For FragmentSmack/SegmentSmack, the following workaround was proposed:

«You can change the default values of 4MB and 3MB in net.ipv4.ipfrag_high_thresh and net.ipv4.ipfrag_low_thresh (and their IPv6 counterparts net.ipv6.ip6frag_high_thresh and net.ipv6.ip6frag_low_thresh) to 256 kB and 192 kB respectively, or lower. Tests show a slight to significant drop in CPU usage during an attack, depending on hardware, settings, and conditions. However, there may be some performance impact due to ipfrag_high_thresh=262144 bytes, as only two 64K fragments can fit in the reassembly queue at the same time. For example, there is a risk that applications working with large UDP packets will break».

The kernel documentation describes these parameters as follows:

ipfrag_high_thresh - LONG INTEGER
    Maximum memory used to reassemble IP fragments.

ipfrag_low_thresh - LONG INTEGER
    Maximum memory used to reassemble IP fragments before the kernel
    begins to remove incomplete fragment queues to free up resources.
    The kernel still accepts new fragments for defragmentation.
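
Applying the proposed values boils down to a few sysctl calls; a minimal sketch (runtime only, not persisted):

# when lowering the thresholds, set the low watermark first so that low never exceeds high
# (some kernels reject that combination)
sysctl -w net.ipv4.ipfrag_low_thresh=196608
sysctl -w net.ipv4.ipfrag_high_thresh=262144
sysctl -w net.ipv6.ip6frag_low_thresh=196608
sysctl -w net.ipv6.ip6frag_high_thresh=262144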

We do not have large UDP packets on production services. There is no fragmented traffic on the LAN; there is some on the WAN, but not a significant amount. Nothing foreshadowed trouble - roll out the workaround!

FragmentSmack/SegmentSmack. First blood

The first problem we ran into was that cloud containers sometimes applied the new settings only partially (only ipfrag_low_thresh), and sometimes did not apply them at all - they simply crashed at startup. It was not possible to reproduce the problem reliably (manually, all the settings were applied without any difficulty). Understanding why the container crashes at startup was also not easy: no errors were found. One thing was known for sure: rolling back the settings made the container crashes go away.

Why is it not enough to apply the sysctl on the host? The container lives in its own dedicated network namespace, so at least some of the network sysctl parameters in the container may differ from those on the host.

How exactly are sysctl settings applied in a container? Since our containers are unprivileged, you cannot change a sysctl setting from inside the container itself - there simply aren't enough rights. To run containers, our cloud at that time used Docker (now Podman). The parameters of the new container, including the necessary sysctl settings, were passed to Docker via the API.
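
On the command line, the equivalent of what was passed through the API looks roughly like this (a sketch; the container name and image are placeholders, and the --sysctl flag appeared in Docker versions newer than the 1.10 mentioned below):

docker run --name fragtest \
    --sysctl net.ipv4.ipfrag_low_thresh=196608 \
    --sysctl net.ipv4.ipfrag_high_thresh=262144 \
    centos:7 sleep infinity
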
While going through versions, it turned out that the Docker API did not return all errors (at least in version 1.10). When we tried to start the container via "docker run", we finally saw at least something:

write /proc/sys/net/ipv4/ipfrag_high_thresh: invalid argument docker: Error response from daemon: Cannot start container <...>: [9] System error: could not synchronise with container process.

The parameter value is not valid. But why? And why is it invalid only sometimes? It turned out that Docker does not guarantee the order in which sysctl parameters are applied (the latest version checked was 1.13.1), so sometimes it tried to set ipfrag_high_thresh to 256K while ipfrag_low_thresh was still 3M; that is, the upper limit ended up lower than the lower one, which led to the error.

At that time, we already had our own mechanism for reconfiguring the container after startup (freezing the container via the cgroup freezer and executing commands in the container's network namespace via ip netns), and we added applying the sysctl parameters to this step as well. The problem was solved.
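
A rough sketch of such post-start reconfiguration (the container name is a placeholder, and our cloud uses its own tooling rather than these exact commands):

# expose the container's network namespace to `ip netns`
PID=$(docker inspect --format '{{.State.Pid}}' <container>)
mkdir -p /var/run/netns
ln -sf /proc/$PID/ns/net /var/run/netns/<container>
# apply the thresholds in a safe order: lower the low watermark first
ip netns exec <container> sysctl -w net.ipv4.ipfrag_low_thresh=196608
ip netns exec <container> sysctl -w net.ipv4.ipfrag_high_thresh=262144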

FragmentSmack/SegmentSmack. First Blood 2

Before we had finished dealing with applying the workaround in the cloud, the first rare complaints from users began to arrive. At that point, several weeks had passed since the workaround was rolled out on the first servers. The initial investigation showed that the complaints concerned individual services, and not all servers of those services. The problem once again became extremely vague.

First of all, of course, we tried rolling back the sysctl settings, but this had no effect. Various manipulations with the server and application settings did not help either. A reboot helped. A reboot for Linux is as unnatural as it was the normal way of working with Windows in the old days. Nevertheless, it helped, and we wrote everything off as a "glitch in the kernel" when applying the new sysctl settings. How frivolous that was...

Three weeks later the problem recurred. The configuration of these servers was quite simple: Nginx in proxy/balancer mode. Not much traffic. New input: the number of 504 (Gateway Timeout) errors on clients was growing every day. The graph shows the number of 504 errors per day for this service:

[Graph: 504 errors per day]

All the errors concerned the same backend - the one in the cloud. The graph of memory consumption for packet fragments on this backend looked like this:

[Graph: memory consumption for packet fragments on the backend]

This is one of the most striking manifestations of the problem on the operating system graphs. In the cloud, at exactly the same time, another network problem with QoS (Traffic Control) settings was being fixed. On the graph of memory consumption for packet fragments, it looked exactly the same:

[Graph: memory consumption for packet fragments during the QoS problem]

The assumption was simple: if they look the same on the graphs, then they have the same cause. Moreover, any problems with this type of memory are extremely rare.

The essence of the fixed problem was that we used the fq packet scheduler with default settings in QoS. By default, it allows a single connection to queue 100 packets, and some connections, when the channel was scarce, began to fill the queue to the limit. In that case, packets are dropped. In the tc statistics (tc -s qdisc) it looks like this:

qdisc fq 2c6c: parent 1:2c6c limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028 initial_quantum 15140 refill_delay 40.0ms
 Sent 454701676345 bytes 491683359 pkt (dropped 464545, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  1024 flows (1021 inactive, 0 throttled)
  0 gc, 0 highprio, 0 throttled, 464545 flows_plimit

“464545 flows_plimit” is the number of packets dropped because the per-connection queue limit was exceeded, and “dropped 464545” is the total of all packets dropped by this scheduler. After increasing the queue length to 1,000 and restarting the containers, the problem stopped appearing. You can lean back in your chair and have a smoothie.
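
The change itself is a one-liner; a sketch, assuming fq is attached as the root qdisc of eth0 (device name and limit are illustrative):

tc qdisc replace dev eth0 root fq flow_limit 1000
tc -s qdisc show dev eth0    # flows_plimit and dropped should stop growing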

FragmentSmack/SegmentSmack. Last Blood

First, a few months after the vulnerabilities were announced, a kernel fix for FragmentSmack finally appeared (as a reminder, only a fix for SegmentSmack was released together with the announcement in August), which gave us a chance to abandon the workaround that had caused us quite a lot of trouble. During this time, we had already managed to move some of the servers to the new kernel, and now we had to start over. Why did we update the kernel without waiting for the FragmentSmack fix? The fact is that the process of protecting against these vulnerabilities coincided (and merged) with the process of updating CentOS itself (which takes even more time than updating just the kernel). Besides, SegmentSmack is the more dangerous vulnerability, and a fix for it appeared immediately, so it made sense anyway. However, we could not simply update the kernel on CentOS, because the FragmentSmack vulnerability, which appeared in CentOS 7.5, was only fixed in version 7.6, so we had to stop the upgrade to 7.5 and start all over again with an upgrade to 7.6. That happens too.

Second, rare user complaints about problems came back. Now we knew for sure that all of them were related to uploading files from clients to some of our servers. Moreover, only a very small share of all uploads went through these servers.

As we remember from the story above, rolling back the sysctl did not help. A reboot helped, but only temporarily.
The suspicions about the sysctl were not lifted, but this time it was necessary to collect as much information as possible. We also sorely lacked the ability to reproduce the upload problem on the client in order to study more precisely what was happening.

Analysis of all the available statistics and logs did not bring us any closer to understanding what was going on. There was an acute lack of a way to reproduce the problem in order to "feel" a specific connection. Finally, using a special build of the application, the developers managed to achieve stable reproduction of the problem on a test device connected via Wi-Fi. This was a breakthrough in the investigation. The client connected to Nginx, which proxied to the backend, our Java application.

[Diagram: client → Nginx proxy → Java backend]

The dialogue in the problem case looked like this (as recorded on the Nginx proxy side):

  1. Client: request for information about uploading a file.
  2. Java server: response.
  3. Client: POST with file.
  4. Java server: error.

At the same time, the Java server writes in its log that 0 bytes of data were received from the client, and the Nginx proxy writes that the request took more than 30 seconds (30 seconds is the client application's timeout). Why the timeout, and why 0 bytes? From an HTTP point of view, everything works as it should, but the POST with the file seems to disappear from the network. And it disappears between the client and Nginx. Time to arm ourselves with tcpdump! But first you need to understand the network configuration. The Nginx proxy sits behind an NFware L3 balancer. Tunneling is used to deliver packets from the L3 balancer to the server, and it adds its own headers to the packets:

[Diagram: packet with tunnel headers added]

In addition, traffic arrives at this server VLAN-tagged, which also adds its own fields to the packets:

[Diagram: packet with a VLAN tag added]

And this traffic can also be fragmented (that very small percentage of incoming fragmented traffic that we mentioned in the risk assessment of the workaround), which also changes the contents of the headers:

[Diagram: fragmented packet]

Once again: packets are encapsulated with a VLAN tag, encapsulated in a tunnel, and fragmented. To better understand how this happens, let's trace a packet's route from the client to the Nginx proxy.

  1. The packet arrives at the L3 balancer. For correct routing inside the data center, the packet is encapsulated in a tunnel and sent to the network card.
  2. Since the packet plus the tunnel headers does not fit into the MTU, the packet is cut into fragments and sent to the network.
  3. The switch after the L3 balancer, on receiving a packet, adds a VLAN tag to it and sends it on.
  4. The switch in front of the Nginx proxy sees (from the port settings) that the server expects a VLAN-encapsulated packet, so it sends it as is, without removing the VLAN tag.
  5. Linux takes the fragments of individual packets and glues them back together into single large packets.
  6. The packet then reaches the VLAN interface, where the first layer is removed from it - the VLAN encapsulation.
  7. Linux then passes it to the tunnel interface, where one more layer is removed - the tunnel encapsulation.

The difficulty is passing all of this as parameters to tcpdump.
Let's start from the end: are there clean IP packets from clients (without the extra headers), with the VLAN and tunnel encapsulation removed?

tcpdump host <client ip>

No, there were no such packets on the server. So the problem must occur earlier. Are there packets with only the VLAN encapsulation removed?

tcpdump ip[32:4]=0xx390x2xx

0xx390x2xx is the client's IP address in hex format.
32:4 is the offset and length of the field in which the SRC IP is written in the tunneled packet.

The field offset had to be found by brute force, since the Internet mentions 40, 44, 50, and 54, but the IP address was not at any of them. You can also look at one of the packets in hex (the -xx or -XX option in tcpdump) and work out at which offset the IP you know is located.

Are there packet fragments with the VLAN and tunnel encapsulation still in place?

tcpdump '((ip[6:2] > 0) and (not ip[6] = 64))'
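
In more detail, this is how the filter reads (the byte offsets refer to the IPv4 header):

# ip[6:2]         - the 16-bit Flags + Fragment Offset field
# ip[6:2] > 0     - the MF flag is set and/or the fragment offset is non-zero
#                   (this also matches packets with only DF = 0x4000 set)
# not ip[6] = 64  - excludes packets whose flags byte is exactly 0x40, i.e. only DF set,
#                   leaving genuine fragments, including the last one (MF = 0, offset > 0)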

This magic shows all the fragments, including the last one. The same thing could probably also be filtered by IP, but I did not try, since there are not many such packets, and the ones I needed were easily found in the general stream. Here they are:

14:02:58.471063 In 00:de:ff:1a:94:11 ethertype IPv4 (0x0800), length 1516: (tos 0x0, ttl 63, id 53652, offset 0, flags [+], proto IPIP (4), length 1500)
    11.11.11.11 > 22.22.22.22: truncated-ip - 20 bytes missing! (tos 0x0, ttl 50, id 57750, offset 0, flags [DF], proto TCP (6), length 1500)
    33.33.33.33.33333 > 44.44.44.44.80: Flags [.], seq 0:1448, ack 1, win 343, options [nop,nop,TS val 11660691 ecr 2998165860], length 1448
        0x0000: 0000 0001 0006 00de fb1a 9441 0000 0800 ...........A....
        0x0010: 4500 05dc d194 2000 3f09 d5fb 0a66 387d E.......?....f8}
        0x0020: 1x67 7899 4500 06xx e198 4000 3206 6xx4 [email protected].
        0x0030: b291 x9xx x345 2541 83b9 0050 9740 0x04 .......A...P.@..
        0x0040: 6444 4939 8010 0257 8c3c 0000 0101 080x dDI9...W.......
        0x0050: 00b1 ed93 b2b4 6964 xxd8 ffe1 006a 4578 ......ad.....jEx
        0x0060: 6966 0000 4x4d 002a 0500 0008 0004 0100 if..MM.*........

14:02:58.471103 In 00:de:ff:1a:94:11 ethertype IPv4 (0x0800), length 62: (tos 0x0, ttl 63, id 53652, offset 1480, flags [none], proto IPIP (4), length 40)
    11.11.11.11 > 22.22.22.22: ip-proto-4
        0x0000: 0000 0001 0006 00de fb1a 9441 0000 0800 ...........A....
        0x0010: 4500 0028 d194 00b9 3f04 faf6 2x76 385x E..(....?....f8}
        0x0020: 1x76 6545 xxxx 1x11 2d2c 0c21 8016 8e43 .faE...D-,.!...C
        0x0030: x978 e91d x9b0 d608 0000 0000 0000 7c31 .x............|Q
        0x0040: 881d c4b6 0000 0000 0000 0000 0000 ..............

These are two fragments of the same packet (the same ID 53652) containing a photograph (the word Exif is visible in the first packet). Since packets exist at this level, but never appear in reassembled form in the dumps, the problem is clearly with reassembly. At last there is documented proof of it!

The packet decoder did not reveal any problems preventing reassembly. I tried this one: hpd.gasmi.net. At first, when I tried to cram something into it, the decoder did not like the packet format. It turned out that there were some extra two octets between Srcmac and Ethertype (not related to the fragment information). After removing them, the decoder started working. However, it showed no problems.
Whichever way you look at it, nothing was found apart from those very sysctls. All that remained was to find a way to identify the problematic servers in order to understand the scale and decide on further actions. The right counter was found quickly enough:

netstat -s | grep "packet reassembles failed"

It is also in snmpd under OID=1.3.6.1.2.1.4.31.1.1.16.1 (ipSystemStatsReasmFails).

"The number of failures detected by the IP re-assembly algorithm (for whatever reason: timed out, errors, etc.)".

Among the group of servers on which the problem was being investigated, this counter grew faster on two, slower on two others, and did not grow at all on two more. Comparing the dynamics of this counter with the dynamics of HTTP errors on the Java server revealed a correlation. In other words, the counter was worth putting on monitoring.
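
A minimal sketch of such a check, assuming the net-tools wording of the counter shown above (the sampling interval is arbitrary):

# sample the reassembly-failure counter twice and report the delta
before=$(netstat -s | awk '/packet reassembles failed/ {print $1}')
sleep 60
after=$(netstat -s | awk '/packet reassembles failed/ {print $1}')
echo "packet reassembly failures in the last 60s: $((after - before))"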

Having a reliable indicator of the problem is very important, so that you can accurately determine whether rolling back the sysctl helps, since we know from the previous story that this cannot be seen immediately from the application. This indicator would let us identify all the problem areas in production before users notice them.
After rolling back the sysctl, the monitoring errors stopped, which proved both the cause of the problems and the fact that the rollback helps.

We rolled back the fragmentation settings on the other servers where the new monitoring fired, and in some places we even allocated more memory for fragments than the previous default (this was UDP statistics traffic, whose partial loss was not noticeable against the general background).

The most important questions

Why are packets fragmented on our L3 balancer at all? Most of the packets that arrive from users at the balancers are SYNs and ACKs. These packets are small. But since the share of such packets is very large, against their background we did not notice the presence of large packets that had started to fragment.

The reason was a broken advmss configuration script on servers with VLAN interfaces (at that time there were very few servers with tagged traffic in production). Advmss lets us tell the client that packets in our direction should be smaller, so that after the tunnel headers are glued onto them they do not have to be fragmented.
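
What the fix boils down to, as a sketch (the gateway, interface name, and MSS value are placeholders; the real value must leave room for the tunnel headers):

# advertise a smaller MSS on routes over the VLAN interface so that clients send us
# segments that still fit into the MTU after the L3 balancer adds its tunnel headers
ip route change default via 192.0.2.1 dev vlan100 advmss 1400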

Why didn't the sysctl rollback help, while a reboot did? The sysctl rollback changed the amount of memory available for reassembling packets. At the same time, apparently, the very fact of memory overflowing with fragments slowed connections down, which caused fragments to sit in the queue for a long time. That is, the process looped.
A reboot cleared the memory, and everything returned to normal.

Could we have done without the workaround? Yes, but with a high risk of leaving users without service in the event of an attack. Of course, using the workaround led to various problems, including slowing down one of the services for users, but nevertheless we believe the actions were justified.

Many thanks to Andrey Timofeev (timofeyev) for his assistance in conducting the investigation, and to Alexey Krenev (devicex) for the titanic work of updating CentOS and kernels on the servers - a process that in this case had to be restarted several times, which is why it dragged on for many months.

Source: habr.com
