Stop Using Ridiculously Small TTL for DNS

Low DNS latency is a key factor for a fast Internet experience. To minimize it, it is important to carefully select DNS servers and anonymization relays. But the first step is to get rid of useless queries.

This is why DNS was originally designed as a heavily cacheable protocol. Zone administrators set the time-to-live (TTL) for individual records, and resolvers use this information when storing records in memory to avoid unnecessary traffic.

Is caching effective? A couple of years ago, a quick study of mine showed that it was far from perfect. Let's take a look at the current state of affairs.

To collect data, I patched Encrypted DNS Server to record, for each incoming query, the TTL of the response, defined as the minimum TTL of its records. This gives a good overview of the TTL distribution of real traffic and also accounts for the popularity of individual queries. The patched version of the server ran for several hours.
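
The patch itself isn't reproduced here, but the idea is easy to sketch. Here is a minimal, hypothetical Python version (using the third-party dnspython library; the real measurement lives inside the server) that computes the effective TTL of a response as the minimum TTL of its record sets:

import time
import dns.resolver  # third-party package: dnspython

def effective_ttl(name, qtype="A"):
    """Effective TTL of a response: the minimum TTL across all record sets
    in the answer section (so a short-lived CNAME drags the whole chain down)."""
    answer = dns.resolver.resolve(name, qtype)
    return min(rrset.ttl for rrset in answer.response.answer)

# Each observation is then logged as a (name, qtype, ttl, timestamp) tuple.
print(effective_ttl("raw.githubusercontent.com"), int(time.time()))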

The resulting dataset consists of 1.5 million (name, qtype, TTL, timestamp) records. Here is the overall TTL distribution (the X-axis is the TTL in seconds):

[Chart: overall TTL distribution]

Aside from a minor bump at 86400 (mostly SOA records), it's fairly obvious that TTLs are in the low range. Let's take a closer look:

[Chart: TTL distribution, closer view]

Okay, TTLs over 1 hour are statistically insignificant. Let's focus on the 0–3600 second range:

[Chart: TTL distribution, 0–3600 seconds]

Most TTLs are in the 0 to 15 minute range:

[Chart: TTL distribution, 0–900 seconds]

The vast majority are between 0 and 5 minutes:

[Chart: TTL distribution, 0–300 seconds]

It's not very good.

The cumulative distribution makes the problem even more obvious:

[Chart: cumulative TTL distribution]

Half of the DNS responses have a TTL of 1 minute or less, and three-quarters have a TTL of 5 minutes or less.

But wait, it's actually worse than that. These are the TTLs served by authoritative servers. Client resolvers (e.g., routers, local caches), however, receive TTLs that upstream resolvers have already started counting down, and the remaining TTL decreases every second.

So, on average, a client can use a cached entry only for about half of its original TTL before it has to send a new query.
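
This halving is easy to convince yourself of: if a client queries a shared cache at a uniformly random moment within the record's lifetime, the expected remaining TTL is half of the original. A tiny simulation (the 300 second TTL is just an example):

import random

ttl = 300  # original TTL set by the zone administrator, in seconds
remaining = [random.uniform(0, ttl) for _ in range(100_000)]
print(sum(remaining) / len(remaining))  # ≈ 150, i.e. half the original TTL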

Maybe these very low TTLs only affect obscure queries, not popular websites and APIs? Let's take a look:

[Chart: TTL vs. query popularity]

The x-axis is TTL, the y-axis is query popularity.

Unfortunately, the most popular queries are also the worst cached.

Zooming in:

[Chart: TTL vs. query popularity, zoomed in]

Verdict: really bad. It was bad before, and it has gotten worse. DNS caching has become nearly useless. As fewer people use their ISP's DNS resolver (for good reason), the added latency becomes more noticeable.

DNS caching has become useful only for content that no one visits.

Also note that software may interpret low TTLs differently.

Why is that?

Why are DNS records set to such a low TTL?

  • Legacy load balancers are left with default settings.
  • There is a myth that DNS-based load balancing depends on TTLs (it does not: ever since Netscape Navigator, clients have picked a random IP address from the RR set and transparently tried another one if they cannot connect; see the sketch after this list).
  • Administrators want changes to be applied immediately, because it makes planning easier.
  • The administrator of a DNS server or load balancer sees their job as efficiently deploying the configuration users ask for, not as making sites and services faster.
  • Low TTLs give peace of mind.
  • People initially set low TTLs for testing and then forget to change them.
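
To illustrate the load-balancing point from the list above, here is a minimal sketch (the function name and parameters are made up for illustration) of what clients have done for decades: resolve the RR set, pick an address at random, and transparently fall back to the next one if the connection fails:

import random
import socket

def connect_any(host, port=443, timeout=3.0):
    """Try the resolved addresses in random order until one accepts a connection."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    random.shuffle(infos)
    last_err = None
    for _family, _type, _proto, _canonname, sockaddr in infos:
        try:
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as err:
            last_err = err  # this address failed, try the next one in the set
    raise last_err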

I didn't include "failover" in this list, as it has become less and less relevant. If you need to redirect users to another network just to display an error page when absolutely everything else is broken, a delay of more than 1 minute is probably acceptable.

In addition, a 1-minute TTL means that if the authoritative DNS servers are unreachable for more than 1 minute, nobody can reach the dependent services any more. And redundancy won't help if the cause is a configuration error or a hack. On the other hand, with reasonable TTLs, many clients will keep using the previous configuration and never notice anything.

CDN services and load balancers are largely to blame for low TTLs, especially when they combine CNAMEs with small TTLs and records with similarly small (but independent) TTLs:

$ drill raw.githubusercontent.com
raw.githubusercontent.com.	9	IN	CNAME	github.map.fastly.net.
github.map.fastly.net.	20	IN	A	151.101.128.133
github.map.fastly.net.	20	IN	A	151.101.192.133
github.map.fastly.net.	20	IN	A	151.101.0.133
github.map.fastly.net.	20	IN	A	151.101.64.133

Whenever the CNAME or any of the A records expires, a new query has to be sent. Both have a 30 second TTL, but they are not in phase, so the actual average TTL will be 15 seconds.
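
A back-of-the-envelope way to check the 15 second figure: a new upstream query is needed every time either record expires, so what matters is the average spacing between expiry events. A small sketch, with the phase offset between the two records chosen at random for illustration:

import random

def mean_interval_between_expiries(ttl=30, periods=10_000):
    """Average gap between expiry events of two records with the same TTL
    but independent (out-of-phase) expiry times."""
    offset = random.uniform(0, ttl)  # phase difference between the CNAME and the A records
    expiries = sorted(
        [k * ttl for k in range(periods)] +           # CNAME expiry times
        [k * ttl + offset for k in range(periods)]    # A record expiry times
    )
    gaps = [b - a for a, b in zip(expiries, expiries[1:])]
    return sum(gaps) / len(gaps)

print(round(mean_interval_between_expiries(), 1))  # ≈ 15.0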

But wait, it gets even worse. Some resolvers behave quite badly when faced with two associated low TTLs:

$ drill raw.githubusercontent.com @4.2.2.2
raw.githubusercontent.com.	1	IN	CNAME	github.map.fastly.net.
github.map.fastly.net.	1	IN	A	151.101.16.133

The Level3 resolver probably runs BIND. If you keep sending this query, it will always return a TTL of 1. Essentially, raw.githubusercontent.com is never cached.

Here is another example of such a situation with a very popular domain:

$ drill detectportal.firefox.com @1.1.1.1
detectportal.firefox.com.	25	IN	CNAME	detectportal.prod.mozaws.net.
detectportal.prod.mozaws.net.	26	IN	CNAME	detectportal.firefox.com-v2.edgesuite.net.
detectportal.firefox.com-v2.edgesuite.net.	10668	IN	CNAME	a1089.dscd.akamai.net.
a1089.dscd.akamai.net.	10	IN	A	104.123.50.106
a1089.dscd.akamai.net.	10	IN	A	104.123.50.88

No fewer than three CNAME records are involved. Ouch. One of them has a decent TTL, but it's completely useless. The other CNAMEs have an original TTL of 60 seconds, the akamai.net names have a maximum TTL of 20 seconds, and none of them are in phase.

What about domains that Apple devices poll constantly?

$ drill 1-courier.push.apple.com @4.2.2.2
1-courier.push.apple.com.	1253	IN	CNAME	1.courier-push-apple.com.akadns.net.
1.courier-push-apple.com.akadns.net.	1	IN	CNAME	gb-courier-4.push-apple.com.akadns.net.
gb-courier-4.push-apple.com.akadns.net.	1	IN	A	17.57.146.84
gb-courier-4.push-apple.com.akadns.net.	1	IN	A	17.57.146.85

The same problem as with Firefox, and the TTL will be stuck at 1 second most of the time when using the Level3 resolver.

What about Dropbox?

$ drill client.dropbox.com @8.8.8.8
client.dropbox.com.	7	IN	CNAME	client.dropbox-dns.com.
client.dropbox-dns.com.	59	IN	A	162.125.67.3

$ drill client.dropbox.com @4.2.2.2
client.dropbox.com.	1	IN	CNAME	client.dropbox-dns.com.
client.dropbox-dns.com.	1	IN	A	162.125.64.3

The safebrowsing.googleapis.com record has a 60 second TTL, as do the Facebook domains. And, once again, from the client's point of view these values should be halved.

How about setting the minimum TTL?

Using the name, query type, TTL, and the timestamp at which each record was initially stored, I wrote a script to simulate the 1.5 million queries going through a caching resolver and estimate how many extra queries were sent because a cache entry had expired.
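
The original script isn't reproduced here, but a minimal sketch of the same idea might look like this, assuming the dataset is an iterable of (name, qtype, ttl, timestamp) tuples sorted by timestamp; the optional min_ttl parameter applies the clamping discussed below:

def expired_hit_ratio(records, min_ttl=0):
    """Share of queries for which a cached entry existed but had already expired.

    records: iterable of (name, qtype, ttl, timestamp) tuples, sorted by timestamp.
    min_ttl: clamp applied to every TTL before caching (0 means no clamping).
    """
    expires = {}       # (name, qtype) -> expiry time of the cached entry
    expired_hits = 0
    total = 0
    for name, qtype, ttl, ts in records:
        key = (name, qtype)
        total += 1
        if key in expires and expires[key] <= ts:
            expired_hits += 1           # the entry was there, but too old
        if key not in expires or expires[key] <= ts:
            expires[key] = ts + max(ttl, min_ttl)
    return expired_hits / total

# e.g. expired_hit_ratio(dataset) vs. expired_hit_ratio(dataset, min_ttl=300)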

47.4% of queries were made while an existing record had already expired. This is unreasonably high.

What will be the impact on caching if the minimum TTL is set?

[Chart: share of queries hitting an expired cache entry vs. minimum TTL enforced on the resolver]

The x-axis is the minimum TTL values. Records with original TTLs above this value are not affected.

The y-axis is the percentage of queries from a client that already has a cached entry, but the entry has expired, so a new query has to be sent.

Simply setting the minimum TTL to 5 minutes reduces the share of these "extra" queries from 47% to 36%. A minimum TTL of 15 minutes reduces them to 29%, and a minimum TTL of 1 hour to 17%. A significant difference!

How about not changing anything on the server side, but instead setting minimum TTLs in client DNS caches (routers, local resolvers)?

[Chart: share of queries hitting an expired cache entry vs. minimum TTL enforced in client caches]

The required queries drop from 47% to 34% with a minimum TTL of 5 minutes, to 25% with a minimum of 15 minutes, and to 13% with a minimum of 1 hour. 40 minutes is perhaps the optimal value.

The impact of this minimal change is enormous.

What are the consequences?

Of course, a service can be moved to a new cloud provider, a new server, or a new network, which requires clients to use up-to-date DNS records, and a reasonably low TTL helps make such a transition smooth and unnoticeable. But when moving to new infrastructure, nobody expects clients to migrate to the new DNS records within 1 minute, 5 minutes, or 15 minutes. Setting the minimum TTL to 40 minutes instead of 5 minutes will not prevent users from accessing the service.

However, it will significantly reduce latency and improve privacy and reliability by avoiding unnecessary queries.

Of course, the RFCs say that TTL must be strictly enforced. But the reality is that the DNS system has become too inefficient.

If you are running authoritative DNS servers, please check your TTLs. Do you really need such ridiculously low values?

Of course, there are good reasons to set low TTLs on some DNS records. But not for the 75% of DNS traffic whose records remain virtually unchanged.

And if for some reason you really do need low DNS TTLs, make sure that caching isn't enabled on your site at the same time, for the same reasons.

If you run a local DNS cache, such as dnscrypt-proxy, that lets you set a minimum TTL, use that feature. It's fine; nothing bad will happen. Set the minimum TTL to somewhere between roughly 40 minutes (2400 seconds) and 1 hour. That's a perfectly reasonable range.

Source: habr.com