Cloudflare DNS over TLS huge query time with unbound

I'm using Cloudflare DNS over TLS with OpenWrt 19.07.1 and unbound 1.9.6-3 and the query time passed from 10/20 msec IPv4/IPv6 with cloudflare standard DNS to more than 120-200 msec with DoT.
why?

root@router:/etc/config# cat unbound 
config unbound
	option add_extra_dns '0'
	option add_local_fqdn '1'
	option add_wan_fqdn '0'
	option dhcp_link 'none'
	option dhcp4_slaac6 '0'
	option dns64 '0'
	option dns64_prefix '64:ff9b::/96'
	option domain 'lan'
	option domain_type 'static'
	option edns_size '1280'
	option extended_stats '0'
	option hide_binddata '1'
	option listen_port '53'
	option localservice '1'
	option manual_conf '0'
	option num_threads '1'
	option protocol 'default'
	option query_minimize '0'
	option query_min_strict '0'
	option rebind_localhost '0'
	option rebind_protection '1'
	option recursion 'default'
	option resource 'default'
	option root_age '9'
	option ttl_min '120'
	option unbound_control '0'
	option validator '0'
	option validator_ntp '1'
	option verbosity '1'
	list trigger_interface 'lan'
	list trigger_interface 'wan'
	#list domain_insecure 'ntp.example.com'

config zone
	option enabled '1'
	option fallback '0'
	option zone_type 'forward_zone'
	option tls_upstream '1'
	option tls_index 'cloudflare-dns.com'
	list zone_name '.'
	list server '1.1.1.1'
	list server '1.0.0.1'
	list server '2606:4700:4700::1111'
	list server '2606:4700:4700::1001'



openx@pc:~$ kdig openwrt.org @1.1.1.1
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 10891
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0
;; QUESTION SECTION:
;; openwrt.org.        		IN	A
;; ANSWER SECTION:
openwrt.org.        	2053	IN	A	139.59.209.225
;; Received 45 B
;; From 1.1.1.1@53(UDP) in 11.3 ms



openx@pc:~$ kdig openwrt.org @2606:4700:4700::1111
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 65179
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0
;; QUESTION SECTION:
;; openwrt.org.        		IN	A
;; ANSWER SECTION:
openwrt.org.        	2101	IN	A	139.59.209.225
;; Received 45 B
;; From 2606:4700:4700::1111@53(UDP) in 20.5 ms



openx@pc:~$ kdig openwrt.org @192.168.1.1
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 10454
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0
;; QUESTION SECTION:
;; openwrt.org.        		IN	A
;; ANSWER SECTION:
openwrt.org.        	2153	IN	A	139.59.209.225
;; Received 45 B
;; From 192.168.2.1@53(UDP) in 120.6 ms



openx@pc:~$ kdig openwrt.org +tls @1.1.1.1
;; TLS session (TLS1.3)-(ECDHE-X25519)-(ECDSA-SECP256R1-SHA256)-(AES-256-GCM)
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 55827
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 1
;; EDNS PSEUDOSECTION:
;; Version: 0; flags: ; UDP size: 1452 B; ext-rcode: NOERROR
;; QUESTION SECTION:
;; openwrt.org.        		IN	A
;; ANSWER SECTION:
openwrt.org.        	1187	IN	A	139.59.209.225
;; Received 56 B
;; From 1.1.1.1@853(TCP) in 18.4 ms



openx@pc:~$ kdig openwrt.org +tls @2606:4700:4700::1111
;; TLS session (TLS1.3)-(ECDHE-X25519)-(ECDSA-SECP256R1-SHA256)-(AES-256-GCM)
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 24209
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 1
;; EDNS PSEUDOSECTION:
;; Version: 0; flags: ; UDP size: 1452 B; ext-rcode: NOERROR
;; QUESTION SECTION:
;; openwrt.org.        		IN	A
;; ANSWER SECTION:
openwrt.org.        	1233	IN	A	139.59.209.225
;; Received 56 B
;; From 2606:4700:4700::1111@853(TCP) in 39.0 ms

It might well have to do with the overhead of establishing a TLS connection and doing encryption. Increase your cache size on dnsmasq and this is mostly taken care of after a while by caching.

2 Likes

Why I should increase cache size on dnsmasq if I'm using unbound as DNS?

# uci show dhcp
dhcp.@dnsmasq[0]=dnsmasq
dhcp.@dnsmasq[0].domainneeded='1'
dhcp.@dnsmasq[0].localise_queries='1'
dhcp.@dnsmasq[0].rebind_protection='1'
dhcp.@dnsmasq[0].rebind_localhost='1'
dhcp.@dnsmasq[0].local='/lan/'
dhcp.@dnsmasq[0].expandhosts='1'
dhcp.@dnsmasq[0].authoritative='1'
dhcp.@dnsmasq[0].readethers='1'
dhcp.@dnsmasq[0].leasefile='/tmp/dhcp.leases'
dhcp.@dnsmasq[0].localservice='1'
dhcp.@dnsmasq[0].interface='lan'
dhcp.@dnsmasq[0].port='0'
dhcp.lan=dhcp
dhcp.lan.interface='lan'
dhcp.lan.start='100'
dhcp.lan.limit='150'
dhcp.lan.leasetime='12h'
dhcp.lan.dhcpv6='server'
dhcp.lan.ra='server'
dhcp.lan.ra_management='1'
dhcp.wan=dhcp
dhcp.wan.interface='wan'
dhcp.wan.ignore='1'
dhcp.odhcpd=odhcpd
dhcp.odhcpd.maindhcp='0'
dhcp.odhcpd.leasefile='/tmp/hosts/odhcpd'
dhcp.odhcpd.leasetrigger='/usr/sbin/odhcpd-update'
dhcp.odhcpd.loglevel='4'

Ah, I use unbound as the upstream resolver for dnsmasq.... never mind. unbound must do some caching though right? I didn't look into it.

A quick update based on my investigations to who may be interested.

There are at least those two problems that have a bad impact on DoT with Unbound and Cloudflare.

1)IPv4 DoT is managed by a closer to me data center than IPv6 DoT, I see this in the colo data from
https://1.1.1.1/cdn-cgi/trace
https://[2606:4700:4700::1111]/cdn-cgi/trace

2)actually unbound create a new connection for each new lookup, there is in plan to include connection reuse in future
https://github.com/NLnetLabs/unbound/issues/47

To mitigate the big query time is possible to increase cache seconds in option ttl_min in /etc/config/unbound

By now I have this ugly performances, average recursion with DoT 0.454024 sec, with DNS over 53 was 0.15 sec

daemon.info unbound: [2122:0] info: server stats for thread 0: 10271 queries, 4891 answers from cache, 5380 recursions, 0 prefetch, 0 rejected by ip ratelimiting
daemon.info unbound: [2122:0] info: server stats for thread 0: requestlist max 15 avg 2.79517 exceeded 0 jostled 0
daemon.info unbound: [2122:0] info: average recursion processing time 0.454024 sec
daemon.info unbound: [2122:0] info: histogram of recursion processing times
daemon.info unbound: [2122:0] info: [25%]=0.219661 median[50%]=0.365625 [75%]=0.588183
daemon.info unbound: [2122:0] info: lower(secs) upper(secs) recursions
daemon.info unbound: [2122:0] info:    0.000000    0.000001 102
daemon.info unbound: [2122:0] info:    0.008192    0.016384 1
daemon.info unbound: [2122:0] info:    0.016384    0.032768 3
daemon.info unbound: [2122:0] info:    0.032768    0.065536 3
daemon.info unbound: [2122:0] info:    0.065536    0.131072 62
daemon.info unbound: [2122:0] info:    0.131072    0.262144 1737
daemon.info unbound: [2122:0] info:    0.262144    0.524288 1981
daemon.info unbound: [2122:0] info:    0.524288    1.000000 1087
daemon.info unbound: [2122:0] info:    1.000000    2.000000 344
daemon.info unbound: [2122:0] info:    2.000000    4.000000 59
daemon.info unbound: [2122:0] info:    4.000000    8.000000 1

PR11416 change anything

1.10.0 is not jet available for the upgrade on mt7621, I'll try when available

Why? What specific change? The change seems to be for unbound-deamon-heavy / libunbound-heavy

The default is 120. Is it really time in seconds? If so, really only 2 minutes?

In unbound.conf are a lot ttls ;- )

It seems the config generated from uci is /overlay/upper/etc/unbound/unbound.conf

cache-min-ttl: 120
cache-max-ttl: 72000

PS: unbound-daemon-heavy

I did some kdig test and if I do again kdig under 120 sec the quick answer come from cache and after 120 sec the long time answer come from recursion, so yes, 120 are seconds and the max is 1800.
https://github.com/openwrt/packages/blob/openwrt-19.07/net/unbound/files/README.md

Are u using unbound-daemon or unbound-daemon-heavy?

unbound-daemon

I wonder how is the logic between cache-min-ttl / cache-max-ttl

Interestingly: I checked my syslog and although the cache-min-ttl is only 120s ...server stats 10093 queries, 9202 answers from cache. More than I expected...cache-max-ttl kicks in?

Curiosity on my part due to what I am seeing here in testing that PR, yes I am using heavy. I forgot the OP was on 19.x from my first read a few days back.

When testing new installs with Unbound, and therefore doing lots of resets, you will experience lag from DoT while the cache fills. However the majority of DNS lookups are repeats so in a long term production installation it doesn't matter. If you set the cache strategy to be "aggressive" in UCI, then Unbound will automatically prefetch popular DNS requests before TTL expire.

cache-min-ttl: is intended to repair poor mans load balancing where the DNS records for multiple servers are rotated with insanely short TTL. See Package README. Nothing should generally change in 2-5 minutes or the internet would break. If the minimum TTL is more than half hour (1800), then when outages happen or major cloud services float to a new data center, you may find yourself stranded during the switch over.