Bug 1565332 - Painfully slow DNS name resolution
Summary: Painfully slow DNS name resolution
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 27
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Carlos O'Donell
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-09 20:41 UTC by Arnaud Lacombe
Modified: 2018-04-10 15:14 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 15:14:01 UTC
Type: Bug


Attachments (Terms of Use)

Description Arnaud Lacombe 2018-04-09 20:41:26 UTC
Description of problem:
Any program relying on the system name resolution library is slow.

Version-Release number of selected component (if applicable):

How reproducible: always

Steps to Reproduce:
1. cat /etc/resolv.conf:
nameserver 8.8.8.8
nameserver 4.2.2.2
2. % time ping -c 1 -n google.com
PING google.com (172.217.3.206) 56(84) bytes of data.
64 bytes from 172.217.3.206: icmp_seq=1 ttl=54 time=35.3 ms

--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 35.318/35.318/35.318/0.000 ms
ping -c 1 -n google.com  0.00s user 0.00s system 0% cpu 5.116 total

Actual results: SLOW


Expected results: !SLOW


Additional info:
% sudo tcpdump -ni eth0 port 53
[sudo] password for al:
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp2s0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:37:09.093095 IP 192.168.0.16.47046 > 8.8.8.8.domain: 5425+ A? google.com. (28)
13:37:09.093120 IP 192.168.0.16.47046 > 8.8.8.8.domain: 25235+ AAAA? google.com. (28)
13:37:09.129277 IP 8.8.8.8.domain > 192.168.0.16.47046: 5425 1/0/0 A 216.58.217.46 (44)
13:37:14.097374 IP 192.168.0.16.47046 > 8.8.8.8.domain: 5425+ A? google.com. (28)
13:37:14.133553 IP 8.8.8.8.domain > 192.168.0.16.47046: 5425 1/0/0 A 216.58.217.46 (44)
13:37:14.133623 IP 192.168.0.166.47046 > 8.8.8.8.domain: 25235+ AAAA? google.com. (28)
13:37:14.176215 IP 8.8.8.8.domain > 192.168.0.16.47046: 25235 1/0/0 AAAA 2607:f8b0:400a:809::200e (56)

Comment 1 Arnaud Lacombe 2018-04-09 20:52:35 UTC
Additional info:

Adding `options single-request' to /etc/resolv.conf fixes the issue, however, this is far from ideal in a DHCP setup (ie. with /etc/resolv.conf automatically generated). Also, no, I do *not* have any form of IPv6 connectivity.

Comment 2 Carlos O'Donell 2018-04-10 02:22:39 UTC
(In reply to Arnaud Lacombe from comment #1)
> Additional info:
> 
> Adding `options single-request' to /etc/resolv.conf fixes the issue,
> however, this is far from ideal in a DHCP setup (ie. with /etc/resolv.conf
> automatically generated). Also, no, I do *not* have any form of IPv6
> connectivity.

Could you please explain what you consider slow, with timings as examples, and what you consider to be fast or the fastest possible timing?

From your posted results the 'time' information looks interleaved with the output of ping. Just to give you an example of a clean result:

$ time ping -c 1 -n google.com
PING google.com (172.217.7.142) 56(84) bytes of data.
64 bytes from 172.217.7.142: icmp_seq=1 ttl=48 time=68.3 ms

--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 68.342/68.342/68.342/0.000 ms

real	0m0.153s
user	0m0.001s
sys	0m0.008s

Are you reporting that there is a difference between the result of the ping latency of 63.3ms and the program execution time of 153ms?

Or are you reporting that in your tcpdump that the IPv6 AAAA query is returning only very slowly and after having at least one additional retry resulting in a 5 second delay?

Are you running a local caching daemon? nscd? sssd? dnsmasq? bind?

In general I can't reproduce your results. I'm in North America and the Google servers respond to the A and AAAA query within 3ms of eachother.

I think there is something on your network which is causing the delays with the IPv6 result. For such networks the option 'single-request' is the correct workaround if your ISP has a transparent DNS proxy (mine does not) in place which has a defect in handling both queries in parallel.

In summary:

* Something is odd about your network that causes IPv6 problems. Even if you don't have IPv6 connectivity, the IPv6 answer may still be useful to an application and so it is queried as part of the lookup. You should confirm that you don't have a transparent DNS proxy in the way. Working around that with 'options single-request' should help.

* Setup a local DNS caching daemon and see if it fixes the issue, if it does then there is some problem with DNS on your network. Again you will have to reach out to your ISP.

Setting 'needinfo' since there are a lot of open questions here.

Comment 3 Florian Weimer 2018-04-10 02:47:55 UTC
(In reply to Arnaud Lacombe from comment #0)
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on enp2s0, link-type EN10MB (Ethernet), capture size 262144 bytes
> 13:37:09.093095 IP 192.168.0.16.47046 > 8.8.8.8.domain: 5425+ A? google.com.
> (28)
> 13:37:09.093120 IP 192.168.0.16.47046 > 8.8.8.8.domain: 25235+ AAAA?
> google.com. (28)
> 13:37:09.129277 IP 8.8.8.8.domain > 192.168.0.16.47046: 5425 1/0/0 A
> 216.58.217.46 (44)
> 13:37:14.097374 IP 192.168.0.16.47046 > 8.8.8.8.domain: 5425+ A? google.com.
> (28)

You need to figure out why there isn't a response to the AAAA query from 8.8.8.8.  If “options single-request” fixes this, it's likely a broken middlebox on your end.

Comment 4 Arnaud Lacombe 2018-04-10 04:15:26 UTC
The DNS lookup takes 5s, cf. the tcpdump capture. I'm not running any kind of caching daemon, and did not mess voluntarily with the name resolution. This is "painfully" slow, because any program (curl, wget, ssh) using the system's resolver ends up in this 5s timeout.

As far as I can tell, this issue appeared earlier this year, potentially after upgrading from Fedora 26 to Fedora 27.

What is unclear is *why* is there a 5s delay between the two request, and to some extend, why is there two request to begin with. The first request is clearly being answered and carries a valid response. The only difference I can see is the AAAA request at 13:37:09.093120 is not answered.

Comment 5 Carlos O'Donell 2018-04-10 04:23:05 UTC
(In reply to Arnaud Lacombe from comment #4)
> The DNS lookup takes 5s, cf. the tcpdump capture. I'm not running any kind
> of caching daemon, and did not mess voluntarily with the name resolution.
> This is "painfully" slow, because any program (curl, wget, ssh) using the
> system's resolver ends up in this 5s timeout.
> 
> As far as I can tell, this issue appeared earlier this year, potentially
> after upgrading from Fedora 26 to Fedora 27.
> 
> What is unclear is *why* is there a 5s delay between the two request, and to
> some extend, why is there two request to begin with. The first request is
> clearly being answered and carries a valid response. The only difference I
> can see is the AAAA request at 13:37:09.093120 is not answered.

There is a 5s delay because RES_TIMEOUT is 5s (see timeout:n) by default.

This problem is unlikely to be tied to an F26 to F27 upgrade since none of this particular code has changed in any significant way. If the DNS server does not respond the stub resolver waits, and then retries.

The Google DNS servers are perfectly capable of answering parallel IPv4/IPv6 queries, something on your network (or upstream network) appears to be disrupting the delivery of the IPv6 answer.

Again, there are two requests because it's easy to send both at the same time in parallel because they may be needed, and it ensures IPv6 answers are available if needed. Normally this is not a problem.

Given the current analysis this does not look like a glibc issue.

Comment 6 Arnaud Lacombe 2018-04-10 04:28:17 UTC
and it is not a packet filter issue either, as this issue still happen after running:

iptables -F
iptables -t nat -F
iptables -t mangle -F

Comment 7 Arnaud Lacombe 2018-04-10 05:05:47 UTC
From the sound of it, it would seem I'm hitting this:

resolv/res_send.c:
          /* There are quite a few broken name servers out
             there which don't handle two outstanding
             requests from the same source.  There are also
             broken firewall settings.  If we time out after
             having received one answer switch to the mode
             where we send the second request only once we
             have received the first answer.  */

I guess I'll have to live with `options single-request' :-(

Comment 8 Carlos O'Donell 2018-04-10 15:14:01 UTC
(In reply to Arnaud Lacombe from comment #7)
> From the sound of it, it would seem I'm hitting this:
> 
> resolv/res_send.c:
>           /* There are quite a few broken name servers out
>              there which don't handle two outstanding
>              requests from the same source.  There are also
>              broken firewall settings.  If we time out after
>              having received one answer switch to the mode
>              where we send the second request only once we
>              have received the first answer.  */
> 
> I guess I'll have to live with `options single-request' :-(

Thanks for all of your help looking into this issue. I'm sorry you have to configure with 'options single-request'. There are really very few broken DNS servers which I've ever seen, and so upstream doesn't have any automatic detection, just the switch for a sysadmin to use. I admit it would be interesting to try detect this scenario statistically in realtime, but that's a lot of resources for the non-common case.

I'm marking the issue CLOSED/NOTABUG, but if you have any more feedback or find the issue is not with your network then please feel free to reopen the issue.


Note You need to log in before you can comment on or make changes to this bug.