Bug 471450

Summary: Occasional failure on lookup when empty AAAA response comes before good A response
Product: [Fedora] Fedora
Reporter: Mads Kiilerich <mads>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: medium
Version: 10
CC: drepper, jakub, k.georgiou, sacredfox, tim, vanmeeuwen+fedora
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-12-08 22:23:28 UTC Type: ---
Attachments:
dns lookup test program (flags: none)

Description Mads Kiilerich 2008-11-13 19:07:49 UTC
Description of problem:

Initial dns lookups fail. I most often see it with ssh (run from mercurial on an ssh:// address).

I caught it on a wireshark trace.

ssh sends an A and an AAAA query, and it (only) gets a response for AAAA containing only a SOA, and that causes "ssh: Could not resolve hostname x: Name or service not known".

Usually, when the lookup succeeds, I can see that the A response with the right address arrives before the (same) AAAA response. So the error comes when the A response is delayed or lost.

I would expect glibc to somehow resend the A query when it doesn't get a response to it.

No specific ipv6 configuration has been made. The setup might be broken, just like all other places where there is partial but unused ipv6 support. I would expect it to work with ipv4 anyway.

The nameserver configured in /etc/resolv.conf (and responding) is on LAN and running windows and probably buggy, but it worked with f9, so this looks like a regression.


Version-Release number of selected component (if applicable):

glibc-2.8.90-16.i686


How reproducible:

A couple of times a day. Often enough to be very annoying, but seldom enough to be hard to reproduce ...


Additional info:

It could perhaps be a variant of Bug 460561 or Bug 469299?

Comment 1 Mads Kiilerich 2008-11-13 20:28:52 UTC
OpenSSH does

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = family;
        hints.ai_socktype = SOCK_STREAM;
        snprintf(strport, sizeof strport, "%u", port);
        if ((gaierr = getaddrinfo(host, strport, &hints, &aitop)) != 0)
                fatal("%s: Could not resolve hostname %.100s: %s", __progname,
                    host, ssh_gai_strerror(gaierr));

and is thus not following Ulrich's advice in Bug 459756#24 - can that explain this?

Perhaps a bug should be filed against openssh? But for now, for f10: openssh apparently worked before, and it would probably be easier to put a workaround in glibc than to fix all applications which have another (possibly wrong) opinion on how to do name resolution...

Comment 2 Ulrich Drepper 2008-11-14 17:24:07 UTC
ssh certainly should use AI_ADDRCONFIG.  It should still work.

You said you captured the traffic.  Try to compile a little program with essentially the code in comment #1, run it under strace, capture DNS traffic using wireshark.
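A minimal sketch of such a test program, assuming nothing beyond the standard getaddrinfo() API. The helper name lookup() and its parameters are made up for illustration; it is not the attached test program. It fills in the hints the same way as the snippet in comment #1, but lets the caller pick the address family and pass extra flags such as AI_ADDRCONFIG:

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: resolve `host` with the same hints as the openssh
 * snippet in comment #1, plus caller-chosen family and ai_flags.
 * Prints and returns the getaddrinfo() result code (0 on success).
 */
int lookup(const char *host, int family, int flags)
{
    struct addrinfo hints, *res = NULL;
    int gaierr;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;        /* AF_INET, AF_INET6 or AF_UNSPEC */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = flags;          /* e.g. 0 or AI_ADDRCONFIG */

    gaierr = getaddrinfo(host, NULL, &hints, &res);
    printf("getaddrinfo(\"%s\", family=%d, flags=0x%x) = %d (%s)\n",
           host, family, (unsigned)flags, gaierr,
           gaierr ? gai_strerror(gaierr) : "ok");

    if (gaierr == 0)
        freeaddrinfo(res);
    return gaierr;
}
```

Calling lookup("hg", AF_UNSPEC, 0) and lookup("hg", AF_UNSPEC, AI_ADDRCONFIG) from a small main() under strace, with wireshark capturing, should show whether the flag changes which queries are sent.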

Comment 3 Mads Kiilerich 2008-11-21 20:53:19 UTC
Created attachment 324345 [details]
dns lookup test program

That was hard to reproduce. "My" DNS sometimes answers AAAA before A, but I can't reproduce it on command.

As a workaround I use an ugly hack: I use iptables to drop some of the UDP responses, matching on packet length, so that the A answer is dropped completely while AAAA responses come through:
-A INPUT -p udp -m length --length 77 -j LOG --log-prefix "dropping "
-A INPUT -p udp -m length --length 77 -j DROP
-A INPUT -p udp -j LOG --log-prefix "not dropping "

With that I can reproduce the ssh behaviour I mentioned - and do the following.


I run the attached test program with hg as parameter. resolv.conf has "domain dadomain.com" and "search dadomain.com" and "nameserver 192.168.45.13".

When using ai_family = AF_INET it sends a request for A twice, waiting in poll for 5 s each time, and then it fails with -2 (EAI_NONAME?).

With ai_family = AF_UNSPEC it sends a request for A and a request for AAAA, gets the AAAA response (with the SOA of the nameserver as authoritative ns for the domain), waits in poll for 5 s, and then it fails the same way:

getaddrinfo("hg", "(null)", {ai_family=0, ai_socktype=1}, {}) = -2


(I am not familiar with NS api, but in either case I would expect getaddrinfo to return something more like EAI_AGAIN (-3) instead of being conclusive after just 1-2 lossy udp attempts.)
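Why the error code matters to callers can be sketched as follows, assuming an application that wants to retry temporary failures. The names gai_is_transient() and lookup_with_retry() are hypothetical, and this is not how openssh or glibc behave; it only shows that a caller can act on EAI_AGAIN but has nothing to retry on when it gets EAI_NONAME:

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Only EAI_AGAIN is documented as a temporary name resolution failure. */
int gai_is_transient(int gaierr)
{
    return gaierr == EAI_AGAIN;
}

/*
 * Hypothetical wrapper: call getaddrinfo() up to max_tries times, retrying
 * only when the failure is reported as temporary. Any other result, success
 * or a conclusive error such as EAI_NONAME, is returned immediately.
 * On success the caller owns *out and must release it with freeaddrinfo().
 */
int lookup_with_retry(const char *host, const struct addrinfo *hints,
                      int max_tries, struct addrinfo **out)
{
    int gaierr = EAI_AGAIN;
    int attempt;

    for (attempt = 0; attempt < max_tries; attempt++) {
        gaierr = getaddrinfo(host, NULL, hints, out);
        if (!gai_is_transient(gaierr))
            break;
        sleep(1); /* arbitrary back-off chosen for this sketch */
    }
    return gaierr;
}
```

With the behaviour described above, such a wrapper would not help: getaddrinfo returns -2 (EAI_NONAME), which the caller has to treat as conclusive.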


The relevant(?) output from strace -v -s200:

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.45.13")}, 28) = 0
fcntl64(3, F_GETFL)       = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
gettimeofday({1227298167, 857199}, NULL) = 0
poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "\232\314\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\1\0\1"..., 33, MSG_NOSIGNAL) = 33
poll([{fd=3, events=POLLIN|POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "\205\302\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\34\0\1"..., 33, MSG_NOSIGNAL) = 33
gettimeofday({1227298167, 858053}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 4999) = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [84])  = 0
recvfrom(3, "\205\302\205\200\0\1\0\0\0\1\0\0\2hg\10dadomain\3com\0\0\34\0\1\300\17\0\6\0\1\0\0\16\20\0'\4srvx\300\17\nhostmaster\0\0\0\27E\0\0\3\204\0\0\2X\0\1Q\200\0\0\16\20"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.45.13")}, [16]) = 84
gettimeofday({1227298167, 858786}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 4998) = 0 (Timeout)
close(3)                  = 0
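The first 12 bytes of the recvfrom() buffer above are the DNS header, and decoding them shows what kind of answer the resolver got. This is a small sketch following the header layout in RFC 1035 section 4.1.1; the struct and function names are made up for illustration, and the byte array is copied from the strace output (octal escapes converted to hex):

```c
#include <stddef.h>
#include <stdint.h>

/* The six 16-bit fields of a DNS message header, in host byte order. */
struct dns_header {
    uint16_t id, flags, qdcount, ancount, nscount, arcount;
};

/* Read one big-endian 16-bit value. */
static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the fixed 12-byte header; returns 0 on success, -1 if too short. */
int dns_parse_header(const unsigned char *buf, size_t len,
                     struct dns_header *h)
{
    if (len < 12)
        return -1;
    h->id      = be16(buf);
    h->flags   = be16(buf + 2);
    h->qdcount = be16(buf + 4);
    h->ancount = be16(buf + 6);
    h->nscount = be16(buf + 8);
    h->arcount = be16(buf + 10);
    return 0;
}

/* RCODE is the low 4 bits of the flags word (0 = no error). */
int dns_rcode(const struct dns_header *h)
{
    return h->flags & 0x000f;
}

/* "No data": a response (QR bit set) with RCODE 0 and no answer records. */
int dns_is_nodata(const struct dns_header *h)
{
    return (h->flags & 0x8000) != 0 && dns_rcode(h) == 0 && h->ancount == 0;
}

/* "\205\302\205\200\0\1\0\0\0\1\0\0" from the recvfrom() line above. */
const unsigned char captured_aaaa_reply[12] = {
    0x85, 0xc2, 0x85, 0x80, 0x00, 0x01,
    0x00, 0x00, 0x00, 0x01, 0x00, 0x00
};
```

Parsed this way, the reply has id 0x85c2 (the same id as the AAAA query in the second send()), RCODE 0, zero answer records and one authority record (the SOA). So the server is saying the name exists but has no AAAA record, not that the name does not exist.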


Using glibc-2.9-2.i686.

Domain name carefully modified to protect the innocent.

Comment 4 Bug Zapper 2008-11-26 05:20:11 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 5 Akihiro Nomura 2008-11-30 10:05:27 UTC
I faced the same problem with glibc/kernel in F-10 release.
$ rpm -q glibc kernel
glibc-2.9-2.i686
kernel-2.6.27.5-117.fc10.i686
To avoid this, I disabled ipv6 using modprobe.conf.
$ cat /etc/modprobe.conf 
install ipv6 :
This workaround does not solve the underlying problem,
but it may be helpful to users who do not use IPv6.

Comment 6 Mads Kiilerich 2008-12-08 15:31:48 UTC
glibc-2.9-3 mentioned on bug 459756 seems to work around this problem too. 

BTW: I notice that requests are now sent at 5 s intervals, but while A requests are retried 4 times, AAAA requests are only retried 2 times. I don't know if that is intended, but it makes it more OK that it fails with EAI_NONAME in case of failure. At least as long as I don't use IPv6 ...

Thanks, Jakub! I will keep testing it to make sure it really works.

Comment 7 Ulrich Drepper 2008-12-08 22:23:28 UTC
Let's dupe this bug.  No reason to keep it open as well, it's the same issue.

*** This bug has been marked as a duplicate of bug 459756 ***