Bug 471450 - Occasional failure on lookup when empty AAAA response comes before good A response
Status: CLOSED DUPLICATE of bug 459756
Product: Fedora
Classification: Fedora
Component: glibc
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Jakub Jelinek
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-11-13 14:07 EST by Mads Kiilerich
Modified: 2008-12-08 17:49 EST (History)
Last Closed: 2008-12-08 17:23:28 EST

Attachments
dns lookup test program (1.14 KB, text/plain)
2008-11-21 15:53 EST, Mads Kiilerich

Description Mads Kiilerich 2008-11-13 14:07:49 EST
Description of problem:

Initial DNS lookups occasionally fail. I most often see it with ssh (run from Mercurial on an ssh:// address).

I caught it in a Wireshark trace.

ssh sends an A and an AAAA query but receives only the AAAA response, which contains nothing but an SOA record. That causes "ssh: Could not resolve hostname x: Name or service not known".

Usually, when the lookup succeeds, I can see that the A response with the right address arrives before the (same) AAAA response. So the error comes when the A response is delayed or lost.

I would expect glibc to somehow resend the A query when it doesn't get a response to it.

No specific IPv6 configuration has been made. The setup might be broken, just like all the other places where there is partial but unused IPv6 support. I would expect it to work with IPv4 anyway.

The nameserver configured in /etc/resolv.conf (and responding) is on the LAN, runs Windows, and is probably buggy, but it worked with F9, so this looks like a regression.

Version-Release number of selected component (if applicable):


How reproducible:

A couple of times a day. Often enough to be very annoying, but seldom enough to be hard to reproduce ...

Additional info:

It could perhaps be a variant of Bug 460561 or Bug 469299?
Comment 1 Mads Kiilerich 2008-11-13 15:28:52 EST
OpenSSH does

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = family;
        hints.ai_socktype = SOCK_STREAM;
        snprintf(strport, sizeof strport, "%u", port);
        if ((gaierr = getaddrinfo(host, strport, &hints, &aitop)) != 0)
                fatal("%s: Could not resolve hostname %.100s: %s", __progname,
                    host, ssh_gai_strerror(gaierr));

and thus does not follow Ulrich's advice in Bug 459756#24. Can that explain this?

Perhaps a bug should be filed against OpenSSH? But for now, for F10: OpenSSH apparently worked before, and it would probably be easier to put a workaround in glibc than to fix all applications that have another (possibly wrong) opinion on how to do name resolution.
Comment 2 Ulrich Drepper 2008-11-14 12:24:07 EST
ssh certainly should use AI_ADDRCONFIG. But the lookup should still work without it.

You said you captured the traffic. Try to compile a little program with essentially the code in comment #1, run it under strace, and capture the DNS traffic using Wireshark.
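A minimal sketch of such a test helper, assuming nothing beyond the POSIX getaddrinfo() API. The name try_lookup and its extra_flags parameter are hypothetical, and this is not the program attached to this bug. It fills the hints the same way the excerpt in comment #1 does, optionally adds flags such as AI_ADDRCONFIG, and returns the raw getaddrinfo() result code so the failure can be seen under strace.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve `host` roughly the way the excerpt in
 * comment #1 does and return the raw getaddrinfo() result code.
 * `extra_flags` is an assumption of this sketch; pass 0 to mirror ssh,
 * or AI_ADDRCONFIG to follow the advice in comment #2. */
int try_lookup(const char *host, int family, int extra_flags)
{
    struct addrinfo hints, *aitop = NULL, *ai;
    char addr[128];
    int gaierr;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;          /* AF_INET, AF_INET6 or AF_UNSPEC */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = extra_flags;

    gaierr = getaddrinfo(host, NULL, &hints, &aitop);
    if (gaierr != 0) {
        fprintf(stderr, "getaddrinfo(%s) = %d (%s)\n",
                host, gaierr, gai_strerror(gaierr));
        return gaierr;
    }

    /* Print every returned address in numeric form. */
    for (ai = aitop; ai != NULL; ai = ai->ai_next) {
        if (getnameinfo(ai->ai_addr, ai->ai_addrlen, addr, sizeof(addr),
                        NULL, 0, NI_NUMERICHOST) == 0)
            printf("%s\n", addr);
    }
    freeaddrinfo(aitop);
    return 0;
}
```

Called from a small main() as try_lookup(argv[1], AF_UNSPEC, 0), this should send both the A and the AAAA query; with AF_INET only the A query is expected.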
Comment 3 Mads Kiilerich 2008-11-21 15:53:19 EST
Created attachment 324345 [details]
dns lookup test program

That was hard to reproduce. "My" DNS server sometimes answers AAAA before A, but I can't reproduce it on demand.

As a workaround for that I use an ugly hack: I use iptables to drop some of the UDP responses, matching on packet length, so that it completely drops the A answer while the AAAA responses come through:
-A INPUT -p udp -m length --length 77 -j LOG --log-prefix "dropping "
-A INPUT -p udp -m length --length 77 -j DROP
-A INPUT -p udp -j LOG --log-prefix "not dropping "

With that I can reproduce the ssh behaviour I mentioned, and I did the following.

I run the attached test program with hg as its parameter. resolv.conf has "domain dadomain.com" and "search dadomain.com" and "nameserver".

When using ai_family = AF_INET it sends a request for A twice and waits in a poll for 5 s, and then it fails with -2 (EAI_NONAME).

With ai_family = AF_UNSPEC it sends a request for A and a request for AAAA, gets the AAAA response (with the SOA of the nameserver as authoritative NS for the domain), waits in poll for 5 s, and then it fails the same way:

getaddrinfo("hg", "(null)", {ai_family=0, ai_socktype=1}, {}) = -2

(I am not familiar with the NS API, but in either case I would expect getaddrinfo to return something more like EAI_AGAIN (-3) instead of being conclusive after just 1-2 lossy UDP attempts.)
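To illustrate why the return code matters to callers, here is a minimal sketch of an application-side retry that only repeats the lookup when getaddrinfo() reports EAI_AGAIN. The names gai_is_transient and getaddrinfo_retry, the one-second pause, and the retry policy itself are assumptions of this sketch, not something glibc or OpenSSH does. With the behaviour described above (EAI_NONAME after a lost A response) such a wrapper would never retry, which is the point of the complaint.

```c
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy: only EAI_AGAIN (-3 in glibc) is treated as a
 * temporary failure; EAI_NONAME (-2 in glibc) is taken as conclusive. */
int gai_is_transient(int gaierr)
{
    return gaierr == EAI_AGAIN;
}

/* Hypothetical wrapper: call getaddrinfo() up to max_tries times,
 * repeating only while the result is a transient failure. */
int getaddrinfo_retry(const char *host, const char *service,
                      const struct addrinfo *hints,
                      struct addrinfo **res, int max_tries)
{
    int gaierr = EAI_AGAIN;

    for (int i = 0; i < max_tries && gai_is_transient(gaierr); i++) {
        if (i > 0)
            sleep(1);   /* arbitrary pause between attempts */
        gaierr = getaddrinfo(host, service, hints, res);
    }
    return gaierr;
}
```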

The relevant(?) output from strace -v -s200:

connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("")}, 28) = 0
fcntl64(3, F_GETFL)       = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
gettimeofday({1227298167, 857199}, NULL) = 0
poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "\232\314\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\1\0\1"..., 33, MSG_NOSIGNAL) = 33
poll([{fd=3, events=POLLIN|POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "\205\302\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\34\0\1"..., 33, MSG_NOSIGNAL) = 33
gettimeofday({1227298167, 858053}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 4999) = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [84])  = 0
recvfrom(3, "\205\302\205\200\0\1\0\0\0\1\0\0\2hg\10dadomain\3com\0\0\34\0\1\300\17\0\6\0\1\0\0\16\20\0'\4srvx\300\17\nhostmaster\0\0\0\27E\0\0\3\204\0\0\2X\0\1Q\200\0\0\16\20"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("")}, [16]) = 84
gettimeofday({1227298167, 858786}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 4998) = 0 (Timeout)
close(3)                  = 0
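The escaped strings in the strace output are raw DNS messages, so the fixed 12-byte header and the question can be decoded by hand. The sketch below does that for the A query and for the first 33 bytes of the AAAA reply, copied from the trace above. The names dns_summary, dns_summarize and dns_is_nodata are hypothetical, and the parser assumes an uncompressed question name, which is what the trace shows. Decoded this way, the reply has the response bit set, RCODE 0, no answer records and one authority record, that is, an empty (NODATA) AAAA answer.

```c
#include <stddef.h>
#include <stdint.h>

struct dns_summary {
    uint16_t id, flags;
    uint16_t qdcount, ancount, nscount, arcount;
    uint16_t qtype;            /* 1 = A, 28 = AAAA */
    char qname[256];
};

/* Read a big-endian 16-bit value. */
static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the header and the first question.
 * Returns 0 on success, -1 if the message is truncated or malformed. */
int dns_summarize(const unsigned char *msg, size_t len, struct dns_summary *out)
{
    size_t pos = 12, n = 0;

    if (len < 12)
        return -1;
    out->id = be16(msg);
    out->flags = be16(msg + 2);
    out->qdcount = be16(msg + 4);
    out->ancount = be16(msg + 6);
    out->nscount = be16(msg + 8);
    out->arcount = be16(msg + 10);
    if (out->qdcount == 0)
        return -1;

    /* The name is a sequence of length-prefixed labels ending in a 0 byte. */
    while (pos < len && msg[pos] != 0) {
        size_t label = msg[pos++];
        if (label > 63 || pos + label > len ||
            n + label + 1 >= sizeof(out->qname))
            return -1;
        if (n > 0)
            out->qname[n++] = '.';
        for (size_t i = 0; i < label; i++)
            out->qname[n++] = (char)msg[pos++];
    }
    out->qname[n] = '\0';

    if (pos + 5 > len)         /* terminating 0 byte + QTYPE + QCLASS */
        return -1;
    out->qtype = be16(msg + pos + 1);
    return 0;
}

/* A response with RCODE 0 (NOERROR) and no answer records. */
int dns_is_nodata(const struct dns_summary *s)
{
    return (s->flags & 0x8000) != 0 && (s->flags & 0x000f) == 0 &&
           s->ancount == 0;
}

/* Bytes copied from the send() and recvfrom() lines in the trace
 * (33 bytes each; use sizeof(...) - 1 to skip the trailing NUL). */
const unsigned char query_a[] =
    "\232\314\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\1\0\1";
const unsigned char reply_aaaa_head[] =
    "\205\302\205\200\0\1\0\0\0\1\0\0\2hg\10dadomain\3com\0\0\34\0\1";
```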

Using glibc-2.9-2.i686.

Domain name carefully modified to protect the innocent.
Comment 4 Bug Zapper 2008-11-26 00:20:11 EST
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
Comment 5 Akihiro Nomura 2008-11-30 05:05:27 EST
I faced the same problem with glibc/kernel in the F-10 release.
$ rpm -q glibc kernel
To avoid this, I disabled IPv6 using modprobe.conf.
$ cat /etc/modprobe.conf 
install ipv6 :
This workaround does not solve the underlying problem,
but it may be helpful to users who do not use IPv6.
Comment 6 Mads Kiilerich 2008-12-08 10:31:48 EST
glibc-2.9-3 mentioned on bug 459756 seems to work around this problem too. 

BTW: I notice that requests are now sent at 5 s intervals, but while A requests are retried 4 times, AAAA requests are only retried 2 times. I don't know if that is intended, but it makes it more acceptable that it fails with EAI_NONAME in case of failure. At least as long as I don't use IPv6 ...
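For reference, the resolver's per-query timeout and attempt count can be read back from the resolver state. This is a minimal sketch assuming the classic res_init() / _res interface from <resolv.h>; the function name resolver_retry_settings is hypothetical. In glibc these two values correspond to the "timeout:" and "attempts:" options in resolv.conf. The sketch does not explain the different retry counts for A and AAAA, it only shows where the configured values can be inspected.

```c
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

/* Hypothetical helper: report the resolver's retransmission interval
 * (in seconds) and its number of attempts.
 * Returns 0 on success, -1 if res_init() fails. */
int resolver_retry_settings(int *timeout_s, int *attempts)
{
    if (res_init() != 0)
        return -1;
    *timeout_s = _res.retrans;   /* resolv.conf "options timeout:n" */
    *attempts = _res.retry;      /* resolv.conf "options attempts:n" */
    return 0;
}
```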

Thanks, Jakub! I will keep testing it to make sure it really works.
Comment 7 Ulrich Drepper 2008-12-08 17:23:28 EST
Let's dupe this bug.  No reason to keep it open as well, it's the same issue.

*** This bug has been marked as a duplicate of bug 459756 ***
