Description of problem:
Initial DNS lookups fail. I most often see it with ssh (run from Mercurial on an ssh:// address).

I caught it in a Wireshark trace. ssh sends an A and an AAAA query, and it gets a response only for the AAAA query, containing only an SOA record. That causes "ssh: Could not resolve hostname x: Name or service not known".

Usually, when the lookup succeeds, I can see that the A response with the right address arrives before the (same) AAAA response. So the error occurs when the A response is delayed or lost. I would expect glibc to somehow resend the A query when it doesn't get a response to it.

No specific IPv6 configuration has been made. The setup might be broken, just like all other places where there is partial but unused IPv6 support. I would expect it to work with IPv4 anyway.

The nameserver configured in /etc/resolv.conf (and responding) is on the LAN, runs Windows and is probably buggy, but it worked with F9, so this looks like a regression.

Version-Release number of selected component (if applicable):
glibc-2.8.90-16.i686

How reproducible:
A couple of times a day. Often enough to be very annoying, but seldom enough to be hard to reproduce ...

Additional info:
It could perhaps be a variant of Bug 460561 or Bug 469299?
OpenSSH does

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;
    hints.ai_socktype = SOCK_STREAM;
    snprintf(strport, sizeof strport, "%u", port);
    if ((gaierr = getaddrinfo(host, strport, &hints, &aitop)) != 0)
        fatal("%s: Could not resolve hostname %.100s: %s", __progname,
            host, ssh_gai_strerror(gaierr));

and is thus not following Ulrich's advice in Bug 459756#24 - can that explain this? Perhaps a bug should be filed against openssh?

But for now, for F10: openssh apparently worked before, and it would probably be easier to put a workaround in glibc than to fix all applications which have another (possibly wrong) opinion on how to do name resolution...
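For illustration, the hints setup from the snippet above with AI_ADDRCONFIG added might look like the sketch below. The helper names make_hints and resolve_host are hypothetical and are not taken from OpenSSH; the use_addrconfig switch exists only so that both behaviours can be compared from a test program.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for getaddrinfo() and AI_ADDRCONFIG */
#endif
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: build hints like the OpenSSH snippet,
 * optionally adding AI_ADDRCONFIG. */
static struct addrinfo make_hints(int family, int use_addrconfig)
{
    struct addrinfo hints;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;
    hints.ai_socktype = SOCK_STREAM;
    if (use_addrconfig)
        hints.ai_flags |= AI_ADDRCONFIG;
    return hints;
}

/* Hypothetical helper: returns the getaddrinfo() result code.
 * On success the caller must release *res with freeaddrinfo(). */
static int resolve_host(const char *host, unsigned port, int family,
                        int use_addrconfig, struct addrinfo **res)
{
    char strport[32];
    struct addrinfo hints = make_hints(family, use_addrconfig);

    snprintf(strport, sizeof strport, "%u", port);
    return getaddrinfo(host, strport, &hints, res);
}
```

Calling resolve_host("hg", 22, AF_UNSPEC, 1, &res) and then resolve_host("hg", 22, AF_UNSPEC, 0, &res) under strace should show whether the flag changes which queries are sent on a given machine.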
ssh certainly should use AI_ADDRCONFIG. Even so, it should still work. You said you captured the traffic. Try compiling a little program with essentially the code from comment #1, run it under strace, and capture the DNS traffic using wireshark.
Created attachment 324345
dns lookup test program

That was hard to reproduce. "My" DNS server sometimes answers AAAA before A, but I can't reproduce it on demand.

As a workaround I use an ugly hack: I use iptables to drop some of the UDP responses, matching on packet length, so that it completely drops the A answer but lets the AAAA responses through:

-A INPUT -p udp -m length --length 77 -j LOG --log-prefix "dropping "
-A INPUT -p udp -m length --length 77 -j DROP
-A INPUT -p udp -j LOG --log-prefix "not dropping "

With that I can reproduce the ssh behaviour I mentioned, and do the following.

I run the attached test program with hg as parameter. resolv.conf has "domain dadomain.com", "search dadomain.com" and "nameserver 192.168.45.13".

With ai_family = AF_INET it twice sends a request for A and waits in a poll for 5 s, and then it fails with -2 (EAI_NONAME?).

With ai_family = AF_UNSPEC it sends a request for A and a request for AAAA, gets the AAAA response (with the SOA of the nameserver as authoritative NS for the domain), waits in poll for 5 s, and then fails the same way:

getaddrinfo("hg", "(null)", {ai_family=0, ai_socktype=1}, {}) = -2

(I am not familiar with the NS API, but in either case I would expect getaddrinfo to return something more like EAI_AGAIN (-3) instead of being conclusive after just 1-2 lossy UDP attempts.)

The relevant(?) output from strace -v -s200:

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.45.13")}, 28) = 0
fcntl64(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
gettimeofday({1227298167, 857199}, NULL) = 0
poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "\232\314\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\1\0\1"..., 33, MSG_NOSIGNAL) = 33
poll([{fd=3, events=POLLIN|POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "\205\302\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\34\0\1"..., 33, MSG_NOSIGNAL) = 33
gettimeofday({1227298167, 858053}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 4999) = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [84]) = 0
recvfrom(3, "\205\302\205\200\0\1\0\0\0\1\0\0\2hg\10dadomain\3com\0\0\34\0\1\300\17\0\6\0\1\0\0\16\20\0'\4srvx\300\17\nhostmaster\0\0\0\27E\0\0\3\204\0\0\2X\0\1Q\200\0\0\16\20"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.45.13")}, [16]) = 84
gettimeofday({1227298167, 858786}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 4998) = 0 (Timeout)
close(3) = 0

Using glibc-2.9-2.i686. Domain name carefully modified to protect the innocent.
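The payloads in the strace output can be decoded to confirm which query actually got an answer. The following is a minimal sketch, assuming the standard DNS message layout (12-byte header followed by an uncompressed question section). The names dns_msg, parse_dns_question and the array names are hypothetical. The byte arrays are copied from the send() and recvfrom() calls above; the response is cut off after its question section.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Payloads copied from the strace output (octal escapes as printed). */
static const unsigned char query_a[] =
    "\232\314\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\1\0\1";
static const unsigned char query_aaaa[] =
    "\205\302\1\0\0\1\0\0\0\0\0\0\2hg\10dadomain\3com\0\0\34\0\1";
static const unsigned char response_head[] =
    "\205\302\205\200\0\1\0\0\0\1\0\0\2hg\10dadomain\3com\0\0\34\0\1";

/* Length of a payload array without the trailing NUL of the literal. */
#define PAYLOAD_LEN(a) (sizeof(a) - 1)

struct dns_msg {
    uint16_t id, flags, qdcount, ancount, nscount;
    char qname[256];          /* dotted form of the first question name */
    uint16_t qtype, qclass;   /* 1 = A, 28 = AAAA */
};

/* Read a big-endian 16-bit value. */
static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode the header and the first question. Returns 0 on success and
 * -1 if the buffer is too short or the name is not plain labels. */
static int parse_dns_question(const unsigned char *buf, size_t len,
                              struct dns_msg *out)
{
    size_t pos = 12, n = 0;

    if (len < 12)
        return -1;
    out->id = rd16(buf);
    out->flags = rd16(buf + 2);
    out->qdcount = rd16(buf + 4);
    out->ancount = rd16(buf + 6);
    out->nscount = rd16(buf + 8);
    if (out->qdcount < 1)
        return -1;

    for (;;) {
        unsigned l;

        if (pos >= len)
            return -1;
        l = buf[pos++];
        if (l == 0)
            break;            /* root label ends the name */
        if (l > 63)
            return -1;        /* compression pointer: not handled here */
        if (pos + l > len || n + l + 1 >= sizeof out->qname)
            return -1;
        if (n > 0)
            out->qname[n++] = '.';
        memcpy(out->qname + n, buf + pos, l);
        n += l;
        pos += l;
    }
    out->qname[n] = '\0';

    if (pos + 4 > len)
        return -1;
    out->qtype = rd16(buf + pos);
    out->qclass = rd16(buf + pos + 2);
    return 0;
}
```

Decoded this way, the first send() is id 0x9acc with qtype 1 (A) and the second is id 0x85c2 with qtype 28 (AAAA), both for hg.dadomain.com. The only packet received carries id 0x85c2, flags 0x8580, ANCOUNT 0 and NSCOUNT 1, so it answers the AAAA query with no address records and a single authority record, and nothing ever arrives for id 0x9acc.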
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle. Changing version to '10'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
I faced the same problem with glibc/kernel in the F-10 release.

$ rpm -q glibc kernel
glibc-2.9-2.i686
kernel-2.6.27.5-117.fc10.i686

To avoid this, I disabled ipv6 using modprobe.conf.

$ cat /etc/modprobe.conf
install ipv6 :

This workaround does not fix the underlying problem, but it may be helpful to users who do not use IPv6.
glibc-2.9-3, mentioned in bug 459756, seems to work around this problem too.

BTW: I notice that requests are now sent at 5 s intervals, but while A requests are retried 4 times, AAAA requests are only retried 2 times. I don't know if that is intended, but it makes it more acceptable that it fails with EAI_NONAME in case of failure. At least as long as I don't use IPv6 ...

Thanks, Jakub! I will keep testing it to make sure it really works.
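On the application side, the distinction between EAI_NONAME and EAI_AGAIN raised in this report could be handled roughly as in the sketch below, which retries only when getaddrinfo() reports a temporary failure. The helper names gai_code_name and getaddrinfo_retry are hypothetical. This is a sketch of caller-side handling only; it does not change how the resolver itself retries individual UDP queries, and since the failures seen here came back as EAI_NONAME (-2), such a wrapper would not have retried them.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for getaddrinfo() and the EAI_* codes */
#endif
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: symbolic name for the result codes discussed here. */
static const char *gai_code_name(int code)
{
    switch (code) {
    case 0:          return "success";
    case EAI_NONAME: return "EAI_NONAME";  /* name is not known */
    case EAI_AGAIN:  return "EAI_AGAIN";   /* temporary failure */
    case EAI_FAIL:   return "EAI_FAIL";    /* non-recoverable failure */
    default:         return "other";
    }
}

/* Hypothetical helper: call getaddrinfo() up to max_attempts times, but
 * only repeat the call after EAI_AGAIN. Any other result, success or
 * failure, is returned immediately. If max_attempts is less than 1,
 * no lookup is made and EAI_AGAIN is returned. */
static int getaddrinfo_retry(const char *host, const char *service,
                             const struct addrinfo *hints,
                             struct addrinfo **res,
                             int max_attempts, int *attempts_used)
{
    int rc = EAI_AGAIN;
    int n = 0;

    while (n < max_attempts) {
        n++;
        rc = getaddrinfo(host, service, hints, res);
        if (rc != EAI_AGAIN)
            break;
    }
    if (attempts_used != NULL)
        *attempts_used = n;
    return rc;
}
```

A caller would print gai_code_name(rc) next to gai_strerror(rc) to see at a glance whether a failed lookup was reported as conclusive or as temporary.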
Let's dupe this bug. No reason to keep it open as well, it's the same issue. *** This bug has been marked as a duplicate of bug 459756 ***