Bug 57998

Summary:	Glibc 2.2.4 dns lookup is buggy
Product:	[Retired] Red Hat Linux	Reporter:	Need Real Name <jared_robinson>
Component:	glibc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	8.0	CC:	alfredo.maria.ferrari, david, fweimer, gary.r.hicks, jeremy, jlaidman, jmorton, k.georgiou, kjetilho, mrubel, pekkas, redhat.com, tao
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
URL:	http://jaredrobinson.com/dns.txt
Doc Type:	Bug Fix
Last Closed:	2003-11-05 18:30:48 UTC

Description Need Real Name 2002-01-04 18:35:00 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.1 (X11; Linux i686; U;) Gecko/20011226

Description of problem:
glibc 2.2.4 DNS network lookups are extremely slow on short hostnames. 
Glibc also ignores /etc/nsswitch.conf "hosts" setting.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Get a DNS server that doesn't respond to IPv6 ("AAAA") requests.
2. Make sure your RH 7.2 system is set up for IPv4.
3. tcpdump -i eth0 -np port domain
4. telnet saturn <--- a short hostname (not fully qualified)
5. ping saturn
6. Note that telnet uses libc to do the DNS lookup.  libc does an "AAAA"
query, which fails, and eventually falls back to an "A" query, which succeeds.
7. Note that ping uses libresolv, which does the "A" query first. (Use ldd
to figure out what libraries programs are using.)
8. telnet localhost
9. ping localhost
10. Note that "telnet localhost" does a DNS query, even when
/etc/nsswitch.conf says to look in files before looking at DNS.
11. Note that "ping localhost" does not do a DNS query. It resolves the name
from the /etc/hosts file.

Actual Results:  Telnet, SSH, ncftp and other utilities that use glibc to
do lookups are slow at resolving short hostnames.
Ping, mozilla, etc., which use libresolv, work correctly and quickly.  They
honor /etc/hosts.

Expected Results:  glibc should follow /etc/nsswitch.conf: first look the name
up in /etc/hosts, then query DNS with IPv4 "A" requests.

Additional info: See the URL above for my research, including tcpdump results.
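
For anyone who wants to reproduce the timing difference without telnet and ping, here is a rough sketch. It assumes the slow path is a getaddrinfo() call with an unspecified address family and compares it against an IPv4-only request. The helper names (timed_lookup, compare_families) are made up for this example, and Python's socket.getaddrinfo is only a thin wrapper around the C library call.

```python
import socket
import time


def timed_lookup(host, family=socket.AF_UNSPEC):
    """Resolve host for one address family.

    Returns (elapsed_seconds, sorted list of address strings).
    A failed lookup is reported as an empty address list.
    """
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        infos = []
    elapsed = time.monotonic() - start
    # info[4] is the sockaddr tuple; its first element is the address string.
    return elapsed, sorted({info[4][0] for info in infos})


def compare_families(host):
    """Time the same name with AF_UNSPEC and with AF_INET only."""
    return {
        "AF_UNSPEC": timed_lookup(host, socket.AF_UNSPEC),
        "AF_INET": timed_lookup(host, socket.AF_INET),
    }
```

Calling compare_families("saturn") (the short hostname from the steps above) while tcpdump is running should show whether the AF_UNSPEC request is the one that triggers the extra "AAAA" queries on a given system.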

Comment 1 Alfredo Ferrari 2002-01-14 15:47:28 UTC

This bug is extremely serious on local (home-like) networks. Suppose you have a
few (>=2) computers connected via Ethernet, with their addresses listed in
/etc/hosts on every machine (the typical home situation), and that you are
currently NOT connected to the Internet. Telnet, ssh (yes!), ftp, etc. will all
fail because the resolver isn't satisfied with the /etc/hosts match. According
to a long discussion on libc-alpha, it tries to get an IPv6 address as well,
even though the machine is NOT set up for IPv6. So it tries the nameservers
listed in /etc/resolv.conf; if no outside connection is available, it hangs
forever (or at least long enough to be fully unusable). Even if the connection
is up, to reach a machine 1 m away it goes to the ISP nameserver, which usually
does not know about your internal addresses.
If /etc/resolv.conf is empty, everything works (no nameserver to ask for IPv6
addresses...), but of course you are then unable to surf the net, unless you
play gymnastics with resolv.conf (renaming it when not connected).

If I put dummy IPv6 addresses beside the "good" IPv4 ones inside /etc/hosts
(duplicating all entries), it works (complaining about an unusable address...).

I would like to stress this is really a killer for all home-made networks.

Comment 2 Pekka Savola 2002-01-17 22:41:32 UTC

This is caused by the buggy getaddrinfo/getnameinfo implementation that most
IPv6-enabled software uses. Please see:

http://sources.redhat.com/ml/libc-alpha/2001-11/msg00125.html

Comment 3 Pekka Savola 2002-01-18 07:59:19 UTC

Let me add that I believe this is a high-priority issue, as it affects all
applications using PF_UNSPEC (mainly those meant to be protocol-independent)
and get*info().

Unfortunately, rewriting parts of the resolver code, as mentioned in BUGS, may
be necessary :-(

Comment 4 Pekka Savola 2002-01-18 08:04:16 UTC

*** Bug 53929 has been marked as a duplicate of this bug. ***

Comment 5 Pekka Savola 2002-01-18 08:14:49 UTC

Others (NetBSD) have had the same problem, though it is fixed there now:

http://mail-index.netbsd.org/tech-net/2000/02/10/0009.html
http://mail-index.netbsd.org/tech-net/2000/02/11/0000.html

Comment 6 Jakub Jelinek 2002-01-27 10:25:02 UTC

*** Bug 58852 has been marked as a duplicate of this bug. ***

Comment 7 John Hardin 2002-09-05 22:26:46 UTC

This also affects GLIBC 2.1 in RH6.2 - please don't forget us non-bleeding-edge
folks!

Comment 8 Peter Fales 2002-10-11 21:59:24 UTC

Here is another request for help with this, as we're still seeing it in RH7.3
with all updates applied.  Is there a reason why it hasn't been fixed?  Is
there any workaround, something else that could be put in /etc/hosts to
make it work?  Is there anything we can do short of building our telnet
from source?

Comment 9 Pekka Savola 2002-10-11 22:14:55 UTC

The fix involves a rewrite of a part of glibc, and glibc developers have deemed
that a low-priority item.

The issue cannot be worked around.  Well, you could try to add something like
'::1 hostname' to /etc/hosts, where hostname is the node you wish to connect
to, but I doubt it works as you expect.

Comment 10 Need Real Name 2002-10-12 20:33:44 UTC

Just tried it out on RedHat 8.0, and it is still buggy.  I'd have expected it to
be fixed by now.

Comment 11 Need Real Name 2002-10-13 06:06:06 UTC

Also, this effectively 'breaks' our LVS cluster.  LVS realservers cannot talk
to the cluster.  Only machines outside of the cluster can talk to it.
Normally, we could simply add a host to /etc/hosts, pointing to an internal IP
to avoid the LVS director.  But because of this bug, the name resolves to the
director IP, and hence we can't hit any realservers from within the cluster
(which makes it difficult to execute certain tasks).

Comment 12 Need Real Name 2002-10-16 22:05:10 UTC

Don't know if anyone is still interested, but here's a workaround:

strace reveals what's really happening:

connect(3, {sin_family=AF_UNIX, path="/var/run/.nscd_socket"}, 110) = 0
write(3, "\2\0\0\0\5\0\0\0\17\0\0\0", 12) = 12
write(3, "www.yahoo.com.\0", 15)

Note the trailing dot at the end of the hostname - that works fine and dandy w/ 
DNS, but not so well w/ /etc/hosts.  So a line like this:

127.0.0.1 www.yahoo.com. www.yahoo.com

Will work as you might expect. (The second www.yahoo.com without the trailing 
dot is required as well, since libresolv searches /etc/hosts for hostnames 
w/out the trailing dot).
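
To illustrate why the extra alias matters, here is a toy model of a hosts-file lookup. It assumes names are compared as plain strings (case-insensitively), so "www.yahoo.com." and "www.yahoo.com" count as different names. This is an assumption drawn from the strace output above, not glibc's actual code, and parse_hosts / lookup_exact are names invented for the example.

```python
def parse_hosts(text):
    """Parse /etc/hosts-style text into a list of (address, [names])."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if names:
            entries.append((address, names))
    return entries


def lookup_exact(entries, name):
    """Return addresses of entries listing `name` exactly (case-insensitive)."""
    wanted = name.lower()
    return [addr for addr, names in entries
            if wanted in (n.lower() for n in names)]


plain = parse_hosts("127.0.0.1 www.yahoo.com\n")
both = parse_hosts("127.0.0.1 www.yahoo.com. www.yahoo.com\n")
```

Under that assumption, lookup_exact(plain, "www.yahoo.com.") finds nothing, while the two-alias line in `both` matches with or without the trailing dot.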

Comment 13 Peter Fales 2002-10-16 22:20:10 UTC

This workaround (putting the extra . at the end
of the name in /etc/hosts) doesn't seem to 
work for me.  Does it only work if you are running 
nscd?

Comment 14 Need Real Name 2002-10-16 22:45:32 UTC

Hmm, I looked at it a bit more closely after your message.  It works with and
without nscd for me, but I see that it still does the DNS lookup (but returns
the address found in /etc/hosts).  That's not really a problem for my purposes,
since our DNS server is fast enough, and I'm only interested in overriding what
it returns.

If that's the root of the problem for you, and your site doesn't use IPv6, you 
could use both workarounds mentioned here:

127.0.0.1 www.yahoo.com. www.yahoo.com
::1 www.yahoo.com.

If you're not using IPv6, the "::1" entry will fail quickly, and continue on to 
the IPv4 hosts entry.

Hope that helps.

Comment 15 Ulrich Drepper 2003-04-22 02:42:29 UTC

Try RHL9.  The glibc in that release has quite a few changes in getaddrinfo
which should make it behave better or even "as expected" when it comes to IPv6.

Comment 16 Bojan Smojver 2003-06-13 06:43:34 UTC

This is from RHL 9, obtained via tcpdump:

---------------------------------------------
16:30:50.726686 127.0.0.1.32828 > 127.0.0.1.domain:  13179+ AAAA? router.rexursive.com. (38) (DF)
16:30:50.728385 127.0.0.1.domain > 127.0.0.1.32828:  13179* 0/1/0 (89) (DF)
16:30:50.729013 127.0.0.1.32828 > 127.0.0.1.domain:  13180+ AAAA? router. (24) (DF)
16:30:50.734473 172.27.0.12.32827 > 192.5.5.241.domain:  21380 [1au] NS? . (28) (DF)
16:30:50.734542 172.27.0.12.32827 > 192.5.5.241.domain:  54414 [1au] AAAA? router. (35) (DF)
16:30:50.979328 192.5.5.241.domain > 172.27.0.12.32827:  21380*- 13/0/14 NS F.ROOT-SERVERS.NET.,[|domain]
16:30:51.023246 192.5.5.241.domain > 172.27.0.12.32827:  54414 NXDomain*- 0/1/1 (110)
16:30:51.024614 127.0.0.1.domain > 127.0.0.1.32828:  13180 NXDomain* 0/1/0 (99) (DF)
16:30:51.024999 127.0.0.1.32828 > 127.0.0.1.domain:  13181+ A? router.rexursive.
---------------------------------------------

And then the lookup is successful. If the answer for "router." is not cached,
the lookup takes a long time, because the query goes out to the DNS root
servers on the Internet. I think this was the case in RHL 7.x and 8.x as well,
as described previously in this bug.
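
The order of names in the first trace (router.rexursive.com., then the bare "router.") looks like ordinary resolver search-list handling. Here is a simplified model of that ordering, assuming a single search domain (rexursive.com, taken from the trace) and the usual default of ndots=1. The function name candidate_names is made up, and real resolvers have more options than this sketch covers.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the absolute names a stub resolver would try, in order.

    Simplified model: names with fewer than `ndots` dots are tried with
    each search domain first and as-is last; other names are tried as-is
    first. A name ending in "." is already absolute and is tried alone.
    """
    if name.endswith("."):
        return [name]
    as_is = name + "."
    searched = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [as_is] + searched
    return searched + [as_is]
```

With this model, candidate_names("router", ["rexursive.com"]) gives router.rexursive.com. followed by router., which is the order of the two AAAA queries in the trace. The bare "router." query is the one that ends up at the root servers.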

With the AAAA addresses available on the DNS server, the situation is different:

---------------------------------------------
16:38:41.854023 127.0.0.1.32829 > 127.0.0.1.domain:  9597+ AAAA? router.rexursive.com. (38) (DF)
16:38:41.855508 127.0.0.1.domain > 127.0.0.1.32829:  9597* 1/2/4 AAAA[|domain] (DF)
16:38:41.856069 127.0.0.1.32829 > 127.0.0.1.domain:  9598+ A? router.rexursive.com. (38) (DF)
16:38:41.856845 127.0.0.1.domain > 127.0.0.1.32829:  9598* 1/2/4 A[|domain] (DF)
---------------------------------------------

The host is resolved in two queries to the DNS server. I really don't understand
DNS all that well, but without the IPv6 addresses set up, telnet tends to go
"outside", which makes local queries take a long time.

With IPv6 addresses set up, I'm getting "socket: Address family not supported by
protocol" when I try to "ssh router". This is ugly, but understandable, given
that I have no IPv6 support on those machines.

Hope this helps in resolving this.

Bojan

Comment 17 Kjetil T. Homme 2003-09-16 05:03:17 UTC

I experienced this problem on our LVS directors: the checking scripts would take
too long and the service would be deemed down.  The cause was name resolution
being too slow, despite running a caching name server.  Sequence of events:

  DNS query localhost AAAA realserver.dom
    => no match
  NIS query some-overloaded-nis-server realserver.dom
    => failure after five seconds or more
  DNS query localhost A realserver.dom
    => success!

The workaround for us is to use
  hosts:      files dns [NOTFOUND=return] nis
in /etc/nsswitch.conf.  This is acceptable to us, but not to most of the other
bug reporters.

The proper fix is IMHO to introduce ipnodes in nss (cf. Solaris) to allow the name
service for IPv6 to be configured separately from IPv4.
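
To show what [NOTFOUND=return] changes, here is a small simulation of how an nsswitch-style chain is walked. It assumes the documented default actions (SUCCESS=return, everything else continue). The service functions are stubs standing in for the AAAA pass described above, and run_chain is an invented name, not a glibc interface.

```python
DEFAULT_ACTIONS = {
    "SUCCESS": "return",
    "NOTFOUND": "continue",
    "UNAVAIL": "continue",
    "TRYAGAIN": "continue",
}


def run_chain(chain, name):
    """Walk a list of (label, lookup_fn, action_overrides) entries.

    Each lookup_fn returns (status, value). Returns the final status and
    value plus the labels of the services that were actually consulted.
    """
    consulted = []
    status, value = "UNAVAIL", None
    for label, lookup, overrides in chain:
        consulted.append(label)
        status, value = lookup(name)
        action = {**DEFAULT_ACTIONS, **overrides}[status]
        if action == "return":
            break
    return status, value, consulted


# Stubs for the AAAA pass: nothing has an IPv6 address for the name,
# and the NIS stub stands in for the overloaded server.
def files_aaaa(name):
    return ("NOTFOUND", None)


def dns_aaaa(name):
    return ("NOTFOUND", None)


def nis_aaaa(name):
    return ("UNAVAIL", None)


default_chain = [("files", files_aaaa, {}), ("dns", dns_aaaa, {}),
                 ("nis", nis_aaaa, {})]
workaround_chain = [("files", files_aaaa, {}),
                    ("dns", dns_aaaa, {"NOTFOUND": "return"}),
                    ("nis", nis_aaaa, {})]
```

In this model the default chain consults files, dns and nis for the AAAA pass, while the workaround chain stops after dns, so the slow NIS query is never made.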

Comment 18 Ulrich Drepper 2003-10-03 09:46:56 UTC

Recent glibc versions implement the AI_ADDRCONFIG flag for getaddrinfo().  It
should solve the problem, at least as far as it is intended to be solved.

If getaddrinfo is passed PF_UNSPEC the function will determine if the system has
an IPv6 interface.  If not, it will not look up IPv6 addresses.  And vice versa.

If IPv6 and IPv4 interfaces are present the expected behavior is to look up both
kinds of addresses.

I'll leave this bug open for a bit longer and will close it unless somebody has
a comment.
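
For reference, a minimal sketch of how an application would ask for this behaviour. It uses Python's socket module, which passes the flag straight through to the C library's getaddrinfo(). The resolve() helper is invented for the example, and whether the flag actually suppresses AAAA queries depends on the glibc version and the interfaces configured on the machine.

```python
import socket


def resolve(host, addrconfig=True):
    """Resolve host with AF_UNSPEC, optionally setting AI_ADDRCONFIG.

    Returns a list of (family_name, address) pairs.
    """
    flags = socket.AI_ADDRCONFIG if addrconfig else 0
    infos = socket.getaddrinfo(
        host, None,
        family=socket.AF_UNSPEC,
        type=socket.SOCK_STREAM,
        flags=flags,
    )
    return [(socket.AddressFamily(family).name, sockaddr[0])
            for family, _type, _proto, _canon, sockaddr in infos]
```

Running resolve("saturn") and resolve("saturn", addrconfig=False) next to tcpdump is one way to see whether the flag changes which queries are sent on a particular system ("saturn" being the example hostname from the original report).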

Comment 19 Ulrich Drepper 2003-10-03 17:10:38 UTC

On request, a bit more information on the availability.

I committed the changes on 2003-04-24.  They are not in RHL9 or earlier
releases, and since this is an enhancement they are not slated to go into
errata.  The changes are in the RHEL3 code and the Fedora Core test 2 release.

Comment 20 Ulrich Drepper 2003-11-05 18:30:48 UTC

RHEL3 and Fedora Core 1 both include the changes.  No backporting is
planned, so I am closing this bug.

Comment 21 Need Real Name 2004-12-13 17:21:36 UTC

(AI_ADDRCONFIG does _not_ fix the bug. The problem is not that it's
checking IPv6 addresses but that it was returning them in preference
to IPv4 addresses from earlier databases. This is a problem even for
people _with_ IPv6 interfaces.)

However, it seems that for the PF_UNSPEC case the bug has in fact been
fixed. I don't see any DNS queries for hosts that exist
in /etc/hosts.

In the PF_INET or PF_INET6 case, though, the problem still exists. If
you call getaddrinfo(PF_INET6) and there are no IPv6 addresses
in /etc/hosts then it will do a DNS query, even if there are IPv4
addresses in /etc/hosts.

This makes the results inconsistent. The following invariant doesn't
hold:

getaddrinfo(PF_UNSPEC) = union(getaddrinfo(PF_INET6), getaddrinfo(PF_INET))

Instead, if you do the two protocol families separately you get an
amalgam of /etc/hosts, DNS, or other databases.
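
A rough way to check that invariant for a given name is sketched below. It compares only the sets of address strings, treats a failed lookup as an empty set, and uses Python's socket.getaddrinfo as a stand-in for the C call. The helper names (addresses, invariant_holds) are invented for the example.

```python
import socket


def addresses(host, family):
    """Return the set of address strings getaddrinfo gives for one family."""
    try:
        infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return set()  # treat "no result" as an empty set
    return {info[4][0] for info in infos}


def invariant_holds(host):
    """Check PF_UNSPEC result == union of the PF_INET6 and PF_INET results."""
    unspec = addresses(host, socket.AF_UNSPEC)
    split = addresses(host, socket.AF_INET6) | addresses(host, socket.AF_INET)
    return unspec == split
```

For a name that has only an IPv4 entry in /etc/hosts but also exists in DNS, invariant_holds(name) returning False would match the inconsistency described above.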

Comment 22 Need Real Name 2004-12-13 17:23:47 UTC

Hum. This bug isn't being reopened even though there are new comments.
Should I open a new bug and reference this old one?