Bug 57998
Summary: | Glibc 2.2.4 dns lookup is buggy | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Need Real Name <jared_robinson> |
Component: | glibc | Assignee: | Jakub Jelinek <jakub> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 8.0 | CC: | alfredo.maria.ferrari, david, fweimer, gary.r.hicks, jeremy, jlaidman, jmorton, k.georgiou, kjetilho, mrubel, pekkas, redhat.com, tao |
Hardware: | i386 | ||
OS: | Linux | ||
URL: | http://jaredrobinson.com/dns.txt | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2003-11-05 18:30:48 UTC | Type: | --- |
Description
Need Real Name
2002-01-04 18:35:00 UTC
This bug is extremely serious on local (home-like) networks. Suppose you have a few (>=2) computers connected via Ethernet, with their addresses listed in /etc/hosts on each machine (the typical home situation), and that you are presently NOT connected to the Internet. Any of telnet, ssh (yes!), ftp etc. will fail because the resolver isn't satisfied with the /etc/hosts match. According to a looong discussion on libc-alpha, it tries to get an IPv6 address as well, even though the machine is NOT set up for IPv6. So it tries the nameservers listed in /etc/resolv.conf; if no outside connection is available, it hangs forever (or at least long enough to be fully unusable). Even if the connection is up, to reach a machine 1 m away it goes to the ISP nameserver, which usually does not know about your internal addresses.

If /etc/resolv.conf is empty everything works (no nameserver to ask for IPv6 addresses...), but of course you are then unable to surf the net... unless you play gymnastics with resolv.conf (renaming it when not connected). If I put dummy IPv6 addresses beside the "good" IPv4 ones in /etc/hosts (duplicating all entries), it works (while complaining about an unusable address...).

I would like to stress this is really a killer for all home-made networks.

This is caused by the buggy getaddrinfo/getnameinfo implementation that most IPv6-enabled software uses; please see:
http://sources.redhat.com/ml/libc-alpha/2001-11/msg00125.html

Let me add that I believe this is a high-priority issue, as it affects all applications using PF_UNSPEC (mainly those meant to be protocol-independent) and get*info(). Unfortunately, rewriting parts of the resolver code, as mentioned in BUGS, may be necessary :-(

*** Bug 53929 has been marked as a duplicate of this bug. ***

Others have had the same problem, though it is fixed there now:
http://mail-index.netbsd.org/tech-net/2000/02/10/0009.html
http://mail-index.netbsd.org/tech-net/2000/02/11/0000.html

*** Bug 58852 has been marked as a duplicate of this bug.
***

This also affects GLIBC 2.1 in RH6.2 - please don't forget us non-bleeding-edge folks!

Here is another request for help with this, as we're still seeing it in RH7.3 with all updates applied. Is there a reason why it hasn't been fixed? Is there any workaround - something else that could be put in /etc/hosts to make it work? Is there anything we can do short of building our telnet from source?

The fix involves a rewrite of a part of glibc, and the glibc developers have deemed that a low-priority item. The issue cannot be worked around. Well, you could try to add something like 'hostname ::1', where hostname would be the node you wish to connect to, in /etc/hosts, but I doubt it works as you expect.

Just tried it out on Red Hat 8.0, and it is still buggy. I'd have expected it to be fixed by now. Also, this effectively 'breaks' our LVS cluster. LVS realservers cannot talk to the cluster; only machines outside of the cluster can talk to it. Normally, we could simply add a host to /etc/hosts, pointing to an internal IP, to avoid the LVS director. But because of this bug, we resolve the director IP, and hence can't hit any realservers from within the cluster (which makes it difficult to execute certain tasks).

Don't know if anyone is still interested, but here's a workaround. strace reveals what's really happening:

    connect(3, {sin_family=AF_UNIX, path="/var/run/.nscd_socket"}, 110) = 0
    write(3, "\2\0\0\0\5\0\0\0\17\0\0\0", 12) = 12
    write(3, "www.yahoo.com.\0", 15)

Note the trailing dot at the end of the hostname - that works fine and dandy w/ DNS, but not so well w/ /etc/hosts. So a line like this:

    127.0.0.1 www.yahoo.com. www.yahoo.com

will work as you might expect. (The second www.yahoo.com without the trailing dot is required as well, since libresolv searches /etc/hosts for hostnames w/out the trailing dot.)

This workaround (putting the extra . at the end of the name in /etc/hosts) doesn't seem to work for me. Does it only work if you are running nscd?
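As a rough illustration of why the hosts line in the workaround lists the name both with and without the trailing dot, here is a minimal sketch. It is a toy model that assumes names in a hosts file are compared as literal strings (case-insensitively); it is not glibc's actual parser, and `parse_hosts` is a hypothetical helper introduced only for this example.

```python
def parse_hosts(text):
    """Map every name in hosts-file text to the first address that lists it.

    Toy model only: names are compared as literal lowercase strings,
    which is an assumption, not a description of glibc's code.
    """
    table = {}
    for line in text.splitlines():
        # Drop comments, then split into "address name [alias ...]".
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name.lower(), address)
    return table


# The line from the workaround: both spellings of the name are present.
with_dot = parse_hosts("127.0.0.1 www.yahoo.com. www.yahoo.com\n")

# A conventional line: only the spelling without the trailing dot.
without_dot = parse_hosts("127.0.0.1 www.yahoo.com\n")
```

Under this assumption, a query for `www.yahoo.com.` (the form seen in the strace output) finds an entry in `with_dot` but not in `without_dot`, which is consistent with the behaviour the workaround depends on.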
Hmm, I looked at it a bit more closely after your message. It works with and without nscd for me, but I see that it still does the DNS lookup (but returns the address found in /etc/hosts). That's not really a problem for my purposes, since our DNS server is fast enough, and I'm only interested in overriding what it returns. If that's the root of the problem for you, and your site doesn't use IPv6, you could use both workarounds mentioned here:

    127.0.0.1 www.yahoo.com. www.yahoo.com
    ::1 www.yahoo.com.

If you're not using IPv6, the "::1" entry will fail quickly, and the lookup will continue on to the IPv4 hosts entry. Hope that helps.

Try RHL9. The glibc in that release has quite a few changes in getaddrinfo which should make it behave better or even "as expected" when it comes to IPv6.

This is from RHL 9, obtained via tcpdump:

---------------------------------------------
16:30:50.726686 127.0.0.1.32828 > 127.0.0.1.domain: 13179+ AAAA? router.rexursive.com. (38) (DF)
16:30:50.728385 127.0.0.1.domain > 127.0.0.1.32828: 13179* 0/1/0 (89) (DF)
16:30:50.729013 127.0.0.1.32828 > 127.0.0.1.domain: 13180+ AAAA? router. (24) (DF)
16:30:50.734473 172.27.0.12.32827 > 192.5.5.241.domain: 21380 [1au] NS? . (28) (DF)
16:30:50.734542 172.27.0.12.32827 > 192.5.5.241.domain: 54414 [1au] AAAA? router. (35) (DF)
16:30:50.979328 192.5.5.241.domain > 172.27.0.12.32827: 21380*- 13/0/14 NS F.ROOT-SERVERS.NET.,[|domain]
16:30:51.023246 192.5.5.241.domain > 172.27.0.12.32827: 54414 NXDomain*- 0/1/1 (110)
16:30:51.024614 127.0.0.1.domain > 127.0.0.1.32828: 13180 NXDomain* 0/1/0 (99) (DF)
16:30:51.024999 127.0.0.1.32828 > 127.0.0.1.domain: 13181+ A? router.rexursive.
---------------------------------------------

And then the lookup is successful. If the request for "router." is not cached, the lookup will take a long time (i.e. it'll go out to the DNS root servers on the Internet and ask there). This makes the lookup rather long.
I think this was the case in RHL 7.x and 8.x as well, as described previously in this bug. With the AAAA addresses available on the DNS server, the situation is different:

---------------------------------------------
16:38:41.854023 127.0.0.1.32829 > 127.0.0.1.domain: 9597+ AAAA? router.rexursive.com. (38) (DF)
16:38:41.855508 127.0.0.1.domain > 127.0.0.1.32829: 9597* 1/2/4 AAAA[|domain] (DF)
16:38:41.856069 127.0.0.1.32829 > 127.0.0.1.domain: 9598+ A? router.rexursive.com. (38) (DF)
16:38:41.856845 127.0.0.1.domain > 127.0.0.1.32829: 9598* 1/2/4 A[|domain] (DF)
---------------------------------------------

The host is resolved in two queries to the DNS server. I really don't understand DNS all that well, but without the IPv6 addresses set up, telnet tends to go "outside", which makes local queries take a long time. With IPv6 addresses set up, I'm getting "socket: Address family not supported by protocol" when I try to "ssh router". This is ugly, but understandable, given that I have no IPv6 support on those machines. Hope this helps in resolving this.

Bojan

I experienced this problem on our LVS directors: the checking scripts would take too long and the service would be deemed down. The cause was name resolution being too slow, despite running a caching name server. Sequence of events:

    DNS query localhost AAAA realserver.dom => no match
    NIS query some-overloaded-nis-server realserver.dom => failure after five seconds or more
    DNS query localhost A realserver.dom => success!

The workaround for us is to use

    hosts: files dns [NOTFOUND=return] nis

in /etc/nsswitch.conf. This is acceptable to us, but not to most of the other bug reporters. The proper fix is IMHO to introduce ipnodes in nss (cf. Solaris) to allow the name service for IPv6 to be configured separately from IPv4.

Recent glibc versions implement the AI_ADDRCONFIG flag for getaddrinfo(). It should solve the problem, at least as far as it is intended to be solved.
If getaddrinfo is passed PF_UNSPEC, the function will determine whether the system has an IPv6 interface. If not, it will not look up IPv6 addresses. And vice versa. If IPv6 and IPv4 interfaces are both present, the expected behavior is to look up both kinds of addresses. I'll leave this bug open for a bit longer and will close it unless somebody has a comment.

On request, a bit more information on the availability. I committed the changes on 2003-04-24. They are not in RHL9 or earlier releases, and since this is an enhancement they are not slated to go into errata. The changes are in the RHEL3 code and the Fedora Core test 2 release.

RHEL3 and Fedora Core 1 both include the changes. No backporting planned. So I close this bug.

(AI_ADDRCONFIG does _not_ fix the bug. The problem is not that it's checking IPv6 addresses but that it was returning them in preference to IPv4 addresses from earlier databases. This is a problem even for people _with_ IPv6 interfaces.)

However, it seems that for the PF_UNSPEC case the bug has in fact been fixed. I don't see any DNS queries for hosts that exist in /etc/hosts. In the PF_INET or PF_INET6 case, though, the problem still exists. If you call getaddrinfo(PF_INET6) and there are no IPv6 addresses in /etc/hosts, then it will do a DNS query, even if there are IPv4 addresses in /etc/hosts. This makes the results inconsistent. The following invariant doesn't hold:

    getaddrinfo(PF_UNSPEC) = union(getaddrinfo(PF_INET6), getaddrinfo(PF_INET))

Instead, if you do the two protocol families separately, you get an amalgam of /etc/hosts, DNS, or other databases.

Hum. This bug isn't being reopened even though there are new comments. Should I open a new bug and reference this old copy?
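The invariant stated in the comment above can be checked from user space. The following is a minimal sketch using Python's `socket.getaddrinfo`, which wraps the C library call; the helper names `lookup` and `invariant_holds` are hypothetical and introduced only for this example, and treating a failed lookup as an empty result is a simplifying assumption.

```python
import socket


def lookup(host, family):
    """Return the set of address strings getaddrinfo yields for one family.

    A failed lookup is treated as an empty set (an assumption made to keep
    the comparison simple; real code may want to inspect the error).
    """
    try:
        infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def invariant_holds(host):
    """Check getaddrinfo(UNSPEC) == getaddrinfo(INET6) | getaddrinfo(INET)."""
    unspec = lookup(host, socket.AF_UNSPEC)
    separate = lookup(host, socket.AF_INET6) | lookup(host, socket.AF_INET)
    return unspec == separate
```

Calling `invariant_holds` with a name that appears only with an IPv4 address in /etc/hosts (for example a local name such as `router`) is one way to see whether the per-family lookups fall through to DNS while the PF_UNSPEC lookup does not. The result depends on the local resolver configuration, so no particular output is claimed here.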