From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050416 Fedora/1.0.3-1.3.1 Firefox/1.0.3

Description of problem:
If the first nameserver in /etc/resolv.conf exists and responds but gives no useful info then bind should try the next one, but it doesn't.

Example:

> more /etc/resolv.conf
; generated by /sbin/dhclient-script
nameserver 192.135.10.4
nameserver 192.135.10.18

# So there are 2 nameservers in resolv.conf.
# Try looking up an address:

> nslookup wuphys.wustl.edu
Server: 192.135.10.4
Address: 192.135.10.4#53

Non-authoritative answer:
*** Can't find wuphys.wustl.edu: No answer

# bind used the first nameserver and got "No answer". Then it gave up.
# Force it to use the second nameserver:

> nslookup wuphys.wustl.edu 192.135.10.18
Server: 192.135.10.18
Address: 192.135.10.18#53

Non-authoritative answer:
Name: wuphys.wustl.edu
Address: 128.252.125.70

# So it should have gone on and tried the second one when the first one
# proved useless.

Version-Release number of selected component (if applicable):
ord/info>rpm -q bind-utils
bind-utils-9.2.5-1

How reproducible:
Always

Steps to Reproduce:
see above

Actual Results:
see above

Expected Results:
see above

Additional info:
Sorry for the delay in responding to this bug report - I just returned from vacation today.

nslookup is deprecated - you should be using "host" or "dig". Do you get the same results from "host wuphys.wustl.edu."?

The first nameserver returns a response with an empty answer section, which is the only way the "No answer" response string can be generated by nslookup.

Note that the whole DNS system depends on there being one set of authoritative data for any given zone: if more than one server can produce responses for the same DNS name, only one server can be an authoritative "master" server for the zone containing the name, and all other servers for the zone must be slaves of the single master server. So the fact that two servers return different responses for the same name points to a fundamental misconfiguration of them.

If a server had returned no response at all or was unreachable, then the next server would have been tried. If a server returns NXDOMAIN, SERVFAIL or an empty answer section, but says that it is authoritative for the zone, then no other servers are tried, because there can be only one authoritative content for a given zone's data in the DNS.

I've not been able to cause a 9.2.5 or 9.3.1 server to return a DNS response with an empty answer section.

Please supply some further information for this bug report:

1. What BIND version is running on the first server?
   You can determine this with the following query:
     # dig CH TXT version.bind. @192.135.10.4

2. What SOA + NS information do the servers have about the zone?
   If you have access to the master zone database files for this zone, please send them to me - otherwise, do:
     # ( dig wuphys.wustl.edu. ANY @192.135.10.4; \
         dig wuphys.wustl.edu. ANY @192.135.10.18; \
       ) | tee /tmp/digany.log
   and append the /tmp/digany.log file to this bug report (or send it to jvdias).

3. Please gather a tcpdump of DNS traffic during a reproduction of the problem:
     # tcpdump -nl -vvv -s 2048 port domain 2>&1 | tee /tmp/tcpdump.log&
     # nslookup wuphys.wustl.edu
     # pkill tcpdump
   and append the /tmp/tcpdump.log to this bug report or send it to me.

Thank you!
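The per-server comparison requested above can also be scripted. What follows is only a sketch, not anything shipped with BIND: the helper names parse_nameservers, dig_command and compare_servers are hypothetical, and it assumes the dig utility from bind-utils is on the PATH and that only A records are of interest.

```python
import subprocess


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with ';' or '#'; only 'nameserver <addr>' lines matter here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def dig_command(name, server):
    """Build the argv for asking one specific server about one name."""
    return ["dig", "+short", name, "A", "@" + server]


def compare_servers(name, resolv_conf_path="/etc/resolv.conf"):
    """Query every configured nameserver directly and collect its short answer."""
    with open(resolv_conf_path) as handle:
        servers = parse_nameservers(handle.read())
    results = {}
    for server in servers:
        completed = subprocess.run(
            dig_command(name, server), capture_output=True, text=True, timeout=30
        )
        # An empty list means the server replied without usable A records (or dig failed).
        results[server] = completed.stdout.split()
    return results
```

Calling compare_servers("wuphys.wustl.edu.") on an affected machine would be expected to show an empty list for one server and an address for the other, which is the disagreement described in this report.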
Created attachment 116392: digchtxt.log dig output. Output of dig query
Created attachment 116393: digany.log dig output. Output of dig query of bad and good servers
Created attachment 116394: tcpdump.log tcpdump of DNS traffic
> Do you get the same results from "host wuphys.wustl.edu." ?

Yes I do. 192.135.10.4 returns "no answer", but 192.135.10.18 gives an answer.

----
> host wuphys.wustl.edu 192.135.10.4
Using domain server:
Name: 192.135.10.4
Address: 192.135.10.4#53
Aliases:

> host wuphys.wustl.edu 192.135.10.18
Using domain server:
Name: 192.135.10.18
Address: 192.135.10.18#53
Aliases:

wuphys.wustl.edu has address 128.252.125.70
----

> So the fact that two servers return different responses for the same
> name points to a fundamental misconfiguration of them.

Yes, it does. I still think it is valid to call this a bug because it would be easy to make bind more robust against this kind of misconfiguration, by having it go on to another server even if it gets an authoritative "no answer".

The problem arose for me when I was visiting an institution abroad, where they set us up with a local wireless network connection. Everyone with Windows could connect to their home institutions, but those with Linux could not. I finally traced the problem to this misconfiguration of the name servers. But the fact that Windows is robust against it and Linux isn't indicates that this is an area where Linux could be improved.

> 1. What BIND version is running on the first server ?
> You can determine this with the following query:
> # dig CH TXT version.bind. @192.135.10.4

See attached file digchtxt.log

I am happy to do this (and the others below), but I am not sure why you want me to: couldn't you type the command just as easily yourself?

> 2. What SOA + NS information do the servers have about the zone ?
> If you have access to the master zone database files for this
> zone, please send them to me - otherwise, do:
> # ( dig wuphys.wustl.edu. ANY @192.135.10.4; \
> dig wuphys.wustl.edu. ANY @192.135.10.18; \
> ) | tee /tmp/digany.log

See attached file digany.log

> 3. Please gather a tcpdump of DNS traffic during a reproduction of
> the problem:
> # tcpdump -nl -vvv -s 2048 port domain 2>&1 | tee /tmp/tcpdump.log&
> # nslookup wuphys.wustl.edu
> # pkill tcpdump
> and append the /tmp/tcpdump.log to this bug report or send it to
> me.

See attached file tcpdump.log. I explicitly told nslookup to use the misconfigured server 192.135.10.4, so the commands I actually typed were:

# tcpdump -nl -vvv -s 2048 port domain 2>&1 | tee /tmp/tcpdump.log&
# nslookup wuphys.wustl.edu 192.135.10.4
# pkill tcpdump
Many thanks for the information. Sorry, I did not realize that the 192.135.10.4 server also had a public internet address and that I could have done the CH TXT queries myself.

This server certainly seems most unwell, as it does not even know what version it is. The 192.135.10.4 server has no authoritative data for the zone, and recursion is disabled, so it sends an empty answer section and the root nameservers as a referral in the additional section.

Unfortunately, this problem cannot be fixed in BIND currently: the issue is not what BIND does, but what the glibc resolver does. The glibc resolver also does not try another server once a server sends an NXDOMAIN or empty answer response to a query - this is why your Linux machines failed to "connect" - all applications would be using the glibc resolver, not BIND. So fixing this problem in the BIND utilities would give the misleading impression that the DNS setup was OK, when applications would still be unable to resolve DNS names using glibc.

If a server responds to a query, its answer is accepted - this is the way BIND is specified to work, as stated in RFC 1034, section 5.3.3, on the Resolver Algorithm:
"
The top level algorithm has four steps:

   1. See if the answer is in local information, and if so return
      it to the client.

   2. Find the best servers to ask.

   3. Send them queries until one returns a response.
                                          ^^^
"
And it goes on to say:
"
Step 3 sends out queries until a response is received.
"
i.e. ANY response from a nameserver to a query will terminate the query. This approach minimizes network traffic, and has the advantage that server misconfiguration problems are quickly exposed, as you discovered.

Also, the way the BIND utilities behave agrees with how the glibc resolver behaves: if either gets an empty answer section referral, it does not try the next server. The BIND utilities' behavior should not be altered until the glibc resolver's behavior is altered.
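To make the "any response terminates the query" reading concrete, here is a minimal sketch of that loop. It is an illustration only, not code from BIND or glibc: the Response type, first_response_wins, and the two stand-in server functions are all hypothetical, and a server is modelled as a plain function that returns None on timeout.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Response:
    rcode: str                                   # e.g. "NOERROR", "NXDOMAIN", "SERVFAIL"
    answers: List[str] = field(default_factory=list)


# A "server" here is just a function: name -> Response, or None on timeout.
Server = Callable[[str], Optional[Response]]


def first_response_wins(name: str, servers: List[Server]) -> Optional[Response]:
    """Ask servers in order; stop at the first one that replies at all."""
    for query in servers:
        reply = query(name)
        if reply is None:
            continue        # no reply / unreachable: move on to the next server
        return reply        # ANY reply, even an empty one, ends the lookup
    return None


# Hypothetical stand-ins for the two servers in this report.
def empty_referral_server(name: str) -> Optional[Response]:
    return Response("NOERROR", [])               # replies, but with no answer records


def working_server(name: str) -> Optional[Response]:
    return Response("NOERROR", ["128.252.125.70"])
```

With the servers ordered [empty_referral_server, working_server], the function returns the empty reply and never consults the second server, which matches the behaviour reported here. Only a timeout (None) lets the loop advance.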
The BIND named nameserver, when in forwarding mode, will respond to an NXDOMAIN, SERVFAIL, or empty answer referral response by trying the next server in the forwarders list. So one workaround for this problem is to install the caching-nameserver package, and set up forwarding zones in /etc/named.conf, such as:

    zone "wuphys.wustl.edu" IN {
        type forward;
        forwarders { 192.135.10.4; 192.135.10.18; };
    };

and run named on boot.

I have raised glibc enhancement bug 162625 on this issue; once this is fixed in glibc, then it can be fixed in BIND.
OK, thank you for looking into this so exhaustively.

In principle you're right that the current scheme "has the advantage that server misconfiguration problems are quickly exposed". However, in this case, since the Windows majority were all doing fine, the attitude of the sysadmins was that this was a "problem with Linux". I showed them that one of their servers was misconfigured, and the result was... well, as we have just found, two months later it is still misconfigured.

So in the interest of robustness, usability, etc., I think it would be good for the glibc resolver to try another server, even if that is not exactly what the standard prescribes. I hope the glibc maintainers will see things that way.
A glibc patch was submitted for bug 162625 which makes glibc try the next server for empty answer responses.

Patches were submitted to ISC and ISC bugs were raised on this issue:

  #15005 : dighost.c should try next server on empty answer "recursion denied" referrals
  #15006 : host and nslookup should try next server on SERVFAIL responses

Once the glibc patch is applied, the BIND patch for #15005 will be applied. The #15005 issue will be fixed with the next BIND release.
This issue is now fixed, but only in Rawhide / FC5, since only the rawhide glibc-2.3.90-2+ has the requisite patch to fix glibc bug 162625 and BIND's resolver utilities should not give different results to glibc's resolver.

The glibc-2.3.90-2 resolver will now try the next server on empty referral responses.

In the rawhide bind-9.3.1-7, 'host' and 'nslookup' now by default will try the next server on a SERVFAIL response. Host has a new '-s' option to end the query on SERVFAIL, and nslookup has a '[no]fail' option, similar to dig's '[no]fail' option, except that it defaults to false.

Without host's '-s' option or nslookup's 'fail' option, or with dig's 'nofail' option, empty referral responses are treated in the same way as SERVFAIL responses: the next server is tried.
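The new default described above can be summarised in a short sketch. This is an illustration under stated assumptions, not the patched dighost.c or glibc code: lookup and its stop_on_fail parameter are hypothetical names standing in for host's '-s' / nslookup's 'fail' behaviour, an empty referral is simplified to "NOERROR with no answer records", and NXDOMAIN is assumed to still end the query because the comment above does not say otherwise.

```python
from typing import Callable, List, Optional, Tuple

# (rcode, answers) as returned by a simulated server; None models a timeout.
Reply = Tuple[str, List[str]]
Server = Callable[[str], Optional[Reply]]


def lookup(name: str, servers: List[Server], stop_on_fail: bool = False) -> Optional[Reply]:
    """Try servers in order, moving past SERVFAIL and empty replies unless told to stop."""
    last_reply = None
    for query in servers:
        reply = query(name)
        if reply is None:
            continue                      # timeout: always try the next server
        rcode, answers = reply
        last_reply = reply
        unhelpful = rcode == "SERVFAIL" or (rcode == "NOERROR" and not answers)
        if unhelpful and not stop_on_fail:
            continue                      # new default: fall through to the next server
        return reply                      # useful answer, NXDOMAIN, or stop_on_fail set
    return last_reply                     # every server was unhelpful (or silent)
```

With stop_on_fail left at its default, a first server that sends an empty reply no longer hides a working second server; setting stop_on_fail=True restores the old "first reply ends the query" outcome for SERVFAIL and empty replies.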
Jason Vas Dias, is an erratum being planned for FC3/FC4? If not, closing this as fixed in Rawhide would be appropriate.
Fedora Core 3 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product. If it is not a security issue and hasn't been resolved in the current FC5 updates or in the FC6 test release, reopen and change the version to match. Thank you!