Bug 160914 - bind does not go on to second DNS when first one gives "No answer"
Summary: bind does not go on to second DNS when first one gives "No answer"
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: bind
Version: 3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Martin Stransky
QA Contact:
URL:
Whiteboard:
Depends On: 162625
Blocks: 756458
 
Reported: 2005-06-18 14:30 UTC by Mark Alford
Modified: 2014-11-21 16:40 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 756458 (view as bug list)
Environment:
Last Closed: 2006-08-08 11:46:08 UTC
Type: ---
Embargoed:


Attachments
digchtxt.log dig output (464 bytes, text/plain)
2005-07-05 23:02 UTC, Mark Alford
digany.log dig output (1.83 KB, text/plain)
2005-07-05 23:04 UTC, Mark Alford
tcpdump.log (714 bytes, text/plain)
2005-07-05 23:06 UTC, Mark Alford

Description Mark Alford 2005-06-18 14:30:29 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050416 Fedora/1.0.3-1.3.1 Firefox/1.0.3

Description of problem:
If the first nameserver in /etc/resolv.conf exists and responds but gives
no useful info then bind should try the next one, but it doesn't.

Example:
> more /etc/resolv.conf
; generated by /sbin/dhclient-script
nameserver 192.135.10.4
nameserver 192.135.10.18

# So there are 2 nameservers in resolv.conf.
# Try looking up an address:  
> nslookup wuphys.wustl.edu
Server:         192.135.10.4
Address:        192.135.10.4#53

Non-authoritative answer:
*** Can't find wuphys.wustl.edu: No answer

# bind used the first nameserver and got "No answer". Then it gave up.
# Force it to use the second nameserver:
> nslookup wuphys.wustl.edu 192.135.10.18
Server:         192.135.10.18
Address:        192.135.10.18#53

Non-authoritative answer:
Name:   wuphys.wustl.edu
Address: 128.252.125.70

# So it should have gone on and tried the second one when the first one
# proved useless.

Version-Release number of selected component (if applicable):
ord/info>rpm -q bind-utils
bind-utils-9.2.5-1


How reproducible:
Always

Steps to Reproduce:
see above


  

Actual Results:  see above

Expected Results:  see above

Additional info:

Comment 1 Jason Vas Dias 2005-07-05 21:40:53 UTC
Sorry for the delay in responding to this bug report - I just 
returned from vacation today.

nslookup is deprecated - you should be using "host" or "dig".

Do you get the same results from "host wuphys.wustl.edu." ?

The first nameserver returns a response with an empty answer section,
which is the only way the "No answer" response string can be generated 
by nslookup.

Note that the whole DNS system depends on there being one set of 
authoritative data for any given zone: if more than one server can
produce responses for the same DNS name, only one server can be an
authoritative "master" server for the zone containing the name, 
and all other servers for the zone must be slaves of the single 
master server.

So the fact that two servers return different responses for the same
name points to a fundamental misconfiguration of them.

If a server had returned no response at all or was unreachable, 
then the next server would have been tried.
If a server returns NXDOMAIN, SERVFAIL or an empty answer section,
but says that it is authoritative for the zone, then no other 
servers are tried, because there can be only one authoritative 
content for a given zone's data in the DNS.
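The rule described above can be sketched as a small simulation. This is only an illustration, not BIND's code: the `Response` class, the `lookup_first_response` and `fake_query` helpers, and the canned replies are hypothetical, and a `None` reply stands in for a timeout or an unreachable server.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    rcode: str                                   # e.g. "NOERROR", "NXDOMAIN", "SERVFAIL"
    answers: list = field(default_factory=list)  # empty list = empty answer section


def lookup_first_response(servers, query):
    """Return (server, response) for the first server that responds at all."""
    for server in servers:
        response = query(server)      # None models "no response / unreachable"
        if response is not None:
            return server, response   # any response, even an empty one, ends the search
    return None, None


# Hypothetical canned replies mirroring the two servers in this report.
CANNED = {
    "192.135.10.4": Response("NOERROR", []),
    "192.135.10.18": Response("NOERROR", ["128.252.125.70"]),
}


def fake_query(server):
    return CANNED.get(server)


server, response = lookup_first_response(["192.135.10.4", "192.135.10.18"], fake_query)
print(server, response.answers)  # prints: 192.135.10.4 []
```

With the servers in this order the empty answer from 192.135.10.4 wins, which matches the nslookup output in the description.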

I've not been able to cause a 9.2.5 or 9.3.1 server to return a 
DNS response with an empty answer section.

Please supply some further information for this bug report:

1. What BIND version is running on the first server ? 
   You can determine this with the following query:
   # dig CH TXT version.bind. @192.135.10.4

2. What SOA + NS information do the servers have about the zone ?
   If you have access to the master zone database files for this
   zone, please send them to me - otherwise, do:
   # ( dig wuphys.wustl.edu. ANY @192.135.10.4; \
       dig wuphys.wustl.edu. ANY @192.135.10.18; \
     ) | tee /tmp/digany.log
   and append the /tmp/digany.log file to this bug report
   (or send it to jvdias).

3. Please gather a tcpdump of DNS traffic during a reproduction of
   the problem:
   # tcpdump -nl -vvv -s 2048 port domain 2>&1 | tee /tmp/tcpdump.log&
   # nslookup wuphys.wustl.edu
   # pkill tcpdump
   and append the /tmp/tcpdump.log to this bug report or send it to 
   me.

Thank you!



Comment 2 Mark Alford 2005-07-05 23:02:24 UTC
Created attachment 116392
digchtxt.log  dig output

Output of dig query

Comment 3 Mark Alford 2005-07-05 23:04:10 UTC
Created attachment 116393
digany.log dig output

Output of dig query of bad and good servers

Comment 4 Mark Alford 2005-07-05 23:06:03 UTC
Created attachment 116394
tcpdump.log 

tcpdump of DNS traffic

Comment 5 Mark Alford 2005-07-05 23:09:43 UTC
> Do you get the same results from "host wuphys.wustl.edu." ?
Yes I do. 192.135.10.4 returns "no answer", but 192.135.10.18 gives an answer.

----
> host wuphys.wustl.edu 192.135.10.4
Using domain server:
Name: 192.135.10.4
Address: 192.135.10.4#53
Aliases:

> host wuphys.wustl.edu 192.135.10.18
Using domain server:
Name: 192.135.10.18
Address: 192.135.10.18#53
Aliases:

wuphys.wustl.edu has address 128.252.125.70
----

> So the fact that two servers return different responses for the same
> name points to a fundamental misconfiguration of them.

Yes, it does. I still think it is valid to call this a bug because
it would be easy to make bind more robust against this kind of 
misconfiguration, by having it go on to another server even if it
gets an authoritative "no answer".
The problem arose for me when I was visiting an institution abroad,
where they set us up with a local wireless network connection.
Everyone with Windows could connect to their home institutions, but those
with Linux could not. I finally traced the problem to this misconfiguration
of the name servers. But the fact that Windows is robust against it and
Linux isn't indicates that this is an area where Linux could be improved.



> 1. What BIND version is running on the first server ? 
>    You can determine this with the following query:
>    # dig CH TXT version.bind. @192.135.10.4

See attached file digchtxt.log
I am happy to do this (and the others below), but I am not sure why
you want me to: couldn't you type the command just as easily yourself?

> 2. What SOA + NS information do the servers have about the zone ?
>    If you have access to the master zone database files for this
>    zone, please send them to me - otherwise, do:
>    # ( dig wuphys.wustl.edu. ANY @192.135.10.4; \
>        dig wuphys.wustl.edu. ANY @192.135.10.18; \
>      ) | tee /tmp/digany.log

See attached file digany.log


> 3. Please gather a tcpdump of DNS traffic during a reproduction of
>    the problem:
>    # tcpdump -nl -vvv -s 2048 port domain 2>&1 | tee /tmp/tcpdump.log&
>    # nslookup wuphys.wustl.edu
>    # pkill tcpdump
>    and append the /tmp/tcpdump.log to this bug report or send it to 
>    me.

See attached file tcpdump.log. 

I explicitly told nslookup to use the misconfigured server
192.135.10.4, so the commands I actually typed were:
   # tcpdump -nl -vvv -s 2048 port domain 2>&1 | tee /tmp/tcpdump.log&
   # nslookup wuphys.wustl.edu 192.135.10.4
   # pkill tcpdump



Comment 6 Jason Vas Dias 2005-07-06 22:08:12 UTC
Many thanks for the information. Sorry, I did not realize that 
the 192.135.10.4 server also had a public internet address and 
that I could have done the CH TXT queries myself. This server
certainly seems most unwell, as it does not even know what version
it is.

The 192.135.10.4 server has no authoritative data for the zone, and 
recursion is disabled, so it sends an empty answer section and
the root nameservers as a referral in the additional section.
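A response of this kind can be recognised from the fixed 12-byte DNS message header alone. The sketch below is a heuristic under stated assumptions, not the check BIND or glibc actually uses: the function name is hypothetical, and requiring at least one authority or additional record is my own choice for "carries a referral". The flag bit positions are those of the standard DNS header.

```python
import struct

QR = 0x8000          # message is a response
AA = 0x0400          # authoritative answer
RA = 0x0080          # recursion available
RCODE_MASK = 0x000F  # low four bits hold the response code (0 = NOERROR)


def looks_like_recursion_denied_referral(message: bytes) -> bool:
    """Heuristic: NOERROR response, empty answer section, neither
    authoritative nor offering recursion, but carrying other records."""
    if len(message) < 12:
        return False
    _id, flags, _qdcount, ancount, nscount, arcount = struct.unpack("!6H", message[:12])
    return (
        bool(flags & QR)
        and (flags & RCODE_MASK) == 0
        and ancount == 0
        and not (flags & AA)
        and not (flags & RA)
        and (nscount + arcount) > 0
    )


# Hypothetical header: response, RD echoed, no AA/RA, 0 answers, 13 authority records.
referral = struct.pack("!6H", 0x1234, 0x8100, 1, 0, 13, 0)
print(looks_like_recursion_denied_referral(referral))
```

A normal recursive answer would have the RA bit set and a non-zero answer count, so it would not match.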

Unfortunately, this problem cannot be fixed in BIND currently:
the issue is not what BIND does, but what the glibc resolver does.

The glibc resolver also does not try another server once a 
server sends an NXDOMAIN or empty answer response to a query - 
this is why your Linux machines failed to "connect" - 
all applications would be using the glibc resolver, not BIND.

So fixing this problem in the BIND utilities would give 
the misleading impression that the DNS setup was OK, when 
applications would still be unable to resolve DNS names
using glibc.

If a server responds to a query, its answer is accepted -
this is the way BIND is specified to work, as stated in RFC 1034,
section 5.3.3, on the Resolver Algorithm:
"
The top level algorithm has four steps:

   1. See if the answer is in local information, and if so return
      it to the client.

   2. Find the best servers to ask.

   3. Send them queries until one returns a response.
"                             ^^^
And it goes on to say:
"
Step 3 sends out queries until a response is received.
"
ie. ANY response from a nameserver to a query will terminate the 
query. 

This approach minimizes network traffic, and has the advantage that
server misconfiguration problems are quickly exposed, as you 
discovered. 

Also, the way the BIND utilities behave agrees with how the glibc
resolver behaves: if either gets an empty-answer referral,
it does not try the next server.  The BIND utilities' behavior 
should not be altered until the glibc resolver's behavior is altered.

The BIND named nameserver, when in forwarding mode, will respond to an
NXDOMAIN, SERVFAIL, or empty answer referral response by trying the
next server in the forwarders list.
So one workaround for this problem is to install the
caching-nameserver package, and setup forwarding zones in
/etc/named.conf, such as: 
  zone "wuphys.wustl.edu" IN {
       type forward;
       forwarders { 192.135.10.4; 192.135.10.18; };
  };
and run named on boot.

I have raised glibc enhancement bug 162625 on this issue; once
this is fixed in glibc, then it can be fixed in BIND.






Comment 7 Mark Alford 2005-07-06 22:20:18 UTC
OK, thank you for looking into this so exhaustively.

In principle you're right that the current scheme "has the 
advantage that server misconfiguration problems are quickly exposed".
However in this case, since the Windows majority were all doing fine,
the attitude of the sysadmins was that this was a "problem with Linux".
I showed them that one of their servers was misconfigured, and the
result was... well, as we have just found, two months later it is
still misconfigured.

So in the interest of robustness, usability etc I think it would be
good for the glibc resolver to try another server, even if that is not
exactly what the standard prescribes. I hope the glibc maintainers
will see things that way.

Comment 8 Jason Vas Dias 2005-07-07 20:53:28 UTC
A glibc patch was submitted for bug 162625 which makes glibc try the next
server for empty answer responses.

Patches were submitted to ISC and ISC bugs were raised on this issue:
  #15005 : dighost.c should try next server on empty answer "recursion denied" referrals
  #15006 : host and nslookup should try next server on SERVFAIL responses

Once the glibc patch is applied, the BIND patch for #15005 will be applied.
The #15005 issue will be fixed with the next BIND release.
 


Comment 9 Jason Vas Dias 2005-07-12 21:46:28 UTC
This issue is now fixed, but only in Rawhide / FC5, since only the rawhide
glibc-2.3.90-2+ has the requisite patch to fix glibc bug 162625 and
BIND's resolver utilities should not give different results to glibc's 
resolver. 
The glibc-2.3.90-2 resolver will now try the next server on empty
referral responses.
In the rawhide bind-9.3.1-7, 'host' and 'nslookup' now by default will 
try the next server on a SERVFAIL response. Host has a new '-s' option
to end the query on SERVFAIL, and nslookup has a '[no]fail' option, similar
to dig's '[no]fail' option, except that it defaults to false. Without host's
'-s' option or nslookup's 'fail' option, or with dig's 'nofail' option, 
empty referral responses are treated in the same way as SERVFAIL responses:
the next server is tried.
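The changed behaviour can be sketched as a simulation with a `fail` switch. This is an illustration only, not the patched dighost.c: the `resolve` function and the outcome labels are hypothetical, `fail=True` stands in for host's '-s' or nslookup's 'fail', and treating NXDOMAIN as final is an assumption, since the comment above mentions only SERVFAIL and empty referrals.

```python
# Outcomes that cause the next server to be tried unless `fail` is set.
TRY_NEXT = {"SERVFAIL", "EMPTY_REFERRAL"}


def resolve(servers, query, fail=False):
    """Return (server, outcome) under the new policy.

    `query(server)` returns an outcome label such as "ANSWER", "NXDOMAIN",
    "SERVFAIL" or "EMPTY_REFERRAL", or None if the server did not respond.
    """
    last = None
    for server in servers:
        outcome = query(server)
        if outcome is None:
            continue                  # no response: always try the next server
        last = (server, outcome)
        if outcome in TRY_NEXT and not fail:
            continue                  # new default: keep going on these outcomes
        return last                   # anything else (or fail=True) ends the lookup
    return last                       # every server failed: report the last failure


replies = {"192.135.10.4": "EMPTY_REFERRAL", "192.135.10.18": "ANSWER"}
servers = ["192.135.10.4", "192.135.10.18"]
print(resolve(servers, replies.get))             # falls through to the second server
print(resolve(servers, replies.get, fail=True))  # stops at the first server
```

With the default setting the lookup from the original report now reaches 192.135.10.18; with `fail=True` it reproduces the old behaviour.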

Comment 10 Rahul Sundaram 2005-09-05 08:19:43 UTC
Jason Vas Dias,

Is an erratum being planned for FC3/FC4? If not, closing this as fixed in
Rawhide would be appropriate.

Comment 12 Matthew Miller 2006-07-10 20:38:50 UTC
Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!


