140528 – named waits a few seconds on cache misses

Bug 140528 - named waits a few seconds on cache misses

Summary: named waits a few seconds on cache misses

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	bind
Sub Component:
Version:	3
Hardware:	athlon
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jason Vas Dias
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	141113 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2004-11-23 14:11 UTC by Benjamin S. Scarlet
Modified:	2007-11-30 22:10 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-12-23 20:35:05 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
log as requested (4.28 KB, text/plain) 2004-11-24 02:28 UTC, Benjamin S. Scarlet	no flags	Details
longer log, more as requested (25.99 KB, text/plain) 2004-11-24 02:43 UTC, Benjamin S. Scarlet	no flags	Details
patch against the latest ISC BIND 9.3.0 source to fix this issue. (689 bytes, patch) 2004-11-29 14:19 UTC, Jason Vas Dias	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2004:696	0	normal	SHIPPED_LIVE	Updated BIND packages	2004-12-23 05:00:00 UTC

Description Benjamin S. Scarlet 2004-11-23 14:11:43 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
named seems to be waiting several seconds (usually roughly four)
between receiving a request for which it has no cache entry and
issuing a request upstream to get an answer.

The delay doesn't seem to be completely deterministic - it doesn't
necessarily happen the first time after named is started, but it sets
in fairly quickly and then is quite repeatable.


Version-Release number of selected component (if applicable):
bind-9.2.4-2

How reproducible:
Sometimes

Steps to Reproduce:
1. start named
2. start "tethereal -i any port 53" to watch what's going on
3. run "dig one.com @localhost", "dig two.com @localhost", ...
    

Actual Results:  for example, these are two successive lines from the
tethereal capture:

132.329401    127.0.0.1 -> 127.0.0.1    DNS Standard query A four.com
136.334240 xx.xx.xx.xxx -> yyy.yy.yy.yy DNS Standard query A four.com

Note the four second delay between the original query, from dig to
named, and named's reaction, asking upstream to get an answer

Expected Results:  no significant delay

Additional info:


Some google searching found me messages about a different version
which exhibited this symptom:
http://www.monkey.org/openbsd/archive/misc/0411/msg00260.html

Comment 1 Jason Vas Dias 2004-11-23 15:13:07 UTC

It sounds like your server is trying to contact other nameservers
which do not respond and time out .

Do you have 'forwarders { ... ; };' nameservers defined ?

Note that only looking at packets using port 53 may not be sufficient -
do you have :
 ' options { ... 
             query-source port 53;
             ...
           };
 '
defined in named.conf ?

Unless you define this option, named will use any available
port for the source UDP port from which queries originate;
intervening firewalls may block all ports not in a certain 
list, so try setting the query-source option as above,
and see if it fixes the problem.

If it does not fix your problem, please run the following
tcpdump command and append the output to this bug (or send
it to jvdias) :

   tcpdump -nl -vv -i any src port 53 or dst port 53 2>&1 | tee
/tmp/tcpdump.log

Please append / send me the /tmp/tcpdump.log and your /etc/named.conf
files.

Comment 2 Benjamin S. Scarlet 2004-11-24 02:21:39 UTC

Thanks for your prompt response.

1) I do not have any forwarders configured.

2) The filter I specified to tethereal "port 53", (with neither "src"
nor "dst") means the same as your "src port 53 or dst port 53",
according to both the documentation and my experience.

3) While I'm currently experiencing the problem with a modified named
configuration, I have verified that this problem exists (at least for
me) with the default configuration of bind + caching-nameserver, on a
completely fresh install of FC3. I apologize for failing to mention
that fact in the original bug.

Comment 3 Benjamin S. Scarlet 2004-11-24 02:28:55 UTC

Created attachment 107371 [details]
log as requested

As you'll see from the log, this was not made with the query source port
restricted to 53, but it was made with precisely the tcpdump command you
requested (to rule out my being wrong about tethereal filter syntax). Since no
dns packets at all appeared for four seconds after the initial packet from dig
to named (as in my original experiment), I contend that your external filtering
hypothesis is exploded.

Comment 4 Benjamin S. Scarlet 2004-11-24 02:43:19 UTC

Created attachment 107372 [details]
longer log, more as requested

This log was made with the source port restricted to 53. If you look closely,
you can see a pattern which might be interesting. I restarted named just before
generating this log, then ran successive queries for foo<number>.com, with
<number> running 0,1,2,...

The first few queries behaved reasonably, but by the time I got to foo6 the
delay had set in. It seems to persist once it appears, and it appears to show
up relatively quickly. However, as in this example, the first few queries
issued to a fresh process seem to do all right.

Comment 5 Peter J. Dohm 2004-11-24 07:30:26 UTC

hi all.

i believe this problem is actually identical to a fairly well
documented (if you happen to be patient enough to search for it (or
very clever)) problem with bind9 in multi-ip-protocol-stack
(ipv4/ipv6) environments.

i've found a feature request for EL3
(http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=134529) that
suggests that this is understood, but i'm not convinced people see how
broad a spectrum of people are impacted.  for example, you'll find
everyone in the MacOS X world complaining as well!

it appears that recently the root nameservers (or the first level
subordinates for com, etc) starting serving up IPV6 records, and that
has either tickled some sort of bug in bind, the ip stack, whatever. 
i've not dug into the actual root cause yet, but i have followed other
people's advice in employing the work-around (on macosx forums) and
re-compiled bind without ipv6 support, and the symptoms go away...

i think this might have something to do with the ipv6-over-ipv4
support automatically enabled by the kernel in core 3... quickly
trying to disable didn't work because i couldn't seem to get the ipv6
module NOT to load via traditional means...

for the time being, i've created some unofficial rpms for fedora core
3 that are available at http://dohm.dyndns.org/ that will address the
immediate issue.  the ONLY change i made was to pass --disable-ipv6 to
the configure script when building these instead of --enable-ipv6....
 otherwise, everything is identical.

cheers.

Peter Dohm

Comment 6 Peter J. Dohm 2004-11-24 14:21:02 UTC

forgot to request being added to cc list... sorry 'bout that.

Comment 7 Jason Vas Dias 2004-11-24 16:54:47 UTC

OK, I've now been able to duplicate this problem.
Yes, some root servers are supplying their AAAA records before
their A records, so named by default tries to query these root
servers by their IPV6 addresses first.

IF the IPv6 stack is not initialized, then these queries will time out.

In FC3, IPv6 is not initialized by default.

Initializing IPv6 stops the queries timing out and fixes the problem:
In /etc/sysconfig/network, add the line:   
' NETWORKING_IPV6=yes
'

With NETWORKING_IPV6=yes in /etc/sysconfig/network, the timeouts 
do not occur . (Actually, I think the IP subsystem is then able to 
find out that there are no IPv6 routes and the initial IPv6 sends
then fail immediately rather than timing out).

I don't agree that we should totally disable IPv6 support in 
named as comment #5 suggests.

BIND 9.3.0 has much better control of IPv6 features such as 
the '-4' and '-6' named options. BIND 9.3.0 is still being 
tested, but will be released shortly for FC4 / rawhide .
Meanwhile, you can download it from:
http://people.redhat.com/~jvdias/bind/9.3.0

Comment 8 Jason Vas Dias 2004-11-24 17:14:01 UTC

Previous respondent was not added to CC list - now added OK.

This bug is resolved:

 Adding 
         NETWORKING_IPV6=yes
 to /etc/sysconfig/network
 prevents the queries to root nameservers with IPv6 addresses
 timing out.

Comment 9 Benjamin S. Scarlet 2004-11-24 18:25:32 UTC

This change (and a subsequent reboot) did not solve my problem.

Comment 10 Peter J. Dohm 2004-11-24 18:44:23 UTC

nor did it for me.

curious.

jason, what might be different for you configuration?

i'm using iptables to firewall, in case you think that might be
somehow related...

Comment 11 Rob van Nieuwkerk 2004-11-28 18:30:49 UTC

Adding "NETWORKING_IPV6=yes" to /etc/sysconfig/network
(and rebooting) does not solve the problem for me either.

Also tried checking the IPV6 checkbox in system-config-network
for eth0. Same result: problem stays.

There often (always ?) is this 4s delay in bind before sending
anything to the network when it is asked for an uncached entry.

I am not using iptables.

Jason: maybe you tested on an IPV6 connected network ?

Comment 12 Jason Vas Dias 2004-11-29 14:08:52 UTC

*** Bug 141113 has been marked as a duplicate of this bug. ***

Comment 13 Jason Vas Dias 2004-11-29 14:16:57 UTC

Yes, I found the problem - there was an IPV6 router on my network.

Now, without setting up IPV6, the root servers that supply their
IPV6 AAAA records before their A records cause 2 second timeouts
per server AAAA address .

This is because of the hardcoded 2 second timeout in resolver.c,
which was not handling fatal send errors such as ENETUNREACH, but
just let the timeouts expire.

BIND 9.3.0 does exactly the same - the only option would be to
disable IPv6 completely with the '-4' option.

I have now developed a patch to fix this (appended), so that if
a send causes a fatal error such as ENETUNREACH, the query timer
is immediately expired, and the thread then tries the next address
for the server when the timer fires within 10uS of the send error
occurring.

I raised ISC bug 13153 on this issue containing the patch which will
hopefully be in subsequent BIND releases.

Comment 14 Jason Vas Dias 2004-11-29 14:19:16 UTC

Created attachment 107533 [details]
patch against the latest ISC BIND 9.3.0 source to fix this issue.

Comment 15 Jason Vas Dias 2004-11-29 14:22:03 UTC

 The fix will be in BIND 9.2.4-14 (FC3-U1, RHEL-4-U1, RHEL-3-U5) and
 in BIND 9.3.0-3 (FC4/rawhide) and can be downloaded from:
    http://people.redhat.com/~jvdias/bind/

Comment 16 Rob van Nieuwkerk 2004-11-30 00:05:28 UTC

Hi Jason,

Can you please fix:
   http://people.redhat.com/~jvdias/bind/bind-9.2.4-4.src.rpm

It contains only zeroes ..

Comment 17 Rob van Nieuwkerk 2004-11-30 00:21:11 UTC

FYI Jason: with your bind-9.2.4-4 RPMs the annoying delay is gone.

Comment 18 Benjamin S. Scarlet 2004-11-30 04:03:02 UTC

Jason,

Your fix works for me also. Thank you.

Comment 19 Jason Vas Dias 2004-11-30 15:08:08 UTC

 Yes, as you figured out, the version numbers should have been:
      bind-9.2.4-4 (FC2, FC3) (_EL: RHEL-4, _EL3: RHEL-3)
  or
      bind-9.3.0-1 (FC4/Rawhide)
 I have now restored the corrupted SRC RPM:
    http://people.redhat.com/~jvdias/bind/bind-9.2.4-4.src.rpm
 This version should be available shortly through FC3 updates.

Comment 20 Graham Freeman 2004-12-22 01:42:44 UTC

This bug crept up on me with RHEL3-U5 when I removed
caching-nameservers at the same time.  (I run some authoritative
nameservers on a few RHEL-3 machines.)  Adding NETWORKING_IPV6=yes to
/etc/sysconfig/network; running service network restart; and then
service named restart solved the problem for me.  I think I have IPv6
routers in my network, but I'm not completely sure.

Comment 21 Graham Freeman 2004-12-22 01:48:24 UTC

Oops, I meant RHEL3-U4.  (Package bind-9.2.4-1_EL3 and related.)

Comment 22 Jason Vas Dias 2004-12-22 15:23:38 UTC

 The current bind-9.2.4-1_EL3 package in RHEL-3-U4 does not have the
 patch that stops timeouts when trying to contact servers whose 
 addresses are "unreachable" for any reason. The RHEL-3 bind version
 which does is bind-9.2.4-4_EL3, available from:
  http://people.redhat.com/~jvdias/bind/RHEL-3/
 This version is currently undergoing the QA process for inclusion
 in RHEL-3-U5 .

Comment 23 John Flanagan 2004-12-23 20:35:05 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-696.html

Comment 24 Trevor Cordes 2004-12-28 15:35:48 UTC

How does this work for FC3?  Will the errata be issued for FC or work
its way into the next bind errata for FC3?  Just wondering...

Comment 25 Pete Krawczyk 2005-06-18 14:59:02 UTC

This fix does not appear to have been propagated to RHEL4U1, epecially since the
channel shows a build date of October 18th, 2004.  Can someone confirm this
and/or apply the errata to the channel?

Comment 26 Tulio Salinas 2005-08-10 18:35:17 UTC

I've just installed RHEL 4 ES made an up2date and the bug is still there. After
doing the up2date the version is 9.2.4-2. The fix has not been propagated to RHEL4

I had to manually install version 9.3.1-1 from 
http://people.redhat.com/~jvdias/bind/RHEL-4/9.3.1-1/

Comment 27 Jason Vas Dias 2005-08-10 19:39:00 UTC

Yes, the fix for this bug was submitted to RHEL-4 in bind-9.2.4-4_EL4.
The current head revision of bind for RHEL-4 is bind-9.2.4-8_EL4 .

Neither of these versions was judged by our product management team
to fix issues of sufficient customer importance to warrant an upgrade.

The version I would recommend all RHEL-4 bind users to install is the
one you found:
http://people.redhat.com/~jvdias/bind/RHEL-4/9.3.1-1_EL4/

But we are not allowed to upgrade any RHEL package by a major release,
as this would violate our service level agreement.

We can only fix bugs in RHEL releases that have been specifically 
through Red Hat customer support.

If you want this bug to be fixed in RHEL-4, please contact your Red Hat
customer support representative through your subscription channel.

Comment 28 Dirk 2005-08-31 08:14:07 UTC

So for me it works to only add 

alias net-pf-10 off to /etc/modprobe.conf

To completely disable ipv6 

But i agree it would be better if Redhat soon release the new Package which fix
the Problem, because bind is a very common used Software, and Problems within
this should fixed imidately.

Note You need to log in before you can comment on or make changes to this bug.