From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0

Description of problem:
named seems to wait several seconds (usually roughly four) between receiving a request for which it has no cache entry and issuing a request upstream to get an answer. The delay is not completely deterministic: it doesn't necessarily happen the first time after named is started, but it sets in fairly quickly and is then quite repeatable.

Version-Release number of selected component (if applicable):
bind-9.2.4-2

How reproducible:
Sometimes

Steps to Reproduce:
1. Start named.
2. Start "tethereal -i any port 53" to watch what's going on.
3. Run "dig one.com @localhost", "dig two.com @localhost", ...

Actual Results:
For example, these are two successive lines from the tethereal capture:

132.329401 127.0.0.1 -> 127.0.0.1 DNS Standard query A four.com
136.334240 xx.xx.xx.xxx -> yyy.yy.yy.yy DNS Standard query A four.com

Note the four-second delay between the original query (from dig to named) and named's reaction, asking upstream to get an answer.

Expected Results:
No significant delay.

Additional info:
Some Google searching found messages about a different version which exhibited this symptom:
http://www.monkey.org/openbsd/archive/misc/0411/msg00260.html
It sounds like your server is trying to contact other nameservers which do not respond, so the queries time out. Do you have any 'forwarders { ... ; };' nameservers defined?

Note that looking only at packets on port 53 may not be sufficient. Do you have 'options { ... query-source port 53; ... };' defined in named.conf? Unless you define this option, named will use any available port as the UDP source port from which queries originate; an intervening firewall may block all ports not in a certain list. Try setting the query-source option as above and see whether it fixes the problem.

If it does not fix your problem, please run the following tcpdump command and attach the output to this bug (or send it to jvdias):

tcpdump -nl -vv -i any src port 53 or dst port 53 2>&1 | tee /tmp/tcpdump.log

Please attach / send me the /tmp/tcpdump.log and your /etc/named.conf files.
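For reference, a minimal named.conf options stanza with the suggested setting might look like the sketch below (the directory path and the commented-out forwarder address are placeholders, not part of this bug's actual configuration):

```
options {
        directory "/var/named";

        /* Force outgoing queries to use UDP source port 53, so that
         * firewalls which only pass port 53 do not drop them. */
        query-source address * port 53;

        /* If forwarders are defined, named queries them instead of the
         * root servers; an unreachable forwarder causes timeouts.
         * The address below is a placeholder. */
        // forwarders { 192.0.2.1; };
};
```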
Thanks for your prompt response.

1) I do not have any forwarders configured.

2) The filter I specified to tethereal, "port 53" (with neither "src" nor "dst"), means the same as your "src port 53 or dst port 53", according to both the documentation and my experience.

3) While I'm currently experiencing the problem with a modified named configuration, I have verified that this problem exists (at least for me) with the default configuration of bind + caching-nameserver on a completely fresh install of FC3. I apologize for failing to mention that fact in the original bug.
Created attachment 107371 [details]
log as requested

As you'll see from the log, this was not made with the query source port restricted to 53, but it was made with precisely the tcpdump command you requested (to rule out my being wrong about tethereal filter syntax). Since no DNS packets at all appeared for four seconds after the initial packet from dig to named (as in my original experiment), I contend that your external-filtering hypothesis is exploded.
Created attachment 107372 [details]
longer log, more as requested

This log was made with the source port restricted to 53. If you look closely, you can see a pattern which might be interesting. I restarted named just before generating this log, then ran successive queries for foo<number>.com, with <number> running 0, 1, 2, ... The first few queries behaved reasonably, but by the time I got to foo6 the delay had set in. It seems to persist once it appears, and it appears to show up relatively quickly. However, as in this example, the first few queries issued to a fresh process seem to do all right.
Hi all. I believe this problem is actually identical to a fairly well documented (if you happen to be patient enough to search for it, or very clever) problem with BIND 9 in dual-stack (IPv4/IPv6) environments. I've found a feature request for EL3 (http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=134529) that suggests this is understood, but I'm not convinced people see how broad a spectrum of users is impacted; for example, you'll find everyone in the Mac OS X world complaining as well!

It appears that the root nameservers (or the first-level subordinates for com, etc.) recently started serving up IPv6 records, and that has tickled some sort of bug in BIND, the IP stack, or whatever. I've not dug into the actual root cause yet, but I have followed other people's advice (on Mac OS X forums) in employing the work-around of recompiling BIND without IPv6 support, and the symptoms go away.

I think this might have something to do with the IPv6-over-IPv4 support automatically enabled by the kernel in Core 3. Quickly trying to disable it didn't work, because I couldn't seem to get the ipv6 module NOT to load via traditional means.

For the time being, I've created some unofficial RPMs for Fedora Core 3, available at http://dohm.dyndns.org/, that address the immediate issue. The ONLY change I made was to pass --disable-ipv6 to the configure script instead of --enable-ipv6 when building; otherwise, everything is identical.

Cheers,
Peter Dohm
Forgot to request being added to the CC list... sorry 'bout that.
OK, I've now been able to duplicate this problem.

Yes, some root servers are supplying their AAAA records before their A records, so named by default tries to query those root servers at their IPv6 addresses first. If the IPv6 stack is not initialized, these queries time out. In FC3, IPv6 is not initialized by default.

Initializing IPv6 stops the queries timing out and fixes the problem: in /etc/sysconfig/network, add the line

NETWORKING_IPV6=yes

With NETWORKING_IPV6=yes in /etc/sysconfig/network, the timeouts do not occur. (Actually, I think the IP subsystem is then able to find out that there are no IPv6 routes, so the initial IPv6 sends fail immediately rather than timing out.)

I don't agree that we should totally disable IPv6 support in named as comment #5 suggests. BIND 9.3.0 has much better control of IPv6 features, such as the '-4' and '-6' named options. BIND 9.3.0 is still being tested, but will be released shortly for FC4 / rawhide. Meanwhile, you can download it from:
http://people.redhat.com/~jvdias/bind/9.3.0
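For clarity, a sketch of what the resulting /etc/sysconfig/network might look like; the NETWORKING and HOSTNAME lines are typical FC3 content shown only as context, not taken from any configuration in this bug:

```
NETWORKING=yes
NETWORKING_IPV6=yes
HOSTNAME=localhost.localdomain
```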
The previous respondent was not added to the CC list - now added.

This bug is resolved: adding NETWORKING_IPV6=yes to /etc/sysconfig/network prevents the queries to root nameservers with IPv6 addresses from timing out.
This change (and a subsequent reboot) did not solve my problem.
Nor did it for me. Curious. Jason, what might be different about your configuration? I'm using iptables as a firewall, in case you think that might somehow be related...
Adding "NETWORKING_IPV6=yes" to /etc/sysconfig/network (and rebooting) does not solve the problem for me either. I also tried checking the IPV6 checkbox in system-config-network for eth0. Same result: the problem stays. There is often (always?) this 4-second delay in bind before it sends anything to the network when asked for an uncached entry. I am not using iptables. Jason: maybe you tested on an IPv6-connected network?
*** Bug 141113 has been marked as a duplicate of this bug. ***
Yes, I found the problem - there was an IPv6 router on my network.

Now, without IPv6 set up, the root servers that supply their IPv6 AAAA records before their A records cause a 2-second timeout per server AAAA address. This is because of the hardcoded 2-second timeout in resolver.c, which was not handling fatal send errors such as ENETUNREACH, but just let the timeouts expire. BIND 9.3.0 does exactly the same - the only option there would be to disable IPv6 completely with the '-4' option.

I have now developed a patch to fix this (attached), so that if a send causes a fatal error such as ENETUNREACH, the query timer is immediately expired; when the timer fires, within 10uS of the send error occurring, the thread tries the next address for the server.

I have raised ISC bug 13153 on this issue, containing the patch, which will hopefully be in subsequent BIND releases.
Created attachment 107533 [details] patch against the latest ISC BIND 9.3.0 source to fix this issue.
The fix will be in BIND 9.2.4-14 (FC3-U1, RHEL-4-U1, RHEL-3-U5) and in BIND 9.3.0-3 (FC4/rawhide) and can be downloaded from: http://people.redhat.com/~jvdias/bind/
Hi Jason,

Can you please fix:
http://people.redhat.com/~jvdias/bind/bind-9.2.4-4.src.rpm

It contains only zeroes...
FYI Jason: with your bind-9.2.4-4 RPMs the annoying delay is gone.
Jason, Your fix works for me also. Thank you.
Yes, as you figured out, the version numbers should have been:

bind-9.2.4-4 (FC2, FC3) (_EL4: RHEL-4, _EL3: RHEL-3)
or bind-9.3.0-1 (FC4/Rawhide)

I have now restored the corrupted SRC RPM:
http://people.redhat.com/~jvdias/bind/bind-9.2.4-4.src.rpm

This version should be available shortly through FC3 updates.
This bug crept up on me with RHEL3-U5 when I removed caching-nameservers at the same time. (I run some authoritative nameservers on a few RHEL-3 machines.) Adding NETWORKING_IPV6=yes to /etc/sysconfig/network; running service network restart; and then service named restart solved the problem for me. I think I have IPv6 routers in my network, but I'm not completely sure.
Oops, I meant RHEL3-U4. (Package bind-9.2.4-1_EL3 and related.)
The current bind-9.2.4-1_EL3 package in RHEL-3-U4 does not have the patch that stops timeouts when trying to contact servers whose addresses are "unreachable" for any reason. The RHEL-3 bind version which does is bind-9.2.4-4_EL3, available from:
http://people.redhat.com/~jvdias/bind/RHEL-3/
This version is currently undergoing the QA process for inclusion in RHEL-3-U5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-696.html
How does this work for FC3? Will the errata be issued for FC or work its way into the next bind errata for FC3? Just wondering...
This fix does not appear to have been propagated to RHEL4U1, especially since the channel shows a build date of October 18th, 2004. Can someone confirm this and/or apply the errata to the channel?
I've just installed RHEL 4 ES, ran up2date, and the bug is still there. After the up2date, the version is 9.2.4-2; the fix has not been propagated to RHEL 4. I had to manually install version 9.3.1-1 from:
http://people.redhat.com/~jvdias/bind/RHEL-4/9.3.1-1/
Yes, the fix for this bug was submitted to RHEL-4 in bind-9.2.4-4_EL4. The current head revision of bind for RHEL-4 is bind-9.2.4-8_EL4. Neither of these versions was judged by our product management team to fix issues of sufficient customer importance to warrant an upgrade.

The version I would recommend all RHEL-4 bind users install is the one you found:
http://people.redhat.com/~jvdias/bind/RHEL-4/9.3.1-1_EL4/

But we are not allowed to upgrade any RHEL package by a major release, as this would violate our service-level agreement, and we can only fix bugs in RHEL releases that have been raised specifically through Red Hat customer support. If you want this bug to be fixed in RHEL-4, please contact your Red Hat customer support representative through your subscription channel.
For me it also works to add

alias net-pf-10 off

to /etc/modprobe.conf, which completely disables IPv6.

But I agree it would be better if Red Hat released the new package which fixes the problem soon, because bind is very commonly used software, and problems in it should be fixed immediately.