Red Hat Bugzilla – Bug 47086
getaddrinfo status=2, SERVFAIL, on some addresses over time until named is restarted
Last modified: 2007-04-18 12:34:24 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2smp i686)
Description of problem:
Everything is configured correctly and running fine. However, after
running for a while (anywhere from 8-10 hours to 3 or 4 days), named stops
resolving the domain name for the mail server of my ISP. getaddrinfo
returns status code 2, nslookup returns SERVFAIL, and fetchmail stops
retrieving mail. Other addresses continue to resolve correctly, so named
itself is still running.
To fix the problem, you must stop and restart named. Once you do, it
starts resolving the name correctly again.
It is as if the ISP's DNS server hiccups (which does happen quite
regularly) and then named caches this bad state and never bothers
rechecking their server.
This is new in bind 9.1.0; versions 8.7 - 8.9 from the older releases
never had this problem.
Steps to Reproduce:
2. Start a program that resolves a DNS name at a regular interval (say every
5 minutes), such as fetchmail.
3. Let it run until their DNS hiccups (the URL in this form is a prime
example). This takes anywhere from 8-10 hours to 4 or 5 days.
4. Note status=2 from getaddrinfo and SERVFAIL from nslookup and similar tools.
6. Problem goes away.
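For anyone trying to reproduce this without fetchmail, the periodic lookup in step 2 can be approximated with a small script. This is only a rough sketch: the names probe and monitor, the default port 110 (POP3) and the 300 second interval are my own assumptions for illustration, and it exercises the system resolver through getaddrinfo rather than querying named directly.

```python
import socket
import time


def probe(host, port=110):
    """Do one getaddrinfo lookup; return (ok, detail)."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # exc.errno is the resolver's EAI_* status code.
        return False, f"status={exc.errno} ({exc.strerror})"
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    addrs = sorted({info[4][0] for info in infos})
    return True, ", ".join(addrs)


def monitor(host, interval=300, rounds=None, probe_fn=probe, sleep=time.sleep):
    """Probe host every `interval` seconds and log each result.

    rounds=None loops until interrupted. probe_fn and sleep are
    injectable so the loop can be exercised without waiting.
    """
    results = []
    done = 0
    while rounds is None or done < rounds:
        ok, detail = probe_fn(host)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host}: {'OK' if ok else 'FAIL'} {detail}")
        results.append((ok, detail))
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
    return results
```

Calling monitor("mail.wtrlo1.ia.home.com.") and leaving it running should show the point at which lookups switch from OK to FAIL and whether they ever recover without restarting named.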
Actual Results: With certain domain names, named stops resolving the name
until it is restarted. Restarting named resets it and causes it to start
resolving correctly again.
Expected Results: Named should not get stuck with a bad state for a domain
name in its cache (which is how it appears to be behaving).
nslookup output (when it stops working):
** server can't find mail.wtrlo1.ia.home.com.: SERVFAIL
fetchmail output (when it stops working):
fetchmail: fetchmail: getaddrinfo(mail.wtrlo1.ia.home.com.pop3)
fetchmail: Query status=2 (SOCKET)
I can't reproduce this anywhere.
Please check if this still happens with 9.1.3-0.rc2.2 (from rawhide) and let
me know if this fixes the problem for you.
Bind package 9.1.3-0.rc2.2 from rawhide appears to have solved the problem.
Since upgrading, I've encountered one getaddrinfo status=2 error (which I'm
sure is a problem with their server), but the next lookup attempt worked
correctly and named didn't get stuck with the error like it did with 9.1.0-10.
I got bitten by this bug, and at least for my setup the problem is big enough
that I think an erratum is needed. I've tried to debug the problem further, but
bind is not very debug friendly, and since I cannot reproduce the problem until
some hours after a restart, debugging is really difficult. I've managed to get
some logging output (resolve category, severity 3) from named, which I will attach.
Created attachment 24312
Debug output of a good resolve
Created attachment 24313
Debug output of a broken resolve