From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2smp i686)

Description of problem:
Everything is configured correctly and running fine. However, after running for a while (anywhere from 8-10 hours to 3 or 4 days), named stops resolving the domain name of my ISP's mail server. getaddrinfo returns status code 2, nslookup returns SERVFAIL, and fetchmail stops retrieving mail. Other addresses continue to resolve correctly, so named itself is still running. To fix the problem, you must stop and restart named; once you do, it resolves the name correctly again. It is as if the ISP's DNS server hiccups (which does happen quite regularly) and named then caches this bad state and never rechecks their server. This is new in bind 9.1.0; the older 8.7 - 8.9 versions never had this problem.

How reproducible:
Always

Steps to Reproduce:
1. Start named.
2. Start a program that resolves a DNS name at a regular interval (say, every 5 minutes), such as fetchmail.
3. Let it run until their DNS hiccups (URL in this form is prime example), anywhere from 8-10 hours to 4 or 5 days.
4. Note status=2 from getaddrinfo and SERVFAIL from nslookup and similar.
5. Restart named.
6. Problem goes away.

Actual Results:
With certain domain names, named stops resolving the name until it is restarted. Restarting named resets it and causes it to start resolving correctly again.

Expected Results:
named should not get stuck with a bad state for a domain name in its cache (which is what appears to be happening).

Additional info:
bind-9.1.0-10

nslookup output (when it stops working):
** server can't find mail.wtrlo1.ia.home.com.: SERVFAIL

fetchmail output (when it stops working):
fetchmail: fetchmail: getaddrinfo(mail.wtrlo1.ia.home.com.pop3)
fetchmail: Query status=2 (SOCKET)
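Step 2 of the reproduction steps could also be exercised with a small poller instead of fetchmail. The following is a minimal sketch, not the reporter's actual setup: the function names `check_once` and `monitor`, the port number 110 (assumed here as the usual POP3 port), and the output format are all hypothetical choices made for illustration. It resolves one name at a fixed interval and counts consecutive failures, so a streak that never ends until named is restarted would match the behaviour described above.

```python
import socket
import time
from typing import Callable, Optional


def check_once(host: str, port: int = 110,
               resolver: Callable = socket.getaddrinfo) -> Optional[str]:
    """Resolve host once. Return None on success, or the error text on failure."""
    try:
        resolver(host, port)
        return None
    except socket.gaierror as exc:
        return str(exc)


def monitor(host: str, rounds: int = 3, interval_s: float = 300.0,
            resolver: Callable = socket.getaddrinfo,
            sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll host `rounds` times; return the failure streak at the end."""
    streak = 0
    for i in range(rounds):
        err = check_once(host, resolver=resolver)
        # Reset the streak on success, extend it on failure.
        streak = 0 if err is None else streak + 1
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        status = "ok" if err is None else f"FAILED ({err}), streak={streak}"
        print(f"{stamp} {host}: {status}")
        if i + 1 < rounds:
            sleep(interval_s)
    return streak


if __name__ == "__main__":
    # Host name taken from the report; the 5-minute interval follows step 2.
    monitor("mail.wtrlo1.ia.home.com.", rounds=12, interval_s=300.0)
```

Running a second instance against an unrelated name at the same time would help tell a stuck cache entry for one domain apart from named failing altogether.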
I can't reproduce this anywhere. Please check if this still happens with 9.1.3-0.rc2.2 (from rawhide) and let me know if this fixes the problem for you.
Bind package 9.1.3-0.rc2.2 from rawhide appears to have solved the problem. Since upgrading, I've encountered one getaddrinfo status=2 error (which I'm sure is a problem with their server), but the next lookup attempt worked correctly and named didn't get stuck with the error as it did with 9.1.0-10. Thanks.
I got bitten by this bug, and at least for my setup the problem is so big that I think an errata is needed. I've tried to debug the problem further, but bind is not very debug friendly, and since I cannot reproduce the problem until some hours after a restart, debugging is really difficult. I've managed to get some logging output (resolve category, severity 3) from named, which I will attach.
Created attachment 24312: Debug output of a good resolve
Created attachment 24313: Debug output of a broken resolve