Description of Problem:
The latest bind erratum, 8.2.3, for both RHL 6.2 and 7.0 is totally broken. Once it is installed or updated, using just the default configuration supplied with bind plus the caching-nameserver package, it is totally unreliable and starts failing within 5 minutes.

Version-Release number of selected component (if applicable):
RHL 6.2: bind-8.2.3-0.6.x
RHL 7.0: bind-8.2.3-1

How Reproducible:
100%

Steps to Reproduce:
1. Install a clean RHL 6.2 or 7.0 system.
2. Run up2date -u to apply all currently released errata.
3. Be sure bind and caching-nameserver are installed and current.
4. Edit /etc/resolv.conf and add "nameserver 127.0.0.1".
5. Whether or not ISP nameservers are also listed, the server will fail within 5-10 minutes at most. Usually reproducible in less than a minute.

Actual Results:
Random failures; examples to follow.

Expected Results:
Proper DNS resolution.

Additional Information:
This problem has been occurring on and off for a while now, maybe a few months; it is hard to pinpoint exactly. However, I recently up2date'd my machines, and since then I've had nothing but nonstop total DNS failure. I have to restart named 2, 3, 6, even 10 times in a row: click on a URL, it fails; again, it fails; again, and finally DNS resolves the page. Within a minute, that URL won't load again. Other sites work fine. Then they disappear, and one that wouldn't work before now works again.

I first noticed this problem strongly when trying to use Google after the update. When I go to www.google.com, it redirects me to www.google.ca, which then says no such host can be found. Doing a manual lookup of the IP of www.google.com and pointing www.google.ca at the IP of .com acted as a workaround.

The sites I consistently use to reproduce this are the ones I use most often: www.google.com, www.google.ca (won't work from outside .ca), irc.lame.org, irc.openprojects.net, www.redhat.com, bugzilla.redhat.com
pts/0 root@gw:/# nslookup irc.redhat.com
Server: gw.capslock.lan
Address: 192.168.1.1

*** gw.capslock.lan can't find irc.redhat.com: Server failed

pts/0 root@gw:/# service named restart
Shutting down named: [ OK ]
Starting named: [ OK ]
pts/0 root@gw:/# nslookup irc.redhat.com
Server: gw.capslock.lan
Address: 192.168.1.1

Non-authoritative answer:
Name: irc.openprojects.net
Addresses: 64.28.67.98, 207.106.22.229, 216.53.71.65, 198.186.203.27
Aliases: irc.redhat.com

pts/0 root@gw:/# nslookup irc.redhat.com
Server: gw.capslock.lan
Address: 192.168.1.1

*** gw.capslock.lan can't find irc.redhat.com: Server failed

pts/0 root@gw:/# nslookup www.redhat.com
Server: gw.capslock.lan
Address: 192.168.1.1

Non-authoritative answer:
Name: www.redhat.com
Addresses: 216.148.218.197, 216.148.218.195

pts/0 root@gw:/# nslookup irc.lame.org
Server: gw.capslock.lan
Address: 192.168.1.1

*** gw.capslock.lan can't find irc.lame.org: Server failed
Sometimes I get lookups working for 2 to 3 minutes, maybe as many as 5 to 10 minutes. To trigger a failure, all I need to do is leave the machine completely idle, wait 2-3 minutes, then hit up-arrow and Enter to repeat the last lookup. If it works one time, it will fail at some point. If I can't get one to fail, I try a different host, and generally one fails right away. Restarting bind does not guarantee it will work right away either. It might work, or it might fail immediately. Sometimes 2-3 restarts are needed.
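The repeat-and-wait procedure described above could be automated with something like the following. This is only a rough sketch, not what was actually run for this report: it assumes a Python interpreter is available on the test box, the names HOSTS, probe and watch are made up for illustration, and it goes through the system resolver (so /etc/hosts entries or a caching daemon could mask failures that nslookup would show).

```python
import socket
import time

# Hosts taken from this report; swap in whatever you normally use.
HOSTS = ["www.google.com", "www.redhat.com", "bugzilla.redhat.com", "irc.lame.org"]


def probe(hosts, resolve=socket.gethostbyname):
    """Resolve each host once and return a list of (host, ok, detail) tuples."""
    results = []
    for host in hosts:
        try:
            results.append((host, True, resolve(host)))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results.append((host, False, str(exc)))
    return results


def watch(hosts, rounds=10, interval=60.0,
          resolve=socket.gethostbyname, sleep=time.sleep):
    """Probe all hosts `rounds` times, pausing `interval` seconds in between.

    Prints one timestamped line per lookup and returns the failure count.
    """
    failures = 0
    for i in range(rounds):
        for host, ok, detail in probe(hosts, resolve):
            stamp = time.strftime("%H:%M:%S")
            status = ("ok " if ok else "FAILED ") + detail
            print(f"{stamp} {host}: {status}")
            if not ok:
                failures += 1
        if i < rounds - 1:
            sleep(interval)
    return failures

# Example (hits the real resolver and takes about 9 minutes):
#   watch(HOSTS, rounds=10, interval=60)
```

Running it once before and once after a "service named restart" would give a timestamped record of which names fail and when, rather than relying on up-arrow and Enter.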
I just rebuilt bind 9.1.0 from RHL 7.1 on RHL 7.0 after removing the dependency on tar and replacing tar -j with bzcat piped to tar. Result: same thing. So it appears this might be more than a bind issue, perhaps a library issue or some such. I don't know enough about bind to debug the issue further, but I've discussed it with a few other people now too, and they're having similar problems. :o/
That didn't make much sense, considering I am debugging the issue further... I'll try to find out more tomorrow.
Any news on this one?
Is this still an open bug, or did you figure out the problem? It was just assigned to me.

Dan
This problem drove me completely nuts, to the point where 3 DNS experts (one of whom was Bryce) couldn't fix it and couldn't determine what the problem could be. Complete bafflement. As such, I just stopped using bind entirely, disabled local DNS, and started using /etc/hosts on all machines, mirrored via cron, the good old-fashioned 1970s way. I pointed DNS to my ISP's servers, and all problems went away for quite some time.

Many, many months later, I began having new DNS problems, in particular in mozilla, and oddly, only from certain machines on my network. Frustration once again, with many of the same symptoms as the problem described here. I was essentially unable to use the Internet properly from those machines while the rest of my LAN seemed to work fine. I began to suspect something was wrong on my firewall. I investigated the configuration of pretty much everything on the firewall and tested many things, all to no avail. I couldn't find any problems.

Then I checked /var/log/messages and scanned it for anything that could even remotely be the culprit. Lo and behold:

messages.2.gz:Oct 16 10:28:34 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34844).
messages.2.gz:Oct 16 10:29:39 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34844).
messages.2.gz:Oct 16 10:33:37 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34844).
messages.2.gz:Oct 16 10:34:12 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34846).
messages.2.gz:Oct 16 10:37:39 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34846).
messages.2.gz:Oct 16 10:37:48 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34846).
messages.2.gz:Oct 16 10:39:53 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34848).
messages.2.gz:Oct 16 10:39:55 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34848).
messages.2.gz:Oct 16 10:39:58 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34849).
messages.2.gz:Oct 16 10:40:00 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34848).
messages.2.gz:Oct 16 10:45:36 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34849).
messages.2.gz:Oct 16 10:45:43 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34857).
messages.2.gz:Oct 16 10:46:17 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=34860).

IP masquerading was failing for UDP because the kernel could not allocate a new masquerade entry. But for some odd reason, *only* on *certain* machines. ARRRRGHHHHH!! In other words, a 2.2.x kernel bug (IMHO). The solution was to reboot the machine. The problem went away for a couple of months and then returned, and another reboot solved it again.

I do not know for certain whether this kernel bug/issue was responsible for the bind issue I am reporting here; however, it is entirely likely that it was the problem at that point in time as well. Since nobody else seems to have experienced this problem, I am now considering it a local issue, due to the specifics of my own kernel (which is *cough* homebrew *cough* from stock kernel.org sources).

I have now retired my trusty 486-DX2/66 and plan on putting a newer, RHL 8.0 capable machine in its place, with iptables and a stock Red Hat kernel rather than the minimalized kernel I had no choice but to use on the 12MB 486. ;o)

In short, I consider this issue closed due to kernel funkification.

Closing as WORKSFORME now.
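For anyone chasing similar symptoms, the IP_MASQ messages quoted above can be tallied per hour to see whether they line up with the DNS failures. The following is a hedged sketch only: the regular expression is written against the exact message format shown in this report, the function names read_log and summarize are hypothetical, and the log path in the example comment is an assumption about where rotated logs live.

```python
import gzip
import re
from collections import Counter

# Written against the 2.2-kernel message format quoted in this report.
MASQ_RE = re.compile(
    r"(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<time>\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"IP_MASQ:ip_masq_new\(proto=(?P<proto>\w+)\): could not get free masq entry "
    r"\(free=(?P<free>\d+)\)"
)


def read_log(path):
    """Yield lines from a plain or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        yield from fh


def summarize(lines):
    """Count masq allocation failures per (month, day, hour, proto)."""
    counts = Counter()
    for line in lines:
        m = MASQ_RE.search(line)
        if m:
            hour = m.group("time")[:2]
            key = (m.group("month"), int(m.group("day")), hour, m.group("proto"))
            counts[key] += 1
    return counts

# Example (path is an assumption; adjust to your log rotation scheme):
#   for key, n in sorted(summarize(read_log("/var/log/messages.2.gz")).items()):
#       print(key, n)
```

A burst of these messages in the same hour as a run of "Server failed" lookups would support the masquerading theory; their absence would point back at bind or the resolver.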