I have a network with 150 PCs (10 Mbps connectivity). After updating to RH 7.3, the DNS server stops responding within a day or so. I have to restart the daemon manually before it resolves queries again. I can see sleeping processes (the daemon is running). The system is fully up to date and was freshly installed. I am filing this report for anyone who has the same problem. After the next hang I will try to collect more information. I have an automatic script that restarts named, so I need to wait for a night without online users.
I've seen the same thing. Here's what's in my log:

May 20 17:42:38 matilda named[28804]: message.c:809: REQUIRE(*rdataset == ((void *)0)) failed
May 20 17:42:38 matilda named[28804]: exiting (due to assertion failure)
This sounds like it may be the recent BIND denial-of-service vulnerability:
http://www.cert.org/advisories/CA-2002-15.html
Please try this Red Hat update:
http://rhn.redhat.com/errata/RHSA-2002-105.html
I tried the update on two machines, but the problem still occurs. I had to add 'service named reload' to /etc/cron.hourly (not daily as before). There are no messages in the log files and 'ps xau' shows nothing odd, but names are not resolved (only the internally defined zone works). As I wrote above, reloading solves the problem. The next time I see the problem, I will use tcpdump to check whether communication with the root servers works.
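The hourly reload above is unconditional. As a rough sketch only, a cron-driven check could probe the local server first and reload only when a lookup fails. The probe names, the `127.0.0.1` server address, the function names, and the decision rule (reload on SERVFAIL or on no answer at all) are assumptions made for illustration, not something tested on the affected machines:

```python
import re
import subprocess

# Matches the status field of the dig header line, e.g. "status: NOERROR".
STATUS_RE = re.compile(r"status: ([A-Z]+)")


def dig_status(dig_output):
    """Return the response status found in dig output, or None if absent."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def needs_reload(statuses):
    """Assumed rule: reload if any probe got SERVFAIL or no answer at all."""
    return any(status in ("SERVFAIL", None) for status in statuses)


def probe(name, server="127.0.0.1"):
    """Ask the given server for an A record with a short timeout."""
    result = subprocess.run(
        ["dig", "@" + server, name, "A", "+time=2", "+tries=1"],
        capture_output=True,
        text=True,
    )
    return dig_status(result.stdout)


def run_once(names=("www.seznam.cz", "www.ebanka.cz")):
    """Probe each name and reload named only if a probe failed."""
    statuses = [probe(name) for name in names]
    if needs_reload(statuses):
        subprocess.run(["service", "named", "reload"])
        return True
    return False
```

A script in /etc/cron.hourly would call `run_once()`. A name that is genuinely broken upstream would also trigger a reload, so the probe list should contain names known to be stable.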
One more thing: both DNS servers that misbehave are connected to the same provider. I have one more RH 7.3 machine with the same package versions, but it shows no problem (i.e. no daemon hangs). I will dig into this.
It seems that when a record expires from the cache, the daemon does not ask for a new value (this does not happen for every expired record). When I tried to resolve an expired record, named did not generate any outgoing query (traced via tcpdump). After reloading named I could see a lot of queries and everything was OK. I will try to compile a new named from the current beta.
Log from tcpdump -i lo port domain:

[root@neptun root]# tcpdump -n -i lo port domain
tcpdump: listening on lo
22:53:11.666833 127.0.0.1.51346 > 127.0.0.1.domain: 11664+ A? www.ebanka.cz. (31) (DF)
22:53:11.668170 127.0.0.1.domain > 127.0.0.1.51346: 11664 ServFail 0/0/0 (31) (DF)
22:53:27.883474 127.0.0.1.51346 > 127.0.0.1.domain: 25056+ A? www.seznam.cz. (31) (DF)
22:53:27.884547 127.0.0.1.domain > 127.0.0.1.51346: 25056 2/2/0 A 212.80.76.3, (97) (DF)
22:53:34.124072 127.0.0.1.51346 > 127.0.0.1.domain: 20586+ A? www.ebanka.cz. (31) (DF)
22:53:34.125244 127.0.0.1.domain > 127.0.0.1.51346: 20586 ServFail 0/0/0 (31) (DF)

As you can see, www.seznam.cz was resolved but www.ebanka.cz was not. The www.ebanka.cz record had expired from the cache, and named did not generate any outgoing request for it. Not all expired records fail...
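To pick the failing names out of a longer capture, something like the following could be used. It is a minimal sketch that assumes the tcpdump line format shown in the log above (query lines of the form `ID+ A? name.`, failures of the form `ID ServFail`); the function name and the regular expressions are my own and other tcpdump versions may print these lines differently:

```python
import re

# Query line, e.g. ": 11664+ A? www.ebanka.cz. (31) (DF)"
QUERY_RE = re.compile(r": (\d+)\+? ([A-Z]+)\? (\S+)")
# Failed response line, e.g. ": 11664 ServFail 0/0/0 (31) (DF)"
SERVFAIL_RE = re.compile(r": (\d+) ServFail\b")


def servfail_names(lines):
    """Return the queried names whose responses were ServFail, in order."""
    pending = {}  # query ID -> queried name
    failed = []
    for line in lines:
        query = QUERY_RE.search(line)
        if query:
            pending[query.group(1)] = query.group(3)
            continue
        response = SERVFAIL_RE.search(line)
        if response:
            # "?" marks a failure whose query was not seen in the capture.
            failed.append(pending.pop(response.group(1), "?"))
    return failed
```

Feeding it the six packet lines above would report www.ebanka.cz. twice and leave www.seznam.cz. out.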
A few minutes after this, www.ebanka.cz works again. Named resolved the query as if it were new:

# dig www.ebanka.cz

; <<>> DiG 9.2.1 <<>> www.ebanka.cz
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55297
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ebanka.cz. IN A

;; ANSWER SECTION:
www.ebanka.cz. 64000 IN A 194.228.112.55
www.ebanka.cz. 64000 IN A 195.250.142.3
www.ebanka.cz. 64000 IN A 212.67.66.162
www.ebanka.cz. 64000 IN A 62.168.6.2

;; AUTHORITY SECTION:
ebanka.cz. 64000 IN NS ms.ebanka.cz.
ebanka.cz. 64000 IN NS ns.ebanka.cz.

;; Query time: 24 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Aug 8 22:59:23 2002
;; MSG SIZE rcvd: 129
I have seen the same problem. I have not been able to track it down yet, but this is what I have found so far:

It appears to be a memory leak related to non-authoritative queries, particularly large ones. The primary symptom is that cached non-authoritative data gets corrupted. The larger the response, the more likely it is to get corrupted. The one I see happen first most of the time is hotmail, since they have so many MX records.

When the problem has arisen, a client trying to send mail to hotmail will sit and hang. What is happening is that the client does an EHLO, MAIL FROM and RCPT TO. The mail server tries to resolve the MX records for hotmail (on the RCPT TO). Sniffing the transaction shows that the DNS server is sending garbled results back to the mail server. The mail server cannot grok these results, so it does a RST and tries again. This continues, and the mail server never sends back an OK to the client in response to its RCPT TO. Even though this appears to be a mail issue, restarting named on the DNS server fixes the problem, for about a day.

One theory I have is that the corruption may happen when a non-authoritative record expires and gets requeried. Domains like hotmail that get constant use will show this problem more quickly (in about a day, consistently). I have the dig output for hotmail from just after the daemon restart. I will compare that to one I take when the problem happens again and post both here.
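For the before/after comparison, a small helper could diff the ANSWER SECTION of two saved dig outputs while ignoring TTLs (which differ between runs anyway). This is only a sketch: it assumes the standard dig layout with a ";; ANSWER SECTION:" header followed by one record per line, and the function names are hypothetical:

```python
def answer_records(dig_output):
    """Return the set of (name, type, rdata) tuples in the ANSWER SECTION."""
    records = set()
    in_answer = False
    for raw in dig_output.splitlines():
        line = raw.strip()
        if line.startswith(";; ANSWER SECTION"):
            in_answer = True
            continue
        if in_answer:
            # A blank line or the next ";;" header ends the section.
            if not line or line.startswith(";"):
                break
            fields = line.split(None, 4)
            if len(fields) == 5:
                name, _ttl, _cls, rtype, rdata = fields
                records.add((name, rtype, rdata))
    return records


def compare_answers(before, after):
    """Return (records only in before, records only in after), sorted."""
    old, new = answer_records(before), answer_records(after)
    return sorted(old - new), sorted(new - old)
```

If the cached data really is corrupted, the second list should contain records that never appeared in the clean capture, or the first list should show MX records that went missing. A response too garbled for dig to print would simply produce an empty set.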
I sniffed the network when this bug appeared. The server responds with ServFail:

# tcpdump -n port domain
tcpdump: listening on eth0
20:26:20.634274 195.113.159.169.40761 > 195.113.159.1.domain: 45007+ A? www.seznam.cz. (31) (DF)
20:26:20.634274 195.113.159.1.domain > 195.113.159.169.40761: 45007 ServFail 0/0/0 (31) (DF)

This is after 'service named reload':

# tcpdump -n port domain
tcpdump: listening on eth0
20:26:51.264274 195.113.159.169.40761 > 195.113.159.1.domain: 38082+ A? www.seznam.cz. (31) (DF)
20:26:51.264274 195.113.159.1.domain > 195.113.159.169.40761: 38082 2/2/0 A 212.80.76.3, (97) (DF)

I don't have a full dump of the reply datagram. I will try to capture it next time.
A workaround is to add forwarders to the BIND config file (i.e. the name servers of your ISP). My /etc/named.conf:
====================
options {
    directory "/var/named";
    forwarders { 147.230.16.1; 195.113.167.1; };
};

I have not seen the bug since. But this is not a real fix...
Fixed in the latest release. We believe the problem is fixed with the new thread package.
This issue has never been fixed in RHL 7.3, see bug #194128 (FedoraLegacy bug now).