Bug 65470 - Bind stop responding
Summary: Bind stop responding
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: bind
Version: 7.3
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Daniel Walsh
QA Contact: Ben Levenson
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2002-05-24 21:18 UTC by Milan Kerslager
Modified: 2007-04-18 16:42 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2003-03-04 16:04:15 UTC
Embargoed:



Description Milan Kerslager 2002-05-24 21:18:43 UTC
I have a network of 150 PCs (10 Mbps connectivity). After updating to RH 7.3,
the DNS server stops responding within a day or so. I have to restart the
daemon manually to be able to resolve queries. I can see sleeping processes
(the daemon is running).

The system is fully up to date and was freshly installed.

This report is here for anyone who has the same problem. After the next hang
I will try to collect more information. I have an automatic script that
restarts named, so I need to wait for a night when no users are online.

Comment 1 Martin Blom 2002-05-25 12:37:07 UTC
I've seen the same thing. Here's what's in my log:

May 20 17:42:38 matilda named[28804]: message.c:809: REQUIRE(*rdataset == ((void *)0)) failed
May 20 17:42:38 matilda named[28804]: exiting (due to assertion failure)


Comment 2 Warren Togami 2002-06-08 10:44:14 UTC
This sounds like it may be this recent BIND Denial of Service attack.
http://www.cert.org/advisories/CA-2002-15.html

Please try this Red Hat Update
http://rhn.redhat.com/errata/RHSA-2002-105.html

Comment 3 Milan Kerslager 2002-06-12 11:49:04 UTC
I tried the update on two machines, but the problem still occurs. I had to
add 'service named reload' to /etc/cron.hourly (not daily as before).

There are no messages in the log files and 'ps xau' shows nothing odd, but
no names are resolved (only the internally defined zone works). As I wrote
above, reloading solves the problem.

The next time I see the problem, I will use tcpdump to check whether
communication with the root servers works.
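
The hourly reload above runs whether or not named is actually stuck. Below is a minimal sketch of a conditional check, assuming Python 3 is available on the server. It sends one recursive A query over UDP and reloads only if the reply is missing or is not NOERROR. The function names, the probe name www.example.com, the fixed query ID and the 3-second timeout are illustrative assumptions, not part of the original setup.

```python
import socket
import struct
import subprocess


def build_query(name, query_id):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def response_rcode(packet):
    """Return the RCODE (low 4 bits of the flags word) of a DNS message."""
    flags = struct.unpack("!H", packet[2:4])[0]
    return flags & 0x000F


def resolver_is_healthy(server, name, timeout=3.0):
    """True if the server answers the probe with NOERROR (rcode 0)."""
    query = build_query(name, 0x1234)  # fixed ID, assumed good enough for a probe
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and "connection refused"
            return False
    return len(reply) >= 4 and reply[:2] == query[:2] and response_rcode(reply) == 0


def reload_if_stuck(server="127.0.0.1", name="www.example.com"):
    """Reload named only when the probe fails; True if a reload was issued."""
    if resolver_is_healthy(server, name):
        return False
    subprocess.run(["service", "named", "reload"], check=False)
    return True
```

Calling reload_if_stuck() from the hourly cron job would replace the unconditional reload. Because the failure in this report affects only some expired records, a single probe name can miss the condition, so probing several external names may be needed.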

Comment 4 Milan Kerslager 2002-06-12 12:03:46 UTC
One more thing: both DNS servers that do not work well are connected to the
same provider. I have one more RH 7.3 machine with the same package versions,
but there is no problem there (i.e. no daemon hangs).

I will try to dig into this.

Comment 5 Milan Kerslager 2002-07-27 08:33:33 UTC
It seems that when a record expires from the cache, the daemon does not ask
for a new value (this does not happen for all expired records). When I tried
to resolve an expired record, named did not generate any query (traced via
tcpdump). After reloading named I could see a lot of queries and everything
was OK.

I will try to compile a new named from the current beta.

Comment 6 Milan Kerslager 2002-08-08 20:58:30 UTC
Log from tcpdump -i lo port domain:

[root@neptun root]# tcpdump -n -i lo port domain
tcpdump: listening on lo
22:53:11.666833 127.0.0.1.51346 > 127.0.0.1.domain:  11664+ A? www.ebanka.cz. (31) (DF)
22:53:11.668170 127.0.0.1.domain > 127.0.0.1.51346:  11664 ServFail 0/0/0 (31) (DF)
22:53:27.883474 127.0.0.1.51346 > 127.0.0.1.domain:  25056+ A? www.seznam.cz. (31) (DF)
22:53:27.884547 127.0.0.1.domain > 127.0.0.1.51346:  25056 2/2/0 A 212.80.76.3, (97) (DF)
22:53:34.124072 127.0.0.1.51346 > 127.0.0.1.domain:  20586+ A? www.ebanka.cz. (31) (DF)
22:53:34.125244 127.0.0.1.domain > 127.0.0.1.51346:  20586 ServFail 0/0/0 (31) (DF)

As you can see, www.seznam.cz was resolved but www.ebanka.cz was not. The
www.ebanka.cz record had expired from the cache, and named did not generate
any outgoing request for it. Not all expired records fail...
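
Picking the failing names out of a longer capture by eye is tedious. Below is a minimal sketch that does it from saved tcpdump text, assuming Python 3 and the line layout shown in this report, with each packet on a single line (newer tcpdump versions print a different layout). The function name and regular expressions are illustrative.

```python
import re

# Query:  "<client> > <server>.domain:  <id>+ A? <name> ..."
QUERY_RE = re.compile(r">\s+\S+\.domain:\s+(\d+)\+?\s+(\w+)\?\s+(\S+)")
# Reply:  "<server>.domain > <client>:  <id> <status or counts> ..."
REPLY_RE = re.compile(r"\.domain\s+>\s+\S+:\s+(\d+)\S*\s+(\S+)")


def servfail_names(capture_text):
    """Return the queried names whose matching reply was ServFail, in order."""
    pending = {}  # query ID -> queried name, until the reply is seen
    failed = []
    for line in capture_text.splitlines():
        query = QUERY_RE.search(line)
        if query:
            pending[query.group(1)] = query.group(3)
            continue
        reply = REPLY_RE.search(line)
        if reply and reply.group(1) in pending:
            name = pending.pop(reply.group(1))
            if reply.group(2) == "ServFail":
                failed.append(name)
    return failed
```

Fed the six packets above, the function returns www.ebanka.cz. twice and does not report www.seznam.cz.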

Comment 7 Milan Kerslager 2002-08-08 21:02:54 UTC
A few minutes later www.ebanka.cz works again. Named resolved the query as if it were new:

# dig www.ebanka.cz

; <<>> DiG 9.2.1 <<>> www.ebanka.cz
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55297
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ebanka.cz.                 IN      A

;; ANSWER SECTION:
www.ebanka.cz.          64000   IN      A       194.228.112.55
www.ebanka.cz.          64000   IN      A       195.250.142.3
www.ebanka.cz.          64000   IN      A       212.67.66.162
www.ebanka.cz.          64000   IN      A       62.168.6.2

;; AUTHORITY SECTION:
ebanka.cz.              64000   IN      NS      ms.ebanka.cz.
ebanka.cz.              64000   IN      NS      ns.ebanka.cz.

;; Query time: 24 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Aug  8 22:59:23 2002
;; MSG SIZE  rcvd: 129


Comment 8 Need Real Name 2002-09-02 00:16:11 UTC
I have seen the same problem.  I have not been able to track it down yet, but this is what I have found so far:

It appears to be a memory leak related to non-authoritative queries,
particularly large ones.  The primary symptom is that cached non-authoritative
data gets corrupted.  The larger the query, the more likely it is to get
corrupted.  The one I see fail first most of the time is hotmail, since they
have so many MX records.

When the problem has arisen, a client trying to send mail to hotmail will sit
and hang.  What is happening is that the client does an EHLO, MAIL FROM and
RCPT TO.  The mail server tries to resolve the MX records for hotmail (in the
RCPT TO).  Sniffing the transaction shows that the DNS server is sending
garbled results back to the mail server.  The mail server cannot parse these
results, so it sends a RST and tries again.  This continues, and the mail
server never sends back an OK to the client in response to its RCPT TO.  Even
though this appears to be a mail issue, restarting named on the DNS server
fixes the problem, for about a day.

One theory I have is that the corruption may happen when a non-authoritative
record expires and gets requeried.  Domains like hotmail that get constant use
will show this problem more quickly (in about a day, consistently).

I have the dig output for hotmail from just after the daemon restart.  I will
compare that to one I take when the problem happens again and post both here.
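
A minimal sketch of such a comparison, assuming Python 3 and dig's default text output with an ANSWER SECTION block like the one in comment 7. The helper names are hypothetical. TTLs are ignored because they count down between runs.

```python
def answer_records(dig_output):
    """Collect (name, type, rdata) tuples from the ANSWER SECTION of dig output."""
    records = set()
    in_answer = False
    for line in dig_output.splitlines():
        if line.startswith(";; ANSWER SECTION"):
            in_answer = True
            continue
        if in_answer:
            # The section ends at a blank line or at the next ";;" header
            if not line.strip() or line.startswith(";"):
                in_answer = False
                continue
            fields = line.split(None, 4)  # name, TTL, class, type, rdata
            if len(fields) == 5:
                name, _ttl, _cls, rtype, rdata = fields
                records.add((name.lower(), rtype, rdata.strip()))
    return records


def diff_answers(before, after):
    """Return (records only in 'before', records only in 'after'), both sorted."""
    old, new = answer_records(before), answer_records(after)
    return sorted(old - new), sorted(new - old)
```

Two non-empty lists would point at changed or corrupted records. If dig cannot parse the reply at all, the second output has no answer section and every record shows up as missing.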


Comment 9 Milan Kerslager 2002-09-02 18:31:39 UTC
I sniffed the network when this bug appeared. The server responds with ServFail:

# tcpdump -n port domain
tcpdump: listening on eth0
20:26:20.634274 195.113.159.169.40761 > 195.113.159.1.domain:  45007+ A? www.seznam.cz. (31) (DF)
20:26:20.634274 195.113.159.1.domain > 195.113.159.169.40761:  45007 ServFail 0/0/0 (31) (DF)


This is after 'service named reload':

# tcpdump -n port domain
tcpdump: listening on eth0
20:26:51.264274 195.113.159.169.40761 > 195.113.159.1.domain:  38082+ A? www.seznam.cz. (31) (DF)
20:26:51.264274 195.113.159.1.domain > 195.113.159.169.40761:  38082 2/2/0 A 212.80.76.3, (97) (DF)

I don't have a full dump of the reply datagram. I will try to capture it next time.

Comment 10 Milan Kerslager 2003-01-06 12:40:47 UTC
The workaround is to add forwarder servers to the BIND config file (i.e. the ISP's name servers).

My /etc/named.conf:
====================
options {
        directory "/var/named";
        forwarders {
                147.230.16.1;
                195.113.167.1;
        };
};

I have not seen the bug since. But this is not a real fix...

Comment 11 Daniel Walsh 2003-03-04 16:04:15 UTC
Fixed in the latest release. We believe the problem is fixed by the new thread package.

Comment 12 Milan Kerslager 2006-06-05 21:41:43 UTC
This issue has never been fixed in RHL 7.3, see bug #194128 (FedoraLegacy bug now).

