Bug 584356

Summary: Bind fails with assertion
Product: Red Hat Enterprise Linux 5 Reporter: Daniel Senie <dts>
Component: bind Assignee: Adam Tkac <atkac>
Status: CLOSED INSUFFICIENT_DATA QA Contact: qe-baseos-daemons
Severity: high Docs Contact:
Priority: medium    
Version: 5.4 CC: dts, fdewaley, ovasik, stbulicek
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-03-12 17:03:44 UTC Type: ---
Bug Depends On:    
Bug Blocks: 798457, 743405    
Attachments:
Description Flags
gdb output of named core none

Description Daniel Senie 2010-04-21 13:28:41 UTC
BIND fails when an "rndc reload" is issued by automated script, but only on occasion.

The relevant lines in the message log:

Apr 21 07:55:02 briar01 named[26939]: zt.c:143: REQUIRE((((zt) != ((void *)0)) && (((const isc__magic_t *)(zt))->magic == ((('Z') << 24 | ('T') << 16 | ('b') << 8 | ('l')))))) failed
Apr 21 07:55:02 briar01 named[26939]: exiting (due to assertion failure)


Version-Release number of selected component (if applicable):

bind-chroot-9.3.6-4.P1.el5_4.2
bind-9.3.6-4.P1.el5_4.2
bind-libs-9.3.6-4.P1.el5_4.2
bind-utils-9.3.6-4.P1.el5_4.2


Seems to trip every 10 days or so. Not easy to reproduce at will.

Looks like a null pointer dereference check is tripping.

Comment 1 Adam Tkac 2010-04-26 09:03:40 UTC
Would it be possible to attach a backtrace, please? You can obtain it this way:

1. add ENABLE_ZONE_WRITE=yes to your /etc/sysconfig/named
2. run "setsebool named_write_master_zones 1"
3. run "service named restart" (do NOT run "killall -HUP named" or "rndc reload") and wait for a crash. There should be a new file in the /var/named directory, called core.XXXX.
4. install bind-debuginfo package (http://kbase.redhat.com/faq/docs/DOC-9908)
5. run "gdb /usr/sbin/named /var/named/core.XXXX"
6. inside gdb session run "t a a bt full"
7. attach gdb output

Make sure you attach gdb output as a "private" attachment if it contains any security sensitive information.

Thank you in advance.
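
Steps 5 and 6 above can also be run non-interactively. The following is a minimal sketch, not a tested procedure: the function name collect_backtrace and the GDB override variable are made up for illustration, and it assumes gdb and the bind-debuginfo package are already installed. "thread apply all bt full" is the long form of "t a a bt full".

```shell
#!/bin/sh
# Sketch: write a full backtrace of every thread in a core file to a text file.
# collect_backtrace and $GDB are hypothetical names, not part of BIND or gdb.
collect_backtrace() {
    binary=$1   # e.g. /usr/sbin/named
    core=$2     # e.g. /var/named/core.XXXX (use the real core file name)
    out=$3      # file that will receive the gdb output

    # Refuse to run if the core file is missing or unreadable.
    [ -r "$core" ] || { echo "core file not readable: $core" >&2; return 1; }

    # -batch exits after running the commands; -ex supplies one gdb command.
    "${GDB:-gdb}" -batch -ex "thread apply all bt full" "$binary" "$core" > "$out" 2>&1
}
```

For example, collect_backtrace /usr/sbin/named /var/named/core.XXXX /tmp/named-bt.txt would leave the output in /tmp/named-bt.txt, ready to attach.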

Comment 2 Daniel Senie 2010-05-24 05:26:39 UTC
We have experienced this several times now. Takes several days at least for it to break.

Attaching gdb session output separately.

Comment 3 Daniel Senie 2010-05-24 05:29:47 UTC
Created attachment 416029 [details]
gdb output of named core

Comment 4 Daniel Senie 2010-12-01 19:51:39 UTC
Another languishing bug in a critical service. We are moving our name servers away from RedHat to other distros, so as to get versions of BIND that don't crash. We've had a cron job checking to make sure BIND is running, and kicking it when it's not as a work-around since May. It's kind of critical and mission-critical to have one's name servers actually running and reliable. Guess RedHat disagrees. Oh well.

Comment 5 RHEL Program Management 2011-01-11 20:51:31 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 6 RHEL Program Management 2011-01-11 22:19:29 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 9 Daniel Senie 2011-03-25 13:34:05 UTC
Eleven months since reported, and we're still running a Perl script every 5 minutes to ensure BIND is running. That script continues to save us regularly, a few times a month now, when BIND falls over dead. We've been migrating much of our DNS to non-RedHat systems to find functional stability.

Comment 11 Adam Tkac 2011-04-04 15:39:16 UTC
(In reply to comment #3)
> Created attachment 416029 [details]
> gdb output of named core

Unfortunately this backtrace is not sufficient to fix this issue.

Can you please try to get more information this way?

1. Put the following in your named.conf:

logging {
        channel default_debug {
                file "data/named.run" versions 3 size 1m;
                print-category yes;
                severity debug 99;
        };
};

2. Put OPTIONS='-d99' in /etc/sysconfig/named

3. Restart named and, when it crashes, please attach the /var/named/data/named.run* files.

Thank you in advance.

Comment 12 Daniel Senie 2011-04-04 17:11:37 UTC
Happy to add some more debugging output, and pleased that there is finally some interest in getting to the bottom of this. However, there's one challenge. Since I have a script that restarts the daemon automatically when it falls over, I will need to add to that kick script something that saves off the files you want, or else they'll get stomped out of existence. I wanted to ask if there's an alternative, such as a way to specify the file name with a unique ID (e.g. PID) in the third line of the example above. If not, I'll see what I can do in the Perl code to save off the file.

Because this is a mission-critical service, having the daemon just not be running until I get a chance to copy the data file out of the way and kicking the thing manually is NOT an option.

Comment 13 Adam Tkac 2011-04-06 07:36:24 UTC
(In reply to comment #12)
> Happy to add some more debugging output, and pleased that there is finally some
> interest in getting to the bottom of this. However, there's one challenge.
> Since I have a script that restarts the daemon automatically when it falls
> over, I will need to add to that kick script something that saves off the files
> you want, or else they'll get stomped out of existence. I wanted to ask if
> there's an alternative, such as a way to specify the file name with a unique ID
> (e.g. PID) in the third line of the example above. If not, I'll see what I can
> do in the Perl code to save off the file.
> 
> Because this is a mission-critical service, having the daemon just not be
> running until I get a chance to copy the data file out of the way and kicking
> the thing manually is NOT an option.

Unfortunately there is no way to append the PID to the debug log file names. Your script must be extended a little, for example this way (note I haven't tested the code below). Put something like this right before the point where you restart the crashed named.

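#!/usr/bin/perl
use strict;
use warnings;

# Save the debug logs from the first crash only: the copy is skipped
# once /debuglogfiles exists, so later restarts do not overwrite them.
# The single string is handed to /bin/sh by system().
system(q{
if ! [ -d /debuglogfiles ]; then
  mkdir /debuglogfiles;
  cp /var/named/data/named.run* /debuglogfiles;
fi
});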

Then simply attach named.run* files.
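
If the logs from every crash should be kept rather than only the first, the restart script could copy them into a separate directory per crash. This is an untested sketch under stated assumptions: save_named_logs is a hypothetical helper, and /var/named/data and /debuglogfiles are simply the paths used in this comment. The directory name combines a timestamp with the PID of the calling script, since named itself cannot add a PID to the file name.

```shell
#!/bin/sh
# Sketch: copy named.run* into a fresh per-crash directory before restarting named.
# save_named_logs is a hypothetical helper name.
save_named_logs() {
    src_dir=$1     # e.g. /var/named/data
    dest_root=$2   # e.g. /debuglogfiles

    # Timestamp plus the calling shell's PID keeps directories from colliding.
    dest="$dest_root/crash-$(date +%Y%m%d-%H%M%S)-$$"

    # Expand the glob; if nothing matched, $1 is the literal pattern and does not exist.
    set -- "$src_dir"/named.run*
    [ -e "$1" ] || return 1

    mkdir -p "$dest" || return 1
    cp -p "$@" "$dest"/ || return 1

    # Print the directory so the caller can log where the files went.
    echo "$dest"
}
```

Calling save_named_logs /var/named/data /debuglogfiles immediately before the restart command would leave one crash-... directory per crash, each holding the named.run* files to attach.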

Comment 14 RHEL Program Management 2011-05-31 14:03:55 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 15 Daniel Senie 2011-05-31 14:13:23 UTC
Last I checked, Bind was critical infrastructure, and that's what this bug, reported 13 months ago, is all about. I am grateful RedHat has stated this policy today that it will not fix this bug in critical infrastructure. It reconfirms a strategy we are undertaking to move all servers running critical components such as DNS away from RedHat products and to another distribution with a vendor that is doing a far better job of fixing bugs of this sort. We started with RedHat with RHL 2.1, a great many years ago when the company was small. Sorry to say goodbye after all this time.

Comment 21 RHEL Program Management 2012-04-02 10:28:01 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 23 RHEL Program Management 2012-06-12 01:11:53 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 24 Adam Tkac 2013-03-12 17:03:44 UTC
Since there has been no response to the needinfo request for more than 18 months, I'm closing this issue.