144586 – named crashes unpredictably and mysteriously after upgrade to Update 4

Bug 144586 - named crashes unpredictably and mysteriously after upgrade to Update 4

Summary: named crashes unpredictably and mysteriously after upgrade to Update 4

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	bind
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Martin Stransky
QA Contact:	Ben Levenson
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2005-01-08 21:58 UTC by Ronald Cole
Modified:	2007-11-30 22:07 UTC (History)
CC List:	2 users (show)
Fixed In Version:	U8
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-08-09 06:34:00 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Ronald Cole 2005-01-08 21:58:20 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3)
Gecko/20040924

Description of problem:
Yes, I know it's a bad summary, but I cannot reproduce it at will, nor
predict when or why it will fail because it leaves no indication of
what went wrong in a log file anywhere.  It just dies suddenly and
unexpectedly!  Named had been running flawlessly for me since I
installed it over a year ago until I upgraded to RHEL3ES Update 4.

Anyway, I open this bug so that others who've had it mysteriously
crash on them after the upgrade can add to the braintrust required to
resolve it.

Version-Release number of selected component (if applicable):
bind-9.2.4-5_EL3

How reproducible:
Sometimes

Steps to Reproduce:
1. Upgrade bind (and maybe caching-nameserver) to Update 4.
2.
3.
    

Actual Results:  Named will die after a day or two, unpredictably and
mysteriously.

Expected Results:  Named to run and run and run, as it used to before
Update 4.

Additional info:

Also updated at the time was caching-nameserver to 7.3-3_EL3.

Comment 1 Jason Vas Dias 2005-01-09 01:08:10 UTC

 What kind of nameserver are you running ? Is it a caching only
 nameserver ?   If not, you must have replaced the caching-nameserver
 config files with your own. It would be most helpful in diagnosing
 the problem to see these config files - if you don't want to append
 them to this bug where they can be publicly viewed, please send them
 to jvdias - I'll treat them as confidential .
 If it is serving external clients, what is the load like on the 
 server ? Doing an 'rndc stats' after the server is experiencing 
 normal load for at least an hour and appending the stats file
 to the bug or sending it to me would be helpful - the stats file
 is set by the caching-nameserver default config files to be:
      statistics-file "data/named.stats.txt"; 

 Are there really no 'named:' log messages in /var/log/messages from
 just before the named process exits, such as 'out of memory' errors ?
 If there are 'out of memory' errors, you can adjust the 'ulimit' 
 of the named process - do you set the 'stacksize' or 'datasize'
 named.conf 'options{ }' ?

 If not, it may be that named is experiencing a memory access 
 violation, and would dump a core file in
  $ROOTDIR/var/named
 where $ROOTDIR is set in /etc/sysconfig/named, IF core file
 generation was not disabled by default, which it is.
 Please edit the /etc/profile file around line 28, to read:
"
# No core files by default
# ulimit -S -c 0 > /dev/null 2>&1
ulimit -c unlimited >/dev/null 2>&1
" 
 and edit /etc/init.d/named, around line 52, to read:
"
	if [ $conf_ok -eq 1 ]; then	   
#	   daemon /usr/sbin/named -u named ${OPTIONS};
           /sbin/ulimit -c unlimited
           /usr/sbin/named -u named ${OPTIONS};
	   RETVAL=$?;
	else
"

 Then if named generates a memory access violation, there will be
 a $ROOTDIR/var/named/core.$pid file, where $pid is the process
 id of the named process which crashed - if there are any of these
 files, please gzip them and append them to this bug or send them
 to me.

 Gathering some named logging information would also be useful .
 If the load on the server is light, and you have loads of disk space
 - AT LEAST 1GB -
 on the $ROOTDIR/var/named partition, after the named process starts
 do a 
    'rndc trace 99' .
 Then named will create a file in $ROOTDIR/var/named called 
 'named.run', containing debug log messages. If the crash is then
 observed, do a 
  'tail -1000000 $ROOTDIR/var/named/named.run | gzip -9 >
/tmp/named.run.gz'
 and append the /tmp/named.run.gz file to this bug or send it to me.

 Alternatively, if you do not have at least 1GB of disk space on
 $ROOTDIR/var/named , then please put this logging definition in
 named.conf:

  logging {
           channel named_debug {
                 file "/tmp/named.dbg.log" versions unlimited size 10M;
                 print-time yes;
                 severity debug 99;
           };
           category default_debug {  
                 named_debug;
           };
           category queries {
                 named_debug;
           };
   };
 
 Then files like /tmp/named.dbg.1, /tmp/named.dbg.2, etc will be
 created, containing debugging and query data. If the named process
 again exits, please compress the last (highest numbered)
 /tmp/named.dbg.[0-9]* file and append it to this bug or send it to 
 me.            
                   
 It is rumoured on the bind-* mailing lists that there are certain
 queries of certain zones, usually with misconfigured nameservers,
 that can cause bind9 nameservers to exit, though I've never been 
 able to pin one down yet. Logging queries with the above logging
 statement or just with 'rndc querylog' to send query logs to syslog
 would be invaluable in diagnosing this problem.

 We have BIND 9.2.4 servers in our test lab that have been running
 uninterrupted for months on RHEL-3 systems, so I don't believe the
 problem will turn out to be a named programming error. I'm sorry
 for the problems you are experiencing and will do my best to get
 to fix the root cause of the problem - your help in gathering 
 the information necessary to do so would be much appreciated.

Comment 2 Ronald Cole 2005-01-09 20:40:40 UTC

Kind of nameserver: started with "caching-nameserver" and added
forward and reverse zones for my local network.  Configuration emailed
to you.

I don't know exactly when the named process exits.  There are only
"lame server" entries in /var/log/messages after the startup messages.

And, yes, prior to installing Update 4, I had also been running BIND
9.2.4 uninterrupted for months on three different machines (albeit
with different forward and reverse zones for the different local
networks).

Comment 3 Chris Stankaitis 2005-01-14 17:02:47 UTC

after updating rhel3 u4 named is stoped on my 5 name servers as well,
requiring manuall startup... it also seems to UN-Chkconfig named, as I
have had to chkconfig named on, in all incidents. below is really the
only log splat from named showing that it stops, I assume it should be
a restart, rather then a stop.

Jan 14 16:52:27 ns2 named:  succeeded
Jan 14 16:52:28 ns2 named[488]: shutting down: flushing changes
Jan 14 16:52:28 ns2 named[488]: stopping command channel on 127.0.0.1#953
Jan 14 16:52:30 ns2 named[488]: no longer listening on 127.0.0.1#53
Jan 14 16:52:30 ns2 named[488]: no longer listening on 205.150.101.9#53
Jan 14 16:52:35 ns2 named[488]: exiting

Comment 4 Jason Vas Dias 2005-01-14 17:18:36 UTC

Yes, unfortunately bind-9.2.4-5_EL3 has the problem of not restoring
the chkconfig state during upgrade ( bug #143786 ) . 
bind-9.2.4-7_EL3 fixes this issue and is undergoing QA testing 
prior to being pushed into the RHEL-3 update channels. 
Meanwhile, it can be downloaded from:
 http://people.redhat.com/~jvdias/bind/RHEL-3 

Does the named shut down occur only during / immediately after upgrade?

Or is it that after upgrade, you add back the chkconfig state, restart
named, and then it exits inexplicably at some much later time ?
If so, please collect the debug data as described in comment #1 above.

Comment 5 Chris Stankaitis 2005-01-14 17:39:33 UTC

Stop is a result of upgrade, and is done during the upgrade process.
once update is done a /etc/init.d/named start, starts the nameserver
and it runs without incident. (to date since patching)

Comment 6 Jason Vas Dias 2005-01-14 19:53:44 UTC

named service stop as a result of upgrade is the topic of 
bug 143786, 143558, 140528 and not of this bug, and is fixed
in bind-9.2.4-7_EL3.

Comment 7 Ronald Cole 2005-01-14 20:02:08 UTC

Chris, this is NOT bug 143786.  This problem happened to me AFTER
correcting the problem in bug 143786.  Unfortunately, after making the
changes in comment #1 to create a log and catch a core, it hasn't
crashed again.  I'll keep it running for another week and then change
it back if it doesn't fail by then.  Exec Shield was disabled by me
prior to updating to this version of bind because of bug 134782.

Comment 8 Jason Vas Dias 2005-01-17 17:31:39 UTC

Thanks for supplying the debug files .
I'm now running servers with the same sequence of queries your server
ran before exiting, so far with no exits.
Please can you tell me:

1. What architecture is your named running on ?
   ie. i386 / ia64 / ppc / ppc64 / s390 / s390x /  x86_64 ?

2. Does your machine have more than one CPU ? 
   If so, do you supply the '-n N' option to named to use more than
   one CPU ?
   If so, and i386 / x86_64 platform, is hyperthreading enabled ?


Thank You,
Jason Vas Dias.

Comment 9 Jason Vas Dias 2005-01-17 18:01:38 UTC

When you restart the server, after making the edits as described
above, so /etc/profile should say:
"
ulimit -c unlimited >/dev/null 2>&1
" 
please edit /etc/init.d/named, around line 52, to read:
"
	if [ $conf_ok -eq 1 ]; then	   
#	   daemon /usr/sbin/named -u named ${OPTIONS};
           /sbin/ulimit -c unlimited;
  /usr/sbin/named -u named ${OPTIONS} >/var/log/named.error.log 2>&1 ;
	   RETVAL=$?;
	else
"
ie. please add the 
  ' >/var/log/named.error.log 2>&1 ;'
to the 
  '/usr/sbin/named -u named ${OPTIONS}'
line .

named may be exiting on an assertion failure; if it does, then
it writes a message to stderr, and does not log it.

Thanks!

Comment 10 Ronald Cole 2005-01-18 05:13:50 UTC

Per comment #8: 1) i386.  2) machine is a dual Pentium III-850 and I
did not specify "-n N" for OPTIONS in /etc/sysconfig/named; as a
matter of fact, only ROOTDIR is set.

Per comment #9: done and restarted.

Comment 11 Jason Vas Dias 2005-01-18 15:37:47 UTC

Thanks for your help in trying to reproduce this problem.

I've now set up a RHEL-3 system running bind-9.2.4-5_EL3, which is 
making the sequence of all the queries you sent in the debug log
in a loop. It has now been running all night, with no exits so far,
and the process size is not growing .

To ensure that any core files are generated, it is probably better 
to '# chown named:named $ROOTDIR/var/named; 
    # rm -f $ROOTDIR/var/named/core.*;'

The most important thing is to catch what named writes when it exits,
with an error, by redirecting named's stdout & stderr when run with:
 /usr/sbin/named -u named ${OPTIONS} >/var/log/named.error.log 2>&1 ;
and then attach the /var/log/named.error.log to this bug when it 
exits - thanks.

Comment 12 Ronald Cole 2005-01-18 17:55:41 UTC

Re: comment #11; You do realize that you only asked me for the last
1000000 lines of the debug log, don't you?  I have the entire trace
available if you want it.

Comment 13 Sam Fewster 2005-04-27 13:12:33 UTC

I had named crash on me after U4, left this in /var/log/named.log:

adb.c:753: INSIST((((name) != ((void *)0)) && (((const isc__magic_t *)(name))-
>magic == ((('a') << 24 | ('d') << 16 | ('b') << 8 | ('N')))))) failed
exiting (due to assertion failure)

If this is any help...

Comment 14 Martin Stransky 2006-08-08 11:12:57 UTC

Is this bug still alive?

Comment 15 Ronald Cole 2006-08-08 17:54:18 UTC

The problem seems resolved.  I can't swear to it, but I think a glibc update
released shortly afterwards solved whatever the problem was.

Note You need to log in before you can comment on or make changes to this bug.