From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3) Gecko/20040924

Description of problem:
Yes, I know it's a bad summary, but I cannot reproduce it at will, nor predict when or why it will fail, because it leaves no indication of what went wrong in a log file anywhere. It just dies suddenly and unexpectedly! Named had been running flawlessly for me since I installed it over a year ago, until I upgraded to RHEL3ES Update 4. Anyway, I open this bug so that others who've had it mysteriously crash on them after the upgrade can add to the braintrust required to resolve it.

Version-Release number of selected component (if applicable):
bind-9.2.4-5_EL3

How reproducible:
Sometimes

Steps to Reproduce:
1. Upgrade bind (and maybe caching-nameserver) to Update 4.

Actual Results:
Named will die after a day or two, unpredictably and mysteriously.

Expected Results:
Named to run and run and run, as it used to before Update 4.

Additional info:
Also updated at the time was caching-nameserver to 7.3-3_EL3.
What kind of nameserver are you running? Is it a caching-only nameserver? If not, you must have replaced the caching-nameserver config files with your own. It would be most helpful in diagnosing the problem to see these config files. If you don't want to append them to this bug, where they can be publicly viewed, please send them to jvdias and I'll treat them as confidential.

If it is serving external clients, what is the load like on the server? Doing an 'rndc stats' after the server has been under normal load for at least an hour, and appending the stats file to the bug or sending it to me, would be helpful. The caching-nameserver default config files set the stats file to:
statistics-file "data/named.stats.txt";

Are there really no 'named:' log messages in /var/log/messages from just before the named process exits, such as 'out of memory' errors? If there are 'out of memory' errors, you can adjust the 'ulimit' of the named process. Do you set 'stacksize' or 'datasize' in the named.conf 'options { }' block?

If not, it may be that named is experiencing a memory access violation. In that case it would dump a core file in $ROOTDIR/var/named, where $ROOTDIR is set in /etc/sysconfig/named, but only if core file generation were not disabled by default, which it is. Please edit the /etc/profile file, around line 28, to read:
"
# No core files by default
# ulimit -S -c 0 > /dev/null 2>&1
ulimit -c unlimited >/dev/null 2>&1
"
and edit /etc/init.d/named, around line 52, to read:
"
if [ $conf_ok -eq 1 ]; then
    # daemon /usr/sbin/named -u named ${OPTIONS};
    /sbin/ulimit -c unlimited;
    /usr/sbin/named -u named ${OPTIONS};
    RETVAL=$?;
else
"
Then, if named generates a memory access violation, there will be a $ROOTDIR/var/named/core.$pid file, where $pid is the process id of the named process which crashed. If there are any of these files, please gzip them and append them to this bug or send them to me.

Gathering some named logging information would also be useful.
If the load on the server is light, and you have loads of disk space (AT LEAST 1GB) on the $ROOTDIR/var/named partition, do an 'rndc trace 99' after the named process starts. named will then create a file in $ROOTDIR/var/named called 'named.run', containing debug log messages. If the crash is then observed, do a
tail -1000000 $ROOTDIR/var/named/named.run | gzip -9 > /tmp/named.run.gz
and append the /tmp/named.run.gz file to this bug or send it to me.

Alternatively, if you do not have at least 1GB of disk space on $ROOTDIR/var/named, then please put this logging definition in named.conf:
logging {
    channel named_debug {
        file "/tmp/named.dbg.log" versions unlimited size 10M;
        print-time yes;
        severity debug 99;
    };
    category default_debug { named_debug; };
    category queries { named_debug; };
};
Then files like /tmp/named.dbg.1, /tmp/named.dbg.2, etc. will be created, containing debugging and query data. If the named process again exits, please compress the last (highest numbered) /tmp/named.dbg.[0-9]* file and append it to this bug or send it to me.

It is rumoured on the bind-* mailing lists that there are certain queries of certain zones, usually with misconfigured nameservers, that can cause bind9 nameservers to exit, though I've never been able to pin one down yet. Logging queries with the above logging statement, or just with 'rndc querylog' to send query logs to syslog, would be invaluable in diagnosing this problem.

We have BIND 9.2.4 servers in our test lab that have been running uninterrupted for months on RHEL-3 systems, so I don't believe the problem will turn out to be a named programming error. I'm sorry for the problems you are experiencing and will do my best to fix the root cause of the problem. Your help in gathering the information necessary to do so would be much appreciated.
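The core-file check requested earlier in this comment could be scripted roughly as follows. This is only a sketch: the helper name list_cores is hypothetical, and it assumes ROOTDIR has been taken from /etc/sysconfig/named (it falls back to an empty prefix if unset).

```shell
# Hypothetical helper: print any core.<pid> files found in a directory.
list_cores() {
    dir=$1
    for f in "$dir"/core.*; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Report the core-size limit of the current shell ("0" means disabled).
printf 'core limit: %s\n' "$(ulimit -c)"

# Assumption: ROOTDIR is the value set in /etc/sysconfig/named.
list_cores "${ROOTDIR:-}/var/named"
```

Note that the limit printed here is that of the shell running the check, which is not necessarily the limit the running named process inherited from its init script.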
Kind of nameserver: started with "caching-nameserver" and added forward and reverse zones for my local network. Configuration emailed to you. I don't know exactly when the named process exits. There are only "lame server" entries in /var/log/messages after the startup messages. And, yes, prior to installing Update 4, I had also been running BIND 9.2.4 uninterrupted for months on three different machines (albeit with different forward and reverse zones for the different local networks).
After updating to RHEL3 U4, named is stopped on my 5 name servers as well, requiring manual startup. It also seems to un-chkconfig named, as I have had to run 'chkconfig named on' in all incidents. Below is really the only log output from named showing that it stops; I assume it should be a restart rather than a stop.
Jan 14 16:52:27 ns2 named: succeeded
Jan 14 16:52:28 ns2 named[488]: shutting down: flushing changes
Jan 14 16:52:28 ns2 named[488]: stopping command channel on 127.0.0.1#953
Jan 14 16:52:30 ns2 named[488]: no longer listening on 127.0.0.1#53
Jan 14 16:52:30 ns2 named[488]: no longer listening on 205.150.101.9#53
Jan 14 16:52:35 ns2 named[488]: exiting
Yes, unfortunately bind-9.2.4-5_EL3 has the problem of not restoring the chkconfig state during upgrade (bug #143786). bind-9.2.4-7_EL3 fixes this issue and is undergoing QA testing prior to being pushed into the RHEL-3 update channels. Meanwhile, it can be downloaded from:
http://people.redhat.com/~jvdias/bind/RHEL-3
Does the named shutdown occur only during or immediately after the upgrade? Or is it that, after the upgrade, you add back the chkconfig state, restart named, and then it exits inexplicably at some much later time? If so, please collect the debug data as described in comment #1 above.
The stop is a result of the upgrade, and happens during the upgrade process. Once the update is done, '/etc/init.d/named start' starts the nameserver and it runs without incident (to date, since patching).
The named service stopping as a result of the upgrade is the topic of bugs 143786, 143558 and 140528, not of this bug, and is fixed in bind-9.2.4-7_EL3.
Chris, this is NOT bug 143786. This problem happened to me AFTER correcting the problem in bug 143786. Unfortunately, after making the changes in comment #1 to create a log and catch a core, it hasn't crashed again. I'll keep it running for another week and then change it back if it doesn't fail by then. Exec Shield was disabled by me prior to updating to this version of bind because of bug 134782.
Thanks for supplying the debug files. I'm now running servers with the same sequence of queries your server ran before exiting, so far with no exits. Please can you tell me:
1. What architecture is your named running on, i.e. i386 / ia64 / ppc / ppc64 / s390 / s390x / x86_64?
2. Does your machine have more than one CPU? If so, do you supply the '-n N' option to named to use more than one CPU? If so, and the platform is i386 / x86_64, is hyperthreading enabled?
Thank You,
Jason Vas Dias.
When you restart the server after making the edits described above, so that /etc/profile says:
"
ulimit -c unlimited >/dev/null 2>&1
"
please edit /etc/init.d/named, around line 52, to read:
"
if [ $conf_ok -eq 1 ]; then
    # daemon /usr/sbin/named -u named ${OPTIONS};
    /sbin/ulimit -c unlimited;
    /usr/sbin/named -u named ${OPTIONS} >/var/log/named.error.log 2>&1 ;
    RETVAL=$?;
else
"
i.e. please add the ' >/var/log/named.error.log 2>&1 ;' to the '/usr/sbin/named -u named ${OPTIONS}' line. named may be exiting on an assertion failure; if it does, it writes a message to stderr and does not log it. Thanks!
Per comment #8: 1) i386. 2) machine is a dual Pentium III-850 and I did not specify "-n N" for OPTIONS in /etc/sysconfig/named; as a matter of fact, only ROOTDIR is set. Per comment #9: done and restarted.
Thanks for your help in trying to reproduce this problem. I've now set up a RHEL-3 system running bind-9.2.4-5_EL3, which is making the sequence of all the queries you sent in the debug log in a loop. It has now been running all night, with no exits so far, and the process size is not growing.
To ensure that any core files are generated, it is probably better to run:
# chown named:named $ROOTDIR/var/named;
# rm -f $ROOTDIR/var/named/core.*;
The most important thing is to catch what named writes when it exits with an error, by redirecting named's stdout & stderr when it is run:
/usr/sbin/named -u named ${OPTIONS} >/var/log/named.error.log 2>&1 ;
and then attach the /var/log/named.error.log to this bug when it exits. Thanks.
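The redirection asked for above can be wrapped in a small shell function that also records the exit status next to whatever was written to stderr. This is a minimal sketch, not part of the shipped init script: the name run_logged is hypothetical, and it assumes the command stays attached to the file descriptors it was started with rather than detaching and closing them.

```shell
# Hypothetical helper: run a command, append its stdout and stderr to a
# log file, and record the exit status in the same log.
run_logged() {
    log=$1
    shift
    "$@" >>"$log" 2>&1
    status=$?
    printf '%s exited with status %s\n' "$1" "$status" >>"$log"
    return "$status"
}
```

As an illustration with a stand-in command instead of named, `run_logged /tmp/demo.log sh -c 'echo "assertion failed" >&2; exit 1'` leaves both the message and the status line in /tmp/demo.log.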
Re: comment #11; You do realize that you only asked me for the last 1000000 lines of the debug log, don't you? I have the entire trace available if you want it.
I had named crash on me after U4; it left this in /var/log/named.log:
adb.c:753: INSIST((((name) != ((void *)0)) && (((const isc__magic_t *)(name))->magic == ((('a') << 24 | ('d') << 16 | ('b') << 8 | ('N')))))) failed
exiting (due to assertion failure)
If this is any help...
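Read literally, the failed assertion checks that 'name' is not a null pointer and that its 'magic' field equals the four characters 'a', 'd', 'b', 'N' packed into one 32-bit value. The sketch below only reproduces that arithmetic so the constant can be recognised in a debugger or core dump; the function name magic is hypothetical and nothing here is taken from the BIND sources beyond the expression quoted in the log.

```shell
# Hypothetical helper: pack four ASCII characters into a 32-bit value,
# mirroring the shift-and-or expression in the assertion text.
magic() {
    a=$(printf '%d' "'$1")
    b=$(printf '%d' "'$2")
    c=$(printf '%d' "'$3")
    d=$(printf '%d' "'$4")
    printf '0x%08X\n' "$(( (a << 24) | (b << 16) | (c << 8) | d ))"
}

magic a d b N
```

For these four characters the result is 0x6164624E. A 'magic' field holding any other value at that point would suggest the structure was freed, overwritten, or never initialised, though the log line alone does not say which.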
Is this bug still alive?
The problem seems resolved. I can't swear to it, but I think a glibc update released shortly afterwards solved whatever the problem was.