Bug 173961
Summary: | Bind crashes with pthread_mutex_destroy failed | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Bojan Smojver <bojan> |
Component: | bind | Assignee: | Jason Vas Dias <jvdias> |
Status: | CLOSED ERRATA | QA Contact: | Ben Levenson <benl> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.0 | CC: | james.gardiner |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2006-0288 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2006-08-10 21:14:21 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 181409 |
Description
Bojan Smojver
2005-11-23 03:57:03 UTC
C library is: glibc-2.3.4-2.13. BTW, this is a fully up-to-date RHEL4. Oh, one more thing: I got the RPMS that I'm running by building from SRPMS in relation to this bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169416 Any point in trying 12_EL4 first? It does look like a threading library bug at first glance...

An upstream bug was found, fixed in the 9.2.6 and 9.3.2 versions, that could explain this problem:

    1825. [bug] Missing UNLOCK() on out of memory error in rbtdb.c:subtractrdataset(). [RT #13519]

The code where the assertion fails:

    rbtdb.c:407: fatal error: RUNTIME_CHECK(((pthread_mutex_destroy(((&rbtdb->node_locks[i].lock))) == 0) ? 0 : 34) == 0) failed

shows that a lock that is being destroyed is in fact locked - which should never happen. It could potentially happen if named was transiently out of memory and bug 1825 resulted in the lock not being unlocked; an attempt was then made to destroy the lock, which triggered the assertion failure. And possibly this bug might have a bearing on the problem:

    1847. [bug] isc_ondestroy_init() is called too late in dns_rbtdb_create()/dns_rbtdb64_create(). [RT #13661]

Anyway, I've now gone through all the upstream patches that have been applied in the 9.2.5 and 9.2.6 ISC releases, and applied those that I can see fix genuine problems in the Linux source code, including the above two, to produce a bind-9.2.4-14_EL4 version, which is the candidate for the next RHEL-4 update release. All upstream bugs fixed are listed in the /usr/share/bind-9.2.4/CHANGES file. This release is now available from:

    http://people.redhat.com/~jvdias/bind/RHEL-4/9.2.4-14_EL4/

Please try it out and let me know of any issues - thanks.

It is possible that your named process is getting very large and eventually is running out of memory (named relies on an in-memory database for its zone and cache databases).
If, after you've installed the new bind-9.2.4-14_EL4, the named process runs for a long time and then exits with an 'out of memory' error, then this is the root cause of the problem. Perhaps consider using these named configuration options, as documented in the ARM (/usr/share/doc/bind-9.2.4/arm/Bv9ARM.html):

    cleaning-interval
        The server will remove expired resource records from the cache every cleaning-interval minutes. The default is 60 minutes.

    datasize
        The maximum amount of data memory the server may use.

If all else fails, and you can't add more physical memory (or swap) to the machine, consider a periodic cron job to do 'rndc flush' - this will clear out the cache, freeing up a lot of memory, but will incur a severe performance hit.

NOTE: in order to help diagnose rarely occurring named crashes due to assertion failures, it would be most helpful if you could please change this line in /etc/init.d/named.init, around line 97, from:

    if [ $conf_ok -eq 1 ]; then daemon /usr/sbin/named -u named ${OPTIONS};

to:

    if [ $conf_ok -eq 1 ]; then
        ulimit -c unlimited;
        /usr/sbin/named -u named ${OPTIONS};
        RETVAL=$?
        if [ $RETVAL -eq 0 ]; then success; else failure; fi;

This will enable any named assertion failure or memory access violation that may occur to generate a core dump, in $ROOTDIR/var/named/core.* . This file is likely to be very large, so please ensure you have enough disk space on the $ROOTDIR/var/named/ partition ($ROOTDIR may be set in /etc/sysconfig/named). If a $ROOTDIR/var/named/core.* file is generated, please compress it and send it to me: jvdias - this will enable me to determine exactly what the cause is.

> It is possible that your named process is getting very large and eventually
> is running out of memory ( named relies on an in-memory database for its
> zone and cache database )
Maybe, I'm not sure. In normal operation, named uses around 200 MB of RAM on
this machine, in order to serve 700+ zones. The box has 2 GB of RAM and is not
doing anything but serving DNS (it is a Sun Fire V20z box, with 2 x 2.2 GHz
Opterons). We have another box doing the exact same thing on exactly the same
hardware, using slightly less memory. Each box serves between 600 and
2,500 requests per second, if that somehow matters.
Thank you for your ultra quick response, pointers to DNS configuration and new
RPMS. I'll try them on and report back if the problem occurs again. These two
machines have been running for a significant time now and this was the only
incident we had, so the problem appears to be intermittent and probably depends
on conditions on our network (at times broken clients do weird things). So, it
may never occur again. I'm not sure if I should hope that it does or that it
does not, though ;-)
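For reference, the cleaning-interval and datasize tuning Jason suggests above might look like this in named.conf's options block. The values here are illustrative placeholders, not recommendations for this particular workload:

```
options {
    // scan the cache for expired records every 30 minutes
    // (default is 60)
    cleaning-interval 30;

    // cap the data memory named may use at 512 MB
    datasize 512M;
};
```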
Running it now. Will report back if/when it crashes.

Now the other machine's named process crashed, with a slightly different (but similar) error:

---------------------------------------
Dec 10 23:00:20.280 loading configuration from '/var/named/named.conf'
Dec 10 23:00:48.226 adb.c:2181: fatal error:
Dec 10 23:00:48.226 RUNTIME_CHECK(((pthread_mutex_destroy(((&adb->mplock))) == 0 ) ? 0 : 34) == 0) failed
Dec 10 23:00:48.242 exiting (due to fatal error in library)
---------------------------------------

Looks like it crashed on HUP, while reloading the data.

Looks like this version of bind didn't make it into U3...

Today, I had another one during the HUP:

---------------------------------------
Mar 15 23:00:41.611 adb.c:2181: fatal error:
Mar 15 23:00:41.612 RUNTIME_CHECK(((pthread_mutex_destroy(((&adb->mplock))) == 0 ) ? 0 : 34) == 0) failed
Mar 15 23:00:41.625 exiting (due to fatal error in library)
---------------------------------------

Hi Bojan - Thanks for the report. Please confirm:

o Which bind version are you running on the machine on which this occurred? Are you running the bind-9.2.4-16.EL4 or bind-9.3.2-1.EL4 test rpms from http://people.redhat.com/~jvdias/bind/RHEL-4 , or the standard bind-9.2.4-10 RPMs?

o Does the problem only occur when you do an 'rndc reload' or 'kill -HUP `pidof named`'? Do you do the 'rndc reload' or kill named manually?

o Is SELinux enabled in Enforcing or Permissive mode? # getenforce

The next time you're going to do an rndc reload, please enable named debugging, as follows:

1. If SELinux is Enabled, ensure you have this logging configuration in named.conf:
   'logging { channel default_debug { file "data/named.run"; severity dynamic;} };'
   OR that you do:
   # setsebool named_write_master_zones 1; chmod g+w $ROOTDIR/var/named

2.
Then, whenever you are about to do the 'rndc reload' / kill -HUP, do:

   # rndc trace 99
   # rndc reload

If the named process crashes, then there will be a debug log in $ROOTDIR/var/named/data/named.run or $ROOTDIR/var/named/named.run. Please send this file to me (jvdias) or append it to this bug report. If there is no crash, then do:

   # rndc trace 0

to turn off debugging. With the named.run data gathered as above, it should be possible to trace exactly why this problem is occurring - thanks.

> Which bind version are you running on the machine on which this occurred

bind-9.2.4-14_EL4

> Does the problem only occur when you do an 'rndc reload' or 'kill -HUP `pidof named`'

We always send HUP, so I cannot say about rndc reload. And the problem is so intermittent that it would take months to verify. But I can do it if it'll make a difference.

> Is SELinux enabled in Enforcing or Permissive mode?

Disabled

I'm interested in getting 9.3.x anyway, so I'll download the latest RPMS and try with that. I'm guessing these things have been tested to some extent, right? They won't just crash on me out of the blue? :-)

RE:
> I'm guessing these things have been tested to some extent, right?

Yes, bind-9.3.2-1.EL4 has passed the complete bind test suite and our internal regression tests - I've been using it for the past two months on RHEL-4 with no problems so far, as have other RHEL-4 customers who've requested it. It is not "officially supported" by Red Hat, but I, as the bind package maintainer, do undertake to support it.

RE:
> We always send HUP

I would definitely suggest using 'rndc reload' instead. The difference is that 'rndc reload' sends a message to named, so named processes it in its normal select() loop, while a HUP signal could interrupt named during a system call - perhaps this is what may be causing the problem. Does the problem ONLY occur when you send the HUP?
Anyway, since you are not using SELinux, named can create the normal $ROOTDIR/var/named/named.run file OK - so please, when you want to reload the named configuration, do:

   rndc trace 99; rndc reload; rndc trace 0;

You could even put these commands in a "reload_named" script. If the problem occurs, then the 'rndc trace 0' will fail, since named is no longer running, and there will be a $ROOTDIR/var/named/named.run file containing debug messages that might help us get to the root cause of this problem - thanks.

> Yes, bind-9.3.2-1.EL4 has passed the complete bind test suite, and our internal regression tests

Nice. I'll whack it on today.

> does the problem ONLY occur when you send the HUP

Yes. After you applied some backports to the RPM, the initial problems never occurred again. But the HUP problem started happening (this may have been there from the start, of course, but may simply not have been triggered). We'll definitely change the way we reload and will use rndc.

> rndc trace

Thanks for the tip. Will implement it in our reload script. Let's see what 9.3.2 does in the meantime...

Just had a crash with 9.3.2 (bind-9.3.2-1_EL4). The only message is this:

-------------------------------------
Apr 4 10:48:28 joule kernel: named[3827]: segfault at 000000000000002c rip 0000002a958332f9 rsp 00000000413ff7f0 error 4
-------------------------------------

No idea what that's supposed to mean (except that it's a segfault, of course). Anyhow, this was not during a reload or anything - just normal operation. Any suggestions welcome...

And again:

-------------------------------------
Apr 4 11:19:31 joule kernel: named[5063]: segfault at 000000000000002c rip 0000002a958332f9 rsp 00000000413ff7f0 error 4
-------------------------------------

Hi Bojan - Many thanks for reporting this.
Happily, the 2.6 x86_64 kernel has told us the instruction address where the crash occurred, and I can use GDB with bind-debuginfo installed:

    (gdb) break *0x2a958332f9
    Breakpoint 2 at 0x2a958332f9: file resolver.c, line 3797

which is in the function ncache_adderesult, which adds a result to the cache:

            } else if ( result == ISC_R_SUCCESS ) {
    3797:           if( NXDOMAIN(ardataset))
            ...

and NXDOMAIN is:

    #define NXDOMAIN(r) (((r)->attributes & DNS_RDATASETATTR_NXDOMAIN) != 0)

so it appears "ardataset" is NULL (an empty rdataset was returned for a query that did not return an error) - this is also a legally cacheable result, so the code should be checking for ( ardataset == NULL ) here.

This might possibly be due to one of these ISC BIND 9.3.2 bugs (for which ISC have not yet released the fix):

    1941. [bug] ncache_adderesult() should set eresult even if no rdataset is passed to it. [RT #15642]
    1939. [bug] The resolver could dereference a null pointer after validation if all the queries have timed out. [RT #15528]

I've sent my proposed patch upstream to check whether this is a known issue or not. Anyway, there appears to be a simple fix for this issue - I've applied it in bind-9.3.2-2.EL4, available from: http://people.redhat.com/~jvdias/bind/RHEL-4/9.3.2-2.EL4 . This version also passes all tests - please try it out and let me know of any issues.

Regards,
Jason Vas Dias
BIND package maintainer

WOW - that was quick. Thanks heaps! I'll be putting this baby in production today and will report back if it breaks.

In production now. Fingers crossed...

This issue is on Red Hat Engineering's list of planned work items for the upcoming Red Hat Enterprise Linux 4.4 release. Engineering resources have been assigned and, barring unforeseen circumstances, Red Hat intends to include this item in the 4.4 release.

So far, so good with bind-9.3.2-2_EL4...

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0288.html

*** Bug 202393 has been marked as a duplicate of this bug. ***