Description of problem: After migrating to RHEL5, customer's slave DNS server crashed: Apr 17 04:44:12 adminserver named[24814]: transfer of 'allstontrading.com/IN' from 192.168.12.33#53: end of transfer Apr 17 04:44:12 adminserver named[24814]: rwlock.c:337: fatal error: Apr 17 04:44:12 adminserver named[24814]: RUNTIME_CHECK(((pthread_mutex_destroy(((&rwl->lock))) == 0) ? 0 : 34) == 0) failed Apr 17 04:44:12 adminserver named[24814]: exiting (due to fatal error in library) This matches with an existing, closed bug whose fix has been incorporated into the version of bind the customer is running: https://bugzilla.redhat.com/show_bug.cgi?id=173961 Customer would like for us to analyze the application core to determine the root-cause of the issue. Version-Release number of selected component (if applicable): 5.1 How reproducible: haven't been able to reproduce in-house. customer core files also don't seem to give proper stack traces either. Steps to Reproduce: 1. run named 2. 3. Actual results: named dumps core after a period of time Expected results: named should run without dumping core Additional info: Please assist in the application core analysis, or help with setting up the proper environment for gdb. Using VSP I have created a test machine that replicates the customer's system. I then install all the debuginfo packages for the system from ftp.redhat.com. However, when I run gdb, I still am not able to view the core. It loads, but I only get ?? for symbols. 10.11.243.9 | dhcp243-9.rdu.redhat.com | username: root; password: redhat core is located at /core/core.24814 I am not sure what information I am missing to get the debug symbols to load. If I can get assitance for that I can start the analysis myself. Don't know if having multiple architectures for some debuginfo packages has overwrote information, etc. The customer can generate a stack trace: Here is the gdb output from the customer now with debuginfo packages loaded: ---------------------------------------------------------------------------- Core was generated by `/usr/sbin/named -u named'. Program terminated with signal 6, Aborted. #0 0x00002aaaac437055 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig); (gdb) where #0 0x00002aaaac437055 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00002aaaac438af0 in *__GI_abort () at abort.c:88 #2 0x0000555555573395 in library_fatal_error (file=0x2aaaabb9286d "rwlock.c", line=337, format=0x2aaaabb8f50a "RUNTIME_CHECK(%s) %s", args=0x409ffbc0) at ./main.c:190 #3 0x00002aaaabb6a804 in isc_error_fatal (file=0x6850 "e", line=26705, format=0x6 <Address 0x6 out of bounds>) at error.c:69 #4 0x00002aaaaaf38add in free_rbtdb (rbtdb=0x555571204770, log=1084226768, event=0x0) at rbtdb.c:457 #5 0x00002aaaabb7c7ac in run (uap=<value optimized out>) at task.c:869 #6 0x00002aaaabfbb2f7 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #7 0x00002aaaac4d585d in clone () from /lib64/libc.so.6 (gdb) bt #0 0x00002aaaac437055 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00002aaaac438af0 in *__GI_abort () at abort.c:88 #2 0x0000555555573395 in library_fatal_error (file=0x2aaaabb9286d "rwlock.c", line=337, format=0x2aaaabb8f50a "RUNTIME_CHECK(%s) %s", args=0x409ffbc0) at ./main.c:190 #3 0x00002aaaabb6a804 in isc_error_fatal (file=0x6850 "e", line=26705, format=0x6 <Address 0x6 out of bounds>) at error.c:69 #4 0x00002aaaaaf38add in free_rbtdb (rbtdb=0x555571204770, log=1084226768, event=0x0) at rbtdb.c:457 #5 0x00002aaaabb7c7ac in run (uap=<value optimized out>) at task.c:869 #6 0x00002aaaabfbb2f7 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #7 0x00002aaaac4d585d in clone () from /lib64/libc.so.6 (gdb) run -v Starting program: /usr/sbin/named -v [Thread debugging using libthread_db enabled] [New Thread 46912532599552 (LWP 4407)] BIND 9.3.3rc2 Program exited normally. (gdb) quit this can't be reproduced in-house. ------------------------------- Hi, I agree with Fabio that the issue comes from the destruction of a lock, when destroying it, a routine to verify its validity fails and receives the signal SIGABRT causing the application to stop for safety reasons. This lock could have been deleted or corrupted and we need to investigate this further in detail by having access to a core or more dumps - unfortunately we haven't been able so far in two different systems to obtain any symbols, perhaps the sosreport didn't correspond to the system that dumped the cores? The ideal for us to debug this would be to have a sosreport and a core from the very same machine, this way we would be able to debug this dump, otherwise we would have to speculate with the trace we have had pasted here. If they can't perform this or cannot provide us the dump and sosreport, if they can still debug this, could they provide : (gdb) thread apply all bt full Thanks! Let us know if there are concerns with our update, Jose ----------------------------- I've been looking at the message logs and I see an awful lot of messages like zone allstontrading.com/IN/eq: dns_journal_compact failed: out of range and malformed transaction: allstontrading.com.zone.jnl last serial 150279 != transaction first serial 150277 I'm wondering if this is causing the internal state of named to be confused hence the eventual core dump. I'm looking further into this and I'll keep you posted. alanm --------------------------------------------- One of the error messages: malformed transaction: allstontrading.com.zone.jnl last serial 150279 != transaction first serial 150277 When I researched it relates to the fact that they are using DDNS, which is why the serial numbers are out of order - it updates the dynamic .jnl file first, and then the main file, which results in the out of order serial numbers. Double-checking the other error message "zone allstontrading.com/IN/eq: dns_journal_compact failed: out of range" One of the error messages: malformed transaction: allstontrading.com.zone.jnl last serial 150279 != transaction first serial 150277 When I researched it relates to the fact that they are using DDNS, which is why the serial numbers are out of order - it updates the dynamic .jnl file first, and then the main file, which results in the out of order serial numbers. Double-checking the other error message "zone allstontrading.com/IN/eq: dns_journal_compact failed: out of range"
Created attachment 310797 [details] sosreport
Created attachment 310800 [details] core.24814.gz
Created attachment 310801 [details] gdb-output-with-debuginfo.txt
Created attachment 310802 [details] gdb.core.691
Created attachment 310804 [details] gdb_output
Created attachment 310806 [details] more_gdb
Created attachment 310808 [details] core.26704.gz
(In reply to comment #0) > > Double-checking the other error message "zone allstontrading.com/IN/eq: > dns_journal_compact failed: out of range" One of the error messages: > malformed transaction: allstontrading.com.zone.jnl last serial 150279 != > transaction first serial 150277 > > When I researched it relates to the fact that they are using DDNS, which is why > the serial numbers are out of order - it updates the dynamic .jnl file first, > and then the main file, which results in the out of order serial numbers. > DDNS is not reason. You have same slave zone file names in different views: - view "A" performed IXFR and created journal - view "B" starts IXFR and see journal created by view "A" (due same filename in different views) which ends with "malformed transaction" message. You have to use different slave file names in different views. Although this crash is bug somewhere in named (still not sure where exactly) I think it is reproducable only due this misconfiguration.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.