Bug 453786

Summary: named crashed and generated a core
Product: Red Hat Enterprise Linux 5 Reporter: Alan Matsuoka <alanm>
Component: bindAssignee: Adam Tkac <atkac>
Status: CLOSED WONTFIX QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: 5.1CC: duck, ovasik, riek, rvokal, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-02-03 21:13:39 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
sosreport
none
core.24814.gz
none
gdb-output-with-debuginfo.txt
none
gdb.core.691
none
gdb_output
none
more_gdb
none
core.26704.gz none

Description Alan Matsuoka 2008-07-02 14:54:51 UTC
Description of problem:

After migrating to RHEL5, customer's slave DNS server crashed:
Apr 17 04:44:12 adminserver named[24814]: transfer of 'allstontrading.com/IN'
from 192.168.12.33#53: end of transfer
Apr 17 04:44:12 adminserver named[24814]: rwlock.c:337: fatal error:
Apr 17 04:44:12 adminserver named[24814]:
RUNTIME_CHECK(((pthread_mutex_destroy(((&rwl->lock))) == 0) ? 0 : 34) == 0) failed
Apr 17 04:44:12 adminserver named[24814]: exiting (due to fatal error in library)

This matches with an existing, closed bug whose fix has been incorporated into
the version of bind the customer is running:
https://bugzilla.redhat.com/show_bug.cgi?id=173961

Customer would like for us to analyze the application core to determine the
root-cause of the issue.


Version-Release number of selected component (if applicable):
5.1

How reproducible:

haven't been able to reproduce in-house. customer core files also don't seem to
give proper stack traces either.

Steps to Reproduce:
1. run named
2.
3.
  
Actual results:

named dumps core after a period of time

Expected results:
named should run without dumping core

Additional info:

Please assist in the application core analysis, or help with setting up the
proper environment for gdb.  Using VSP I have created a test machine that
replicates the customer's system.  I then install all the debuginfo packages for
the system from ftp.redhat.com.  However, when I run gdb, I still am not able to
view the core.  It loads, but I only get ?? for symbols.

10.11.243.9    |   dhcp243-9.rdu.redhat.com    |  username: root; password:  redhat
core is located at /core/core.24814

I am not sure what information I am missing to get the debug symbols to load. 
If I can get assitance for that I can start the analysis myself.  Don't know if
having multiple architectures for some debuginfo packages has overwrote
information, etc.

The customer can generate a stack trace:
Here is the gdb output from the customer now with debuginfo packages loaded:
----------------------------------------------------------------------------

Core was generated by `/usr/sbin/named -u named'.
Program terminated with signal 6, Aborted.
#0  0x00002aaaac437055 in *__GI_raise (sig=<value optimized out>) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) where
#0  0x00002aaaac437055 in *__GI_raise (sig=<value optimized out>) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00002aaaac438af0 in *__GI_abort () at abort.c:88
#2  0x0000555555573395 in library_fatal_error (file=0x2aaaabb9286d "rwlock.c",
line=337, format=0x2aaaabb8f50a "RUNTIME_CHECK(%s) %s",
   args=0x409ffbc0) at ./main.c:190
#3  0x00002aaaabb6a804 in isc_error_fatal (file=0x6850 "e", line=26705,
format=0x6 <Address 0x6 out of bounds>) at error.c:69
#4  0x00002aaaaaf38add in free_rbtdb (rbtdb=0x555571204770, log=1084226768,
event=0x0) at rbtdb.c:457
#5  0x00002aaaabb7c7ac in run (uap=<value optimized out>) at task.c:869
#6  0x00002aaaabfbb2f7 in start_thread (arg=<value optimized out>) at
pthread_create.c:296
#7  0x00002aaaac4d585d in clone () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaac437055 in *__GI_raise (sig=<value optimized out>) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00002aaaac438af0 in *__GI_abort () at abort.c:88
#2  0x0000555555573395 in library_fatal_error (file=0x2aaaabb9286d "rwlock.c",
line=337, format=0x2aaaabb8f50a "RUNTIME_CHECK(%s) %s",
   args=0x409ffbc0) at ./main.c:190
#3  0x00002aaaabb6a804 in isc_error_fatal (file=0x6850 "e", line=26705,
format=0x6 <Address 0x6 out of bounds>) at error.c:69
#4  0x00002aaaaaf38add in free_rbtdb (rbtdb=0x555571204770, log=1084226768,
event=0x0) at rbtdb.c:457
#5  0x00002aaaabb7c7ac in run (uap=<value optimized out>) at task.c:869
#6  0x00002aaaabfbb2f7 in start_thread (arg=<value optimized out>) at
pthread_create.c:296
#7  0x00002aaaac4d585d in clone () from /lib64/libc.so.6
(gdb) run -v
Starting program: /usr/sbin/named -v
[Thread debugging using libthread_db enabled]
[New Thread 46912532599552 (LWP 4407)]
BIND 9.3.3rc2

Program exited normally.
(gdb) quit


this can't be reproduced in-house.
-------------------------------
Hi,
I agree with Fabio that the issue comes from the destruction of a lock, when
destroying it, a routine to verify its validity fails and receives the signal
SIGABRT causing the application to stop for safety reasons. This lock could have
been deleted or corrupted and we need to investigate this further in detail by
having access to a core or more dumps - unfortunately we haven't been able so
far in two different systems to obtain any symbols, perhaps the sosreport didn't
correspond to the system that dumped the cores?

The ideal for us to debug this would be to have a sosreport and a core from the
very same machine, this way we would be able to debug this dump, otherwise we
would have to speculate with the trace we have had pasted here.

If they can't perform this or cannot provide us the dump and sosreport, if they
can still debug this, could they provide :

 (gdb) thread apply all bt full

Thanks!
Let us know if there are concerns with our update,

    Jose
-----------------------------
I've been looking at the message logs and I see an awful lot of messages like
zone allstontrading.com/IN/eq: dns_journal_compact failed: out of range

and  malformed transaction: allstontrading.com.zone.jnl last serial 150279 !=
transaction first serial 150277

I'm wondering if this is causing the internal state of named to be confused
hence the eventual core dump.

I'm looking further into this and I'll keep you posted.
alanm
---------------------------------------------

One of the error messages:
malformed transaction: allstontrading.com.zone.jnl last serial 150279 !=
transaction first serial 150277

When I researched it relates to the fact that they are using DDNS, which is why
the serial numbers are out of order - it updates the dynamic .jnl file first,
and then the main file, which results in the out of order serial numbers.

Double-checking the other error message "zone allstontrading.com/IN/eq:
dns_journal_compact failed: out of range" One of the error messages:
malformed transaction: allstontrading.com.zone.jnl last serial 150279 !=
transaction first serial 150277

When I researched it relates to the fact that they are using DDNS, which is why
the serial numbers are out of order - it updates the dynamic .jnl file first,
and then the main file, which results in the out of order serial numbers.

Double-checking the other error message "zone allstontrading.com/IN/eq:
dns_journal_compact failed: out of range"

Comment 1 Alan Matsuoka 2008-07-02 14:54:53 UTC
Created attachment 310797 [details]
sosreport

Comment 4 Alan Matsuoka 2008-07-02 15:00:16 UTC
Created attachment 310800 [details]
core.24814.gz

Comment 5 Alan Matsuoka 2008-07-02 15:01:10 UTC
Created attachment 310801 [details]
gdb-output-with-debuginfo.txt

Comment 6 Alan Matsuoka 2008-07-02 15:01:28 UTC
Created attachment 310802 [details]
gdb.core.691

Comment 7 Alan Matsuoka 2008-07-02 15:01:46 UTC
Created attachment 310804 [details]
gdb_output

Comment 8 Alan Matsuoka 2008-07-02 15:02:07 UTC
Created attachment 310806 [details]
more_gdb

Comment 9 Alan Matsuoka 2008-07-02 15:05:15 UTC
Created attachment 310808 [details]
core.26704.gz

Comment 11 Adam Tkac 2008-09-11 15:45:40 UTC
(In reply to comment #0)
> 
> Double-checking the other error message "zone allstontrading.com/IN/eq:
> dns_journal_compact failed: out of range" One of the error messages:
> malformed transaction: allstontrading.com.zone.jnl last serial 150279 !=
> transaction first serial 150277
> 
> When I researched it relates to the fact that they are using DDNS, which is why
> the serial numbers are out of order - it updates the dynamic .jnl file first,
> and then the main file, which results in the out of order serial numbers.
> 

DDNS is not reason. You have same slave zone file names in different views:

- view "A" performed IXFR and created journal
- view "B" starts IXFR and see journal created by view "A" (due same filename in different views) which ends with "malformed transaction" message.

You have to use different slave file names in different views.

Although this crash is bug somewhere in named (still not sure where exactly) I think it is reproducable only due this misconfiguration.

Comment 14 RHEL Program Management 2009-02-03 21:13:39 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.