Bug 400461

Summary:	named segfault after receiving a transfer/sending notifies
Product:	[Fedora] Fedora	Reporter:	Charles R. Anderson <cra>
Component:	bind	Assignee:	Adam Tkac <atkac>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	8	CC:	bilias, fs, k.georgiou, mark, ovasik, rhbz001, scott
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	9.5.0-26.b3.fc8	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-05-10 13:50:08 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Charles R. Anderson 2007-11-27 01:53:22 UTC

Description of problem:

named crashed with a segfault after transferring a zone from the master and
sending out notifies.  The server is configured as a slave server with multiple
zones and views and a recursive query caching server.  There are no other
servers that it needs to be notifying, but the default configuration appears to
invoke the "sending notifies" routine.

Version-Release number of selected component (if applicable):
9.5.0-18.a7.fc8

Actual results:

Here are some syslog messages from right before the crash:

Nov 26 19:21:19 slave named[25468]: client w.x.y.z#53320: view global: received
notify for zone 'SUB.DOM.FOO'
Nov 26 19:21:19 slave named[25468]: client w.x.y.z#53320: view global: received
notify for zone 'x.w.IN-ADDR.ARPA'
Nov 26 19:21:19 slave named[25468]: zone SUB.DOM.FOO/IN/global: Transfer started.
Nov 26 19:21:19 slave named[25468]: transfer of 'SUB.DOM.FOO/IN/global' from
w.x.y.z#53: connected using w.x.a.b#52660
Nov 26 19:21:19 slave named[25468]: zone SUB.DOM.FOO/IN/global: transferred
serial 357792
Nov 26 19:21:19 slave named[25468]: transfer of 'SUB.DOM.FOO/IN/global' from
w.x.y.z#53: end of transfer
Nov 26 19:21:19 slave named[25468]: zone SUB.DOM.FOO/IN/global: sending notifies
(serial 357792)
Nov 26 19:21:20 slave named[25468]: zone x.w.IN-ADDR.ARPA/IN/global: Transfer
started.
Nov 26 19:21:20 slave named[25468]: transfer of 'x.w.IN-ADDR.ARPA/IN/global'
from w.x.y.z#53: connected using w.x.a.b#52373
Nov 26 19:21:20 slave named[25468]: zone x.w.IN-ADDR.ARPA/IN/global: transferred
serial 382352
Nov 26 19:21:20 slave named[25468]: transfer of 'x.w.IN-ADDR.ARPA/IN/global'
from w.x.y.z#53: end of transfer
Nov 26 19:21:20 slave named[25468]: zone x.w.IN-ADDR.ARPA/IN/global: sending
notifies (serial 382352)
Nov 26 19:21:20 slave kernel: named[25471]: segfault at 00000014 eip 001a7996
esp b6a82db0 error 4

Additional Info:

The configuration has not changed.  The system was recently upgraded from FC6 to
F8.  The bind package was updated on November 23, and had been running since
then without problems until today.

Comment 1 Lubomir Kundrak 2007-11-27 13:42:16 UTC

Are you able to reproduce the problem? Do you have a core dump?

Comment 2 Adam Tkac 2007-11-27 14:35:37 UTC

Could you please tell me what arch do you have (and if you're running named
multithreaded) and attach configuration file + full log?

Thanks

Comment 3 Charles R. Anderson 2007-11-29 06:14:39 UTC

Problem hasn't happened again yet.  I don't have a core dump, but this is what I
have done to hopefully get a core dump if it happens again:

/etc/sysconfig/named:
DAEMON_COREFILE_LIMIT=unlimited

/etc/profile:
ulimit -S -c unlimited > /dev/null 2>&1

# getsebool -a | grep core
allow_daemons_dump_core --> on

/etc/named.conf:
options {
        coresize unlimited;
};

This is i386 (i686) arch.  Not sure if I'm running multithreaded--the
/etc/sysconfig/named is default except for modification mentioned above.

Comment 4 Adam Tkac 2007-11-29 08:14:47 UTC

Thanks for your cooperation, best approach. You will see how many threads you
have in log. For example
Nov 29 09:12:31 evileye named[18117]: starting BIND 9.5.0a7 -u named
Nov 29 09:12:31 evileye named[18117]: found 2 CPUs, using 2 worker threads

Comment 5 Charles R. Anderson 2007-11-30 15:25:06 UTC

Now named has crashed on a different server.  I've now set up all my named's to
generate coredumps.  Hopefully we won't have to wait more than a few more days
for one of them to crash, although they had all been running since Nov 22, so it
may take as long as 8 days from today since I just restarted them all.

This time it was the shadow master named that crashed, again right after
"sending notifies":

Nov 30 09:49:51 master named[7316]: zone SUB.FOO.DOM/IN/global: sending notifies
(serial 379376)
Nov 30 09:49:51 master kernel: named[7317]: segfault at 0000001c eip 001a7996
esp b7ea9db0 error 4

The master is running 4 threads:

Nov 30 09:56:14 master named[7681]: found 4 CPUs, using 4 worker threads

The previous slave that crashed is also running 4 threads:

Nov 26 22:31:34 slave2 named[31792]: found 4 CPUs, using 4 worker threads

Comment 6 Adam Tkac 2007-11-30 17:18:05 UTC

Yes, it looks that crash will be related to notify messages. I recommend you add
"notify master-only;" option to control statement on slave servers. It will
improve stability and I'm going to try reproduce this problem next week in our lab.

Comment 7 Charles R. Anderson 2007-12-03 23:28:29 UTC

It crashed again twice on the master, but didn't generate a core dump.  I'm not
sure what is going on...I had put SELinux into permissive mode because
allow_daemons_dump_core only allows coredumps in / (named's CWD is /var/named).
 So SELinux cannot be causing the coredumps to fail.

Dec  3 12:57:22 master kernel: named[18004]: segfault at 00000014 eip 001a7996 
esp b7ea2db0 error 4
Dec  3 13:13:24 master named[21934]: starting BIND 9.5.0a7 -u named
Dec  3 13:13:24 master named[21934]: found 4 CPUs, using 4 worker threads

Any suggestions on how to get the thing to dump core next time?

Comment 8 Charles R. Anderson 2007-12-04 00:02:42 UTC

By doing "sh -x /etc/init.d/named restart" I confirm that ulimit is being set
correctly before starting the process:

+ /bin/bash -c 'ulimit -S -c unlimited >/dev/null 2>&1 ; /usr/sbin/named -u named'

I turned on debugging:

/etc/sysconfig/named:
OPTIONS="-d 99"

/etc/named.conf:
logging {
      channel default_debug {
              file "data/named.run";
              severity dynamic;
      };
};

Comment 9 Adam Tkac 2007-12-04 13:14:52 UTC

I did some investigation about coredumps and it looks that named wrongly sets
limit itself. I'm not able to generate coredump with all versions (9.3, 9.4 and
9.5). Let me check it, looks like next bug for me. Also be careful with -d99
option because log will be really huge.

Comment 10 Adam Tkac 2007-12-04 14:46:17 UTC

I the end problem with coredumps is in setuid function (see man setuid(2)).
named is not able to generate core files when is started with -u option. You
don't have to try get core file because it's quite risky run named without -u.
I'm also trying reproduce this one.

Comment 11 Charles R. Anderson 2007-12-04 23:49:19 UTC

named wouldn't start without the -u named option, or with -u root substituted. 
It seems that it is dropping privs (CAP_DAC_OVERRIDE?) even when not doing
setuid, such that it cannot write to /var/named/data or indeed any of the
logging channel files which are owned by named.

Comment 12 Charles R. Anderson 2007-12-05 00:09:58 UTC

Interesting, this piece of code appears to not be working:

bin/named/unix/os.c:258
        /* allow us to continue to be able to generate coredumps after setuid: 
                                                                               
                                      
         */
        prctl(PR_SET_DUMPABLE,1,0,0,0);

Comment 13 Frank Sweetser 2007-12-05 15:06:15 UTC

Poking around on the affected servers, I found /proc/sys/fs/suid_dumpable was
set to 0.  Neither of the man pages (proc(5) and prctl(2)) specify which setting
takes preference when they conflict, so just to be safe I've set
fs/suid_dumpable to 2 on all the relevant servers.

Comment 14 Adam Tkac 2007-12-05 16:37:19 UTC

(In reply to comment #12)
> bin/named/unix/os.c:258
>         /* allow us to continue to be able to generate coredumps after setuid: 
>                                    
>          */
>         prctl(PR_SET_DUMPABLE,1,0,0,0);
> 

Upstream also doesn't know if this statement works as expected :) I've setup two
slaves and one master and I'm trying to reproduce this problem with simple script:

#!/bin/bash

let a=0

while [ 0 -eq -0 ]; do
        cat /var/named/test | sed 's/[0-9]*;serial/'`echo $a`';serial/' >
/var/named/test.new
        mv -f /var/named/test.new /var/named/test
        rndc reload;
        ((++a))
        sleep 3;
done

I think I should get core file tomorrow.

Comment 15 Charles R. Anderson 2007-12-05 23:41:24 UTC

Re Comment #13, I don't think setting /proc/sys/fs/suid_dumpable to 2 is a good
idea.  It should be set to 1 instead:

    PR_SET_DUMPABLE
              (Since  Linux  2.3.20)  Set  the  state  of the flag determining
              whether core dumps are produced for this process  upon  delivery
              of  a  signal whose default behaviour is to produce a core dump.
              (Normally this flag is set for a process by default, but  it  is
              cleared  when  a set-user-ID or set-group-ID program is executed
              and also by various system calls that  manipulate  process  UIDs
              and  GIDs).  In kernels up to and including 2.6.12, arg2 must be
              either 0 (process is not dumpable) or 1 (process  is  dumpable).
              Between  kernels 2.6.13 and 2.6.17, the value 2 was also permit-
              ted, which caused any binary which normally would not be  dumped
              to  be  dumped readable by root only; for security reasons, this
              feature  has  been  removed.   (See  also  the  description   of
              /proc/sys/fs/suid_dumpable in proc(5).)

proc(5) doesn't mention the comment about the 2 feature being removed after
2.6.17, but I would rather be conservative about it right now, and just set
/proc/sys/fs/suid_dumpable to 1.

Also, I noticed that /usr/sbin/named was set to mode 750.  I don't know why, but
this seems like a bug to me.  There is no reason binaries in /usr/sbin shouldn't
be mode 755 just like binaries in /usr/bin.  I've changed the permissions to 755
 due to another comment I read in proc(5): "core dump will not be produced for a
process ... whose binary does not have read permission enabled.":

       /proc/sys/fs/suid_dumpable (since Linux 2.6.13)
              The  value  in  this file determines whether core dump files are
              produced for set-user-ID or  otherwise  protected/tainted  bina-
              ries.  Three different integer values can be specified:

              0 (default)  This  provides  the  traditional (pre-Linux 2.6.13)
              behaviour.  A core dump will not be produced for a process which
              has  changed  credentials  (by calling seteuid(2), setgid(2), or
              similar, or by executing a set-user-ID or set-group-ID  program)
              or whose binary does not have read permission enabled.

              1 ("debug")  All  processes  dump  core when possible.  The core
              dump is owned by the file system user ID of the dumping  process
              and  no security is applied.  This is intended for system debug-
              ging situations only.  Ptrace is unchecked.

              2 ("suidsafe") Any binary which normally  would  not  be  dumped
              (see  "0"  above)  is dumped readable by root only.  This allows
              the user to remove the core dump file but not to read  it.   For
              security  reasons core dumps in this mode will not overwrite one
              another or other files.  This mode is appropriate when  adminis-
              trators  are  attempting  to debug problems in a normal environ-
              ment.

Comment 16 Adam Tkac 2007-12-11 14:40:25 UTC

Two slaves run five days without segfault. I still don't know where problem is...

Comment 17 Mark Andrews 2008-01-04 05:31:17 UTC

Anyone checked to see if ~!@#@$%%^& SELinux is getting in the way?

The directory listed in named.conf is SUPPOSED to be WRITABLE by
the user named is running as.

Comment 18 Charles R. Anderson 2008-01-04 05:46:15 UTC

Yes, SELinux has been checked.  The code in BIND that sets/drops capabilities is
a huge mess.  There are bugs lurking in there.  I haven't had time to sort it
out yet, but I'm pretty sure thats why it won't dump core even with SELinux
disabled.

Comment 19 John Holmstadt 2008-01-09 15:07:34 UTC

I'm seeing a similar problem on our master system. Thought I'd post some logfile
info if it was of any help...

[root@myserver ~]# grep -B1 segfault /var/log/messages*
/var/log/messages-Jan  9 04:16:42 myserver named[12846]: client
10.1.1.106#35938: updating zone 'mydomain.com/IN': deleting rrset at
'cwestber.mydomain.com' A
/var/log/messages:Jan  9 04:16:42 myserver kernel: named[12847]: segfault at
0000000c eip 001a7996 esp b7ec0db0 error 4
--
/var/log/messages-20071230-Dec 28 09:54:22 myserver named[1947]: client
10.1.1.30#34343: updating zone 'mydomain.com/IN': deleting rrset at
'dgarces.mydomain.com' A
/var/log/messages-20071230:Dec 28 09:54:22 myserver kernel: named[1948]:
segfault at 00000034 eip 001a7996 esp b7e4edb0 error 4
--
/var/log/messages-20071230-Dec 28 12:56:08 myserver named[7758]: client
10.1.0.210#54302: updating zone 'mydomain.com/IN': deleting rrset at
'wanderso.mydomain.com' A
/var/log/messages-20071230:Dec 28 12:56:08 myserver kernel: named[7759]:
segfault at 0000000c eip 001a7996 esp b7e65db0 error 4


As you can see, the eip and esp numbers are the same as in the segfault reported
above. I'm running bind-9.5.0-20.b1.fc8.

Let me know if there's any more info or help I could provide to help solve this.

Comment 20 Adam Tkac 2008-01-11 16:04:24 UTC

(In reply to comment #19)
Yes, looks that named always segfault on same point. But I cannot find where
exactly. I disassembled named and all libraries and I can't find where exactly
eip points :( Looks like this approach is unusable for PIE binaries. Also from
BIND CVS it looks that upstream doing some work to get named able to generate
core files. I'm going to check situation with newest codebase. I can do nothing
with this before I got core file. Reading code without any knowledge where
exactly problem is doesn't make sence

Comment 21 Adam Tkac 2008-01-21 11:38:54 UTC

I dug how get core dump and this is solution:
 - change owner/permissions of /var/named directory to root:named 775
 - allow SELinux write there (simpliest way is Permissive mode)
 - nothing else is needed

Best should be get core file from binaries whose are compiled with -O0. If it is
possible please use those:
http://koji.fedoraproject.org/koji/taskinfo?taskID=361783 . Core file may be
very large especially on production servers so you could use max-cache-size
option (for example 32MB) to reduce number or cached records.

Thanks, Adam

Comment 22 Charles R. Anderson 2008-01-29 18:52:53 UTC

Using bind-9.5.0-23.b1.fc8, I changed the CWD to /var/named/data and got this
coredump finally:

Core was generated by `/usr/sbin/named -u named'.
Program terminated with signal 11, Segmentation fault.
#0  0x001a7996 in decrement_reference (rbtdb=0xb27193a8, node=0xb32da3f8,
least_serial=23075, nlock=isc_rwlocktype_write, tlock=isc_rwlocktype_none) at
rbtdb.c:1638
1638                    ISC_LIST_APPEND(rbtdb->deadnodes[bucket], node, deadlink);
(gdb) where
#0  0x001a7996 in decrement_reference (rbtdb=0xb27193a8, node=0xb32da3f8,
least_serial=23075, nlock=isc_rwlocktype_write, tlock=isc_rwlocktype_none) at
rbtdb.c:1638
#1  0x001a9e7f in closeversion (db=0xb27193a8, versionp=0xb0c59044,
commit=isc_boolean_false) at rbtdb.c:1927
#2  0x00168353 in dns_db_closeversion (db=0xb27193a8, versionp=0xb0c59044,
commit=isc_boolean_false) at db.c:428
#3  0x0018c0aa in dns_dumpctx_detach (dctxp=0xb610a328) at masterdump.c:1024
#4  0x0018c92b in dump_quantum (task=0xb7f24070, event=0x0) at masterdump.c:1158
#5  0x00446a32 in run (uap=0xb7f12008) at task.c:874
#6  0x0048b50b in start_thread () from /lib/libpthread-2.7.so
#7  0x00572b2e in clone () from /lib/libc-2.7.so

Comment 23 Adam Tkac 2008-02-04 13:17:50 UTC

Patch for this issue is commited in upstream CVS

Comment 24 Fedora Update System 2008-02-04 13:38:33 UTC

bind-9.5.0-24.b1.fc8 has been submitted as an update for Fedora 8

Comment 25 Adam Tkac 2008-02-06 11:09:06 UTC

Patch from upstream is still not correct, delaying update

Comment 26 Fedora Update System 2008-02-12 14:12:12 UTC

bind-9.5.0-25.b2.fc8 has been submitted as an update for Fedora 8

Comment 27 Fedora Update System 2008-02-13 05:16:09 UTC

bind-9.5.0-25.b2.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update bind'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-1622

Comment 28 Charles R. Anderson 2008-02-25 16:30:10 UTC

bind-9.5.0-25.b2.fc8 works much better for me.

Comment 29 Scott Baker 2008-02-26 16:14:11 UTC

I've had MULTIPLE crashes since upgrading to bind-9.5.0-25.b2.fc8. 

:/usr/sbin/named -v
BIND 9.5.0b2

:rpm -q bind
bind-9.5.0-25.b2.fc8

I was running a patched version compiled from source and had no problems until
switching back to the RPM.

Comment 30 Scott Baker 2008-02-26 17:12:14 UTC

I just re-confirmed that the version I posted above is still crashing. In fact
it's crashing MORE often than the previous RPM. I'm going to have to switch back
to my source compiled version for stability.

Comment 31 Adam Tkac 2008-02-27 11:22:12 UTC

(In reply to comment #30)
> I just re-confirmed that the version I posted above is still crashing. In fact
> it's crashing MORE often than the previous RPM. I'm going to have to switch back
> to my source compiled version for stability.

Would it be possible provide more info, please? In which situation named
crashes? Is this related to zone transfer? Would it be possible to get core file
(or backtrace at least)? Thanks

Comment 32 Scott Baker 2008-02-27 23:34:15 UTC

Sorry I should have been more explicit. Yes it occurs during (or immediately
after) a zone transfer. I have a specific zone the gets transferred once every
couple of seconds (DDNS updates), and eventually causes the crash.

I can post the core file somewhere if you think it will help. It's about 100
megs though. Running gdb on it I don't get much info except that it was a segfault.



Thread 7 (process 24285):
#0  0x0012d410 in __kernel_vsyscall ()
No symbol table info available.
#1  0x004c9c07 in sigsuspend () from /lib/libc.so.6
No symbol table info available.
#2  0x0044beb2 in isc_app_run () from /usr/lib/libisc.so.41
No symbol table info available.
#3  0xb7f2026a in main () from /usr/sbin/named
No symbol table info available.

Lots of "No symbol table info available" in the gdb output. Willing to help with
whatever, just advise me what the next step is.

Comment 33 Charles R. Anderson 2008-02-28 17:07:52 UTC

I have to retract my previous statements about this version.  I've had a couple
crashes recently.  Here is a backtrace:

(gdb) thread apply all bt

Thread 7 (process 18856):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x004c9c07 in sigsuspend () from /lib/libc.so.6
#2  0x0044beb2 in isc_app_run () at app.c:533
#3  0xb7f8d26a in main (argc=0, argv=0xbfe0cb84) at ./main.c:878

Thread 6 (process 18857):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x004905d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x00447bef in run (uap=0xb7ef7008) at task.c:822
#3  0x0048c50b in start_thread () from /lib/libpthread.so.0
#4  0x00573b2e in clone () from /lib/libc.so.6

Thread 5 (process 18858):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x00493119 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x0048e89e in _L_lock_88 () from /lib/libpthread.so.0
#3  0x0048e3aa in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x00458c25 in internal_recv (me=0xb24fe688, ev=0xb1dd9b70) at socket.c:2240
#5  0x00447d72 in run (uap=0xb7ef7008) at task.c:874
#6  0x0048c50b in start_thread () from /lib/libpthread.so.0
#7  0x00573b2e in clone () from /lib/libc.so.6
---Type <return> to continue, or q <return> to quit--- 

Thread 4 (process 18859):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x004905d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x00447bef in run (uap=0xb7ef7008) at task.c:822
#3  0x0048c50b in start_thread () from /lib/libpthread.so.0
#4  0x00573b2e in clone () from /lib/libc.so.6

Thread 3 (process 18860):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x004905d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x00447bef in run (uap=0xb7ef7008) at task.c:822
#3  0x0048c50b in start_thread () from /lib/libpthread.so.0
#4  0x00573b2e in clone () from /lib/libc.so.6

Thread 2 (process 18861):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x00490902 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#2  0x0045de0c in isc_condition_waituntil (c=0xb7ef8040, m=0xb7ef8010, 
    t=0xb7ef8038) at condition.c:59
#3  0x0044a510 in run (uap=0xb7ef8008) at timer.c:719
#4  0x0048c50b in start_thread () from /lib/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#5  0x00573b2e in clone () from /lib/libc.so.6

Thread 1 (process 18862):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x004c9690 in raise () from /lib/libc.so.6
#2  0x004caf91 in abort () from /lib/libc.so.6
#3  0xb7f8ca8c in assertion_failed (file=0x465b3e "socket.c", line=1839, 
    type=isc_assertiontype_insist, cond=0x465b87 "!sock->pending_recv")
    at ./main.c:159
#4  0x0045bf20 in watcher (uap=0xb8210238) at socket.c:1839
#5  0x0048c50b in start_thread () from /lib/libpthread.so.0
#6  0x00573b2e in clone () from /lib/libc.so.6

Comment 34 Adam Tkac 2008-03-10 11:12:15 UTC

(In reply to comment #32)
> Lots of "No symbol table info available" in the gdb output. Willing to help with
> whatever, just advise me what the next step is.

You have to install bind-debuginfo package which contains debug symbols (make
sure that you have same bind and bind-debuginfo). With debuginfo installed you
can get usable debugging information.

Comment 35 Adam Tkac 2008-03-10 11:15:21 UTC

(In reply to comment #33)
> I have to retract my previous statements about this version.  I've had a couple
> crashes recently.  Here is a backtrace:

Is backtrace from master or slave and could it be related to zone transfers? It
seems like different problem than 25.b2 fixes for me.

Comment 36 Charles R. Anderson 2008-03-10 16:51:08 UTC

It is from the master.  I don't know what it might be related to.  I don't see
the same "segfault" log entry in /var/log/messages at the time the core file was
generated, but I do see these messages at the time of the core file generation:

Feb 28 11:07:03 master named[18856]: client w.x.y.z#62589: view global: signer
"foo.dhcp" approved
Feb 28 11:07:03 master named[18856]: zone foo.domain/IN/global: sending notifies
(serial 2010784390)
Feb 28 11:07:03 master named[18856]: client w.x.y.x#62590: view global: signer
"x.w.in-addr.arpa.dhcp" approved
Feb 28 11:07:03 master named[18856]: zone x.w.IN-ADDR.ARPA/IN/global: sending
notifies (serial 2010693532)
Feb 28 11:07:03 master named[18856]: socket.c:1839: INSIST(!sock->pending_recv)
failed
Feb 28 11:07:03 master named[18856]: exiting (due to assertion failure)

Comment 37 Adam Tkac 2008-04-17 12:17:48 UTC

(In reply to comment #36)
This is different issue than original segfault. Would it be possible test
http://koji.fedoraproject.org/koji/taskinfo?taskID=570357 please? Problem should
be fixed there. Thanks

Comment 38 Charles R. Anderson 2008-04-21 20:12:21 UTC

Ok, I've loaded that build and will be watching it over the next week to see if
it solves the problem.  Thanks.

Comment 39 Adam Tkac 2008-04-23 13:09:44 UTC

(In reply to comment #38)
Thank you for your help with testing

Comment 40 Fedora Update System 2008-04-28 14:57:11 UTC

bind-9.5.0-26.b3.fc8 has been submitted as an update for Fedora 8

Comment 41 Adam Tkac 2008-04-28 15:00:21 UTC

Patch for issue written in comment #36 seems right (one week without crash) so
let's put it to updates-testing.

Comment 42 Adam Tkac 2008-04-30 09:10:47 UTC

Hm, update system doesn't write message here so adding it manually:

bind-9.5.0-26.b3.fc8 has been pushed to the Fedora 8 testing repository.  If
problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update bind'.  You can provide feedback
for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-3402

Comment 43 Charles R. Anderson 2008-05-08 19:26:42 UTC

I've had no crashes since bind-9.5.0-25.3.b2.fc8, and bind-9.5.0-26.b3.fc8 has
been running fine as well since April 30.  +1 from me for pushing this to updates.

Comment 44 Fedora Update System 2008-05-10 13:50:03 UTC

bind-9.5.0-26.b3.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 45 Kapetanakis Giannis 2008-05-12 21:03:02 UTC

I updated today to bind-9.5.0-26.b3.fc8 and had 5 crashes:

May 12 14:40:11 server named[2665]: notify: info: client xxx.xxx.xxx.xxx#1035:
received notify for zone 'aaa.bb'
May 12 14:41:15 server named[2665]:last message repeated 4 times
May 12 14:42:16 server named[2665]:last message repeated 4 times
May 12 14:43:21 server named[2665]:last message repeated 3 times
May 12 14:44:32 server named[2665]:last message repeated 5 times
May 12 14:45:43 server named[2665]:last message repeated 4 times
May 12 14:46:41 server named[2665]:last message repeated 2 times
May 12 14:46:41 server kernel: named[2667]: segfault at 00000000 eip 0015b1c7
esp b74dbf50 error 4

Always like this. After the notify for the specific zone I get the
segfault. I run slave for this domain but I have plenty master zones.

I downgraded to bind-9.5.0-23.b1.fc8
(which was quite difficult for me to find it... :-/ )

I didn't get a segfault in 5 other servers I run bind,
but I downgraded there as well just in case.

Version is Fedora Core 8 2.6.24.5-85.fc8 i686
2 x Pentium(R) 4 3.40GHz

I don't have a core dump.

Regards,
Giannis

Comment 46 Kapetanakis Giannis 2008-05-12 21:09:04 UTC

just for the record remote server was 
VERSION.BIND    text = "8.2.2-P5"

:(( It's quite old and I cannot do anything about it...

Comment 47 Adam Tkac 2008-05-13 07:22:20 UTC

Without core dump I can do nothing. Would it be possible get it, please?
(backtrace is sufficient)

- make /var/named writable by named user (chmod/chown + SELinux set correctly).
You should not see message "the working directory is not writable" in log
- wait for crash
- run "gdb /usr/sbin/named <core_file>" (make sure you have bind-debuginfo
installed)
- attach output when you type "t a a bt"

Thanks

Comment 48 Mark 2008-05-21 19:01:51 UTC

I have the same issue with Fedora 8, bind-9.5.0-26.b3.fc8, and bind-9.5.0-
29.b2.fc8 (Fedora 9 srpm built on Fedora 8).

I have the following installed:
bind-9.5.0-29.b2.fc8
bind-chroot-9.5.0-29.b2.fc8
bind-debuginfo-9.5.0-29.b2.fc8
bind-devel-9.5.0-29.b2.fc8
bind-libs-9.5.0-29.b2.fc8
bind-utils-9.5.0-29.b2.fc8

Log Message:
kernel: named[29341]: segfault at 00000000 eip 0015ae27 esp b7f280c0 error 4

GDB output on corefile:
GNU gdb Red Hat Linux (6.6-45.fc8rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Loaded symbols for /usr/sbin/named
Reading symbols from /usr/lib/liblwres.so.40.0.0...Reading symbols from /usr/
lib/debug/usr/lib/liblwres.so.40.0.0.debug...done.
done.
Loaded symbols for /usr/lib/liblwres.so.40.0.0
Reading symbols from /usr/lib/libdns.so.41.0.0...Reading symbols from /usr/lib/
debug/usr/lib/libdns.so.41.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libdns.so.41.0.0
Reading symbols from /usr/lib/libbind9.so.40.0.1...Reading symbols from /usr/
lib/debug/usr/lib/libbind9.so.40.0.1.debug...done.
done.
Loaded symbols for /usr/lib/libbind9.so.40.0.1
Reading symbols from /usr/lib/libisccfg.so.40.0.1...Reading symbols from /usr/
lib/debug/usr/lib/libisccfg.so.40.0.1.debug...done.
done.
Loaded symbols for /usr/lib/libisccfg.so.40.0.1
Reading symbols from /usr/lib/libgssapi_krb5.so.2...done.
Loaded symbols for /usr/lib/libgssapi_krb5.so.2
Reading symbols from /lib/libcrypto.so.6...done.
Loaded symbols for /lib/libcrypto.so.6
Reading symbols from /usr/lib/libisccc.so.40.0.0...Reading symbols from /usr/
lib/debug/usr/lib/libisccc.so.40.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libisccc.so.40.0.0
Reading symbols from /usr/lib/libisc.so.41.0.0...Reading symbols from /usr/lib/
debug/usr/lib/libisc.so.41.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libisc.so.41.0.0
Reading symbols from /lib/libcap.so.1...done.
Loaded symbols for /lib/libcap.so.1
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /usr/lib/libxml2.so.2...done.
Loaded symbols for /usr/lib/libxml2.so.2
Reading symbols from /lib/libz.so.1...done.
Loaded symbols for /lib/libz.so.1
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /usr/lib/libkrb5.so.3...done.
Loaded symbols for /usr/lib/libkrb5.so.3
Reading symbols from /usr/lib/libk5crypto.so.3...done.
Loaded symbols for /usr/lib/libk5crypto.so.3
Reading symbols from /lib/libcom_err.so.2...done.
Loaded symbols for /lib/libcom_err.so.2
Reading symbols from /usr/lib/libkrb5support.so.0...done.
Loaded symbols for /usr/lib/libkrb5support.so.0
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libkeyutils.so.1...done.
Loaded symbols for /lib/libkeyutils.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libselinux.so.1...done.
Loaded symbols for /lib/libselinux.so.1
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Core was generated by `/usr/sbin/named -u named -t /var/named/chroot'.
Program terminated with signal 11, Segmentation fault.
#0  0x0015ae27 in dns_acl_match (reqaddr=0xb7e421d4, reqsigner=0x0, 
acl=0xb6025248, env=0xb7e500a8, match=0xb7e421d0,
    matchelt=0x0) at acl.c:226
226                     if (*(isc_boolean_t *) node->data[ISC_IS6(family)] == 
ISC_TRUE)
(gdb) quit

Comment 49 Mark 2008-05-21 22:46:11 UTC

And the backtrace (oops):
(gdb) t a a bt

Thread 4 (process 31153):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x0063fc07 in sigsuspend () from /lib/libc.so.6
#2  0x0044c962 in isc_app_run () at app.c:533
#3  0xb7f9496a in main (argc=0, argv=0xbfa3e2c4) at ./main.c:878

Thread 3 (process 31155):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x00494902 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/
libpthread.so.0
#2  0x0045eddc in isc_condition_waituntil (c=0xb7f1a040, m=0xb7f1a010, 
    t=0xb7f1a038) at condition.c:59
#3  0x0044afc0 in run (uap=0xb7f1a008) at timer.c:719
#4  0x0049050b in start_thread () from /lib/libpthread.so.0
#5  0x006e9b2e in clone () from /lib/libc.so.6

Thread 2 (process 31156):
#0  0x0012d402 in __kernel_vsyscall ()
#1  0x006e25f1 in select () from /lib/libc.so.6
#2  0x0045c42d in watcher (uap=0xb8c10128) at socket.c:2538
#3  0x0049050b in start_thread () from /lib/libpthread.so.0
#4  0x006e9b2e in clone () from /lib/libc.so.6

Thread 1 (process 31154):
#0  0x0015ae27 in dns_acl_match (reqaddr=0xb7f101d4, reqsigner=0x0, 
    acl=0xb60f3248, env=0xb7f1e0a8, match=0xb7f101d0, matchelt=0x0) at acl.c:226
#1  0xb7f84f39 in ns_client_checkaclsilent (client=0xb5e04008, sockaddr=0x0, 
    acl=0xb60f3248, default_allow=isc_boolean_true) at client.c:2499
#2  0xb7f95ed7 in query_validatezonedb (client=0xb5e04008, name=0xb5e0b008, 
    qtype=15, options=0, zone=0xb8daa4e0, db=0xb5ac71d8, versionp=0xb7f10a0c)
    at query.c:643
#3  0xb7f9641e in query_getzonedb (client=0xb5e04008, name=0xb5e0b008, 
qtype=15, 
    options=0, zonep=0xb7f10a08, dbp=0xb7f10a44, versionp=0xb7f10a0c)
    at query.c:742
#4  0xb7f96487 in query_getdb (client=0xb5e04008, name=0xb5e0b008, 
    qtype=<value optimized out>, options=0, zonep=0xb7f10a08, dbp=0xb7f10a44, 
    versionp=0xb7f10a0c, is_zonep=0xb7f10a18) at query.c:940
#5  0xb7f9a469 in query_find (client=0xb5e04008, event=0x0, qtype=15)
    at query.c:3456
#6  0xb7f9d9aa in ns_query_start (client=0xb5e04008) at query.c:4617
#7  0xb7f8a80f in client_request (task=0xb7f2aa98, event=0xb5e080f8)
    at client.c:1783
#8  0x00448822 in run (uap=0xb7f18008) at task.c:874
Missing separate debuginfos, use: debuginfo-install e2fsprogs.i386 glibc.i686 
keyutils.i386 krb5.i386 libcap.i386 libselinux.i386 libxml2.i386 openssl.i686 
zlib.i386
---Type <return> to continue, or q <return> to quit---
#9  0x0049050b in start_thread () from /lib/libpthread.so.0
#10 0x006e9b2e in clone () from /lib/libc.so.6

Comment 50 Adam Tkac 2008-05-22 08:53:17 UTC

Problem with 9.5.0-26.b3 is tracked as bug #446848. I added all impacted persons
to CC of that bug.