Description of problem: Intermittently, bind crashes with the following in the log file: -------------------------------- Nov 20 06:48:03 beauty named[14897]: socket.c:1674: INSIST(!sock->pending_send) failed Nov 20 06:48:03 beauty named[14897]: exiting (due to assertion failure) -------------------------------- Version-Release number of selected component (if applicable): 9.3.3-6.fc6 How reproducible: Intermittent. Steps to Reproduce: 1. Run bind. 2. Wait for it to crash? Actual results: Bind crashes intermittently. Expected results: Shouldn't crash. Additional info: This is bind in chroot environment under SELinux (targeted). It is running two views.
Just had another one. It happened right after I brought up an IPSec tunnel on this box. Somehow, I think previous crashes may have been around the same time. Not sure if it's related...
Could you please attach an output of named when it starts? from /var/log/messages.
Nov 21 20:50:36 beauty named[3775]: starting BIND 9.3.3rc3 -u named -t /var/name d/chroot Nov 21 20:50:36 beauty named[3775]: found 1 CPU, using 1 worker thread Nov 21 20:50:36 beauty named[3775]: loading configuration from '/etc/named.conf' Nov 21 20:50:36 beauty named[3775]: listening on IPv4 interface lo, 127.0.0.1#53 Nov 21 20:50:36 beauty named[3775]: listening on IPv4 interface lo:0, 203.171.74 .242#53 Nov 21 20:50:36 beauty named[3775]: listening on IPv4 interface bond0, 172.27.0. 1#53 Nov 21 20:50:36 beauty named[3775]: command channel listening on 127.0.0.1#953 Nov 21 20:50:36 beauty named[3775]: zone 0.in-addr.arpa/IN/internal: loaded seri al 42 Nov 21 20:50:36 beauty named[3775]: zone 0.0.127.in-addr.arpa/IN/internal: loade d serial 1997022700 Nov 21 20:50:36 beauty named[3775]: zone 27.172.in-addr.arpa/IN/internal: loaded serial 2006071500 Nov 21 20:50:36 beauty named[3775]: zone 255.in-addr.arpa/IN/internal: loaded se rial 42 Nov 21 20:50:36 beauty named[3775]: zone 0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 .0.0.0.0.0.0.0.0.0.0.0.ip6.arpa/IN/internal: loaded serial 1997022700 Nov 21 20:50:36 beauty named[3775]: zone rexursive.com/IN/internal: loaded seria l 2006083000 Nov 21 20:50:36 beauty named[3775]: zone dev/IN/internal: loaded serial 20060315 00 Nov 21 20:50:36 beauty named[3775]: zone localdomain/IN/internal: loaded serial 42 Nov 21 20:50:36 beauty named[3775]: zone localhost/IN/internal: loaded serial 42 Nov 21 20:50:36 beauty named[3775]: zone rexursive.com.au/IN/external: loaded se rial 2006083001 Nov 21 20:50:36 beauty named[3775]: zone rexursive.com/IN/external: loaded seria l 2006083001 Nov 21 20:50:36 beauty named[3775]: zone rexursive.net/IN/external: loaded seria l 2006083001 Nov 21 20:50:36 beauty named[3775]: zone rexursive.org/IN/external: loaded seria l 2006083001 Nov 21 20:50:36 beauty named[3775]: running Nov 21 20:50:36 beauty named[3775]: zone 27.172.in-addr.arpa/IN/internal: sendin g notifies (serial 2006071500) Nov 21 20:50:36 beauty named[3775]: zone rexursive.com/IN/internal: sending noti fies (serial 2006083000) Nov 21 20:50:36 beauty named[3775]: zone rexursive.org/IN/external: sending noti fies (serial 2006083001) Nov 21 20:50:36 beauty named[3775]: zone rexursive.com.au/IN/external: sending n otifies (serial 2006083001) Nov 21 20:50:36 beauty named[3775]: zone dev/IN/internal: sending notifies (seri al 2006031500) Nov 21 20:50:36 beauty named[3775]: zone rexursive.com/IN/external: sending noti fies (serial 2006083001) Nov 21 20:50:36 beauty named[3775]: zone rexursive.net/IN/external: sending noti fies (serial 2006083001) Nov 21 20:50:36 beauty named[3775]: client 172.27.0.1#32824: view internal: rece ived notify for zone '27.172.in-addr.arpa' Nov 21 20:50:37 beauty named[3775]: client 172.27.0.1#32824: view internal: rece ived notify for zone 'rexursive.com' Nov 21 20:50:37 beauty named[3775]: client 172.27.0.1#32824: view internal: rece ived notify for zone 'dev'
Had a few more crashes and it seems it is related somehow to bringing up that IPSec tunnel. More on the tunnel itself in bug #217306.
Could you please attach a log from named? I need to figure out what happened before the crash. You can add this line to /etc/sysconfig/named: OPTIONS="-d 10" and when bind crash attach /var/named/data/named.run here. Caution, this file can be quite big. btw. your bind configuration could be quite helpful, too.
Did the OPTIONS="-d 10", but that produced nothing. Maybe because all logging goes to syslog and debugging messages are ignored? Bind did crash in the middle of bringing up IPSec tunnel this morning. I also tried this: /usr/sbin/named -fg -d 10 -u named -t /var/named/chroot 2>/root/named.log I couldn't reproduce the problem any more when bringing tunnel down/up. I'll try again tomorrow. I'm attaching the config file.
Created attachment 142529 [details] Configuration file
A similar thing, seen before: http://bugzilla.mandriva.com/show_bug.cgi?id=8319 It is marked as FIXED, but reading the bug report would suggest that the bug isn't triggered because the users stopped running that setup.
I'll be away for about 10 days now, so I'll try to collect a bit more information when I come back.
Tried to replicate with -d 10 turned on. Even with *.debug going to /var/log/messages, there is nothing more in the system logs. Weird. There is no trace of suggested named.run file anywhere. Any other suggestions? And, I cannot replicate this at will. It appears to crash when the IPSec tunnel comes up for the first time that day. At that point named has been running for a while (overnight).
Oh, and forgot to mention - named did crash once more: -------------------------------------------- Dec 13 08:19:06 beauty named[9884]: socket.c:1674: INSIST(!sock->pending_send) failed Dec 13 08:19:06 beauty named[9884]: exiting (due to assertion failure) -------------------------------------------- This was again in the middle of IPSec negotiation.
Another report of the same thing, also IPSec related: http://forums.sw-soft.com/showthread.php?s=33fa21348c331c7b14175e310c5bcc43&threadid=21484&goto=nextnewest
Okay, thanks, I'll check it.
I'm now running named in foreground mode, like this: /usr/sbin/named -fg -d 10 -u named -t /var/named/chroot Already have 95 MB of logs. When it crashes next (unless it's some kind condition that doesn't get hit when logging is increased), we should know what happened. This morning when I brought the tunnel up, it continued working. Maybe tomorrow...
Created attachment 143691 [details] Last 10000 lines of the log before crash The whole log file is 169 MB and if you need it, let me know.
Ditto bind-9.3.4-2.fc6.
I'm not sure if this is bind's problem. Crash is only reproducible during IPSec negotiation? If yes, it seems like problem on IPSec layer, not in bind. -A-
You are correct. Crash is only reproducible during IPSec negotition and then intermittently. I'm seeing it still with the latest FC6 kernel (2.6.20-1.2944.fc6). May as well be a kernel bug.
Since switching this box to Fedora 7, I have not had a crash yet. I'll leave it for another month or so and if there are no crashes, I'll close.
Interesting. I read changelog and there's nothing related to this bug. Looks one of the mysterious fixes... Adam
It is also possible that: - this was really a kernel bug that got fixed - this bug is still there but other code has been reworked so it didn't get triggered yet (i.e. this bug was intermittent before - maybe it became even less frequent with the new code) Anyhow, I use this stuff almost every day and I'll report back if this failure happens again.
Aha, different error now, it would seem - a non fatal one, but still right in the middle of IPSec negotiation: -------------------------------------------- Jun 13 07:31:45 beauty racoon: INFO: initiate new phase 2 negotiation: <source_ip>[500]<=><dest_ip>[500] Jun 13 07:32:14 beauty named[1880]: socket.c:1178: unexpected error: Jun 13 07:32:14 beauty named[1880]: internal_send: 10.0.0.35#53: No such process Jun 13 07:32:14 beauty named[1880]: errno2result.c:111: unexpected error: Jun 13 07:32:14 beauty named[1880]: unable to convert errno to isc_result: 3: No such process Jun 13 07:32:15 beauty racoon: INFO: IPsec-SA expired: ESP/Tunnel <dest_ip>[0 ]-><source_ip>[0] spi=[...] -------------------------------------------- So, maybe the error handling got a bit smarter after all?
Could you please reproduce this problem with http://people.redhat.com/atkac/test_srpms/bind-9.4.1-5.1.fc7.src.rpm and attach output from log? It could print what error occurs. Thanks, Adam
I can't see this any more: bind-9.4.1-6.1.fc7, kernel-2.6.21-1.3228.fc7.
Chm, named doesn't have any fix related to this one. Closing like one of interesting fixes... Please reopen if occurs again Adam
Chalk one up for cosmic rays :-) PS. This may have been a kernel issue that got fixed somewhere along the way.