Bug 507429

Summary: named crashes with an assertion failure (fctx->queries->head is NULL)
Product: [Fedora] Fedora Reporter: Andrew McNabb <amcnabb>
Component: bindAssignee: Adam Tkac <atkac>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: low    
Version: 11CC: atkac, carl.robben, fabrice, gkramm, hcamp, henrik, marcus, neil, nick, ovasik, p.mayers, pwouters, steve
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: bind-9.6.2-3.P1.fc12 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 576906 (view as bug list) Environment:
Last Closed: 2010-04-03 04:35:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 516998, 576906    
Attachments:
Description Flags
backtrace from gdb
none
bind-9.6.2-1.fc12.x86_64 backtrace
none
bind-9.6.2-1.fc12.x86_64 debug log
none
bind-9.6.2-1.fc12.x86_64 messages
none
bind-9.6.2-1.fc12.x86_64 named.run none

Description Andrew McNabb 2009-06-22 18:14:00 UTC
Named crashes with an assertion failure.  Here are the most recent messages in /var/log/messages.

Jun 21 17:55:22 guru named[10183]: chase DS servers resolving 'mozilla.org/DS/IN': 128.187.80.20#53
Jun 21 17:55:22 guru named[10183]: not insecure resolving 'org/NS/IN': 128.187.80.20#53
Jun 21 17:55:32 guru named[10183]: network unreachable resolving 'org/DNSKEY/IN': 2001:500:48::1#53
Jun 21 17:55:52 guru named[10183]: resolver.c:2927: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jun 21 17:55:52 guru named[10183]: exiting (due to assertion failure)

Please let me know if there's any other information that would be useful.

Comment 1 Adam Tkac 2009-06-23 11:45:39 UTC
Would it be possible to specify which version of bind crashed, please? (you can find it in the /var/log/messages or from `named -v` or `rpm -q bind` output)

Comment 2 Andrew McNabb 2009-06-24 18:20:43 UTC
bind-9.6.1-0.4.rc1.fc11.x86_64

Comment 3 Nick P G Smith 2009-06-24 20:48:53 UTC
I am also getting the same problem (crashes regularly every few mins after a bind restart) after a yum update to bind-9.6.1-0.4.rc1.fc11.i586

Jun 24 13:37:58 pooh named[3345]: not insecure resolving 'org/NS/IN': 87.194.0.66#53
Jun 24 13:37:58 pooh named[3345]: success resolving 'tor1.ns.ahbl.org/AAAA' (in 'ahbl.org'?) after disabling EDNS
Jun 24 13:37:58 pooh named[3345]: success resolving 'org/NS' (in 'org'?) after disabling EDNS
Jun 24 13:37:58 pooh named[3345]: resolver.c:2927: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jun 24 13:37:58 pooh named[3345]: exiting (due to assertion failure)

Comment 4 Nick P G Smith 2009-06-27 23:30:28 UTC
Update: rolling back to the original F11 RPMs didn't help. So I updated again, and set:

dnssec-enable no;
dnssec-validation no;

in the named.conf. No crashes for a couple of days now, so looks much more stable.

Comment 5 Adam Tkac 2009-06-29 09:52:53 UTC
Would it be possible to attach backtrace, please? You can obtain it this way:

1. $ chmod g+w /var/named/ # Temporary allow /var/named/ to be writable by named user
2. wait for crash
3. $ debuginfo-install bind
4. gdb /usr/sbin/named /var/named/core.*
5. in gdb command line run "t a a bt full"
6. attach output here
7. chmod g-w /var/named/

Thanks.

Comment 6 Andrew McNabb 2009-06-29 15:39:15 UTC
I had another crash this morning before I had a chance to do the chmod you recommended.  This time, there was a different assertion failure:

Jun 29 09:03:01 guru named[8529]: resolver.c:2925: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed

Should I open a separate ticket for this one?

Comment 7 Adam Tkac 2009-06-29 15:48:29 UTC
(In reply to comment #6)
> I had another crash this morning before I had a chance to do the chmod you
> recommended.  This time, there was a different assertion failure:
> 
> Jun 29 09:03:01 guru named[8529]: resolver.c:2925: REQUIRE(fctx->state ==
> fetchstate_done || fctx->state == fetchstate_init) failed
> 
> Should I open a separate ticket for this one?  

I don't think so, in my opinion both crashes are related.

Comment 8 Andrew McNabb 2009-06-29 19:09:13 UTC
Created attachment 349854 [details]
backtrace from gdb

I am attaching a backtrace corresponding to the following assertion error:

resolver.c:2925: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed

Comment 9 Adam Tkac 2009-07-02 14:35:43 UTC
(In reply to comment #8)
> Created an attachment (id=349854) [details]
> backtrace from gdb
> 
> I am attaching a backtrace corresponding to the following assertion error:
> 
> resolver.c:2925: REQUIRE(fctx->state == fetchstate_done || fctx->state ==
> fetchstate_init) failed  

Thanks, reported to upstream developers.

Comment 10 Neil Neely 2009-07-19 21:38:11 UTC
I have just hit an error that I believe is the same.  These are from fully patched and current CentOS 5.3 servers both running bind version 9.3.4-P1, specifically the package:
bind.i386             30:9.3.4-10.P1.el5 

Jul 19 14:11:10 ns101 named[11594]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed

Jul 19 14:08:50 ns201 named[22082]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed

Assuming this isn't a CentOS introduced bug this seems likely to be impacting RHEL 5.3 as well.

Comment 11 Neil Neely 2009-07-20 15:01:24 UTC
Additional Crashes:

Jul 20 02:32:28 ns101 named[26185]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed

Jul 20 01:16:47 ns201 named[28201]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jul 20 02:05:15 ns201 named[23728]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jul 20 03:36:56 ns201 named[25225]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed

Is there a poisoned query that can cause this version to crash out in the wild?  These installations have been rock solid for quite some time.


Haven't trapped a core file yet, will look at doing so.

Comment 12 Adam Tkac 2010-01-11 13:25:24 UTC
It seems this issue is already fixed in the latest version of bind (bind-9.6.1-7.P2.fc11). If isn't, please reopen this bug.

Comment 13 Steven Pritchard 2010-02-05 00:55:55 UTC
I'm getting "resolver.c:3032: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed" on a heavily loaded system recently updated to bind-9.6.1-9.P3.fc11.x86_64.

Comment 14 Carl Robben 2010-02-09 16:49:39 UTC
Encountered this bug this morning, bind version is bind-9.6.1-16.P3.fc12.x86_64, running on kernel 2.6.31.12-174.2.3.fc12.x86_64

Comment 15 gkramm 2010-02-15 18:21:50 UTC
I have also encountered this bug this morning

running:
kernel: 2.6.31.12-174.2.3.fc12.x86_64
bind: bind-9.6.1-16.P3.fc12.x86_64

Comment 16 Phil Mayers 2010-02-17 09:29:57 UTC
We locally re-compile the FC12 RPM for RHEL5, and I just saw this bug. The RPM was the F12 upstream:

bind-9.6.1-15.P3

.el5, obviously.

We have 3 other identical servers that didn't crash, and these servers have been running unchanged under this configuration for weeks, under heavy load.

Is there an upstream bug?

Comment 17 Phil Mayers 2010-02-17 09:38:04 UTC
Sorry, I should add that I saw the:

17-Feb-2010 00:39:44.880 general: resolver.c:3032: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed
17-Feb-2010 00:39:44.881 general: exiting (due to assertion failure)

...error

Comment 18 fabris 2010-03-12 12:47:51 UTC
Hi all

Same to me :

Mar 12 00:25:34 named[1125]: resolver.c:3032: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed
Mar 12 00:25:34 named[1125]: exiting (due to assertion failure)

maybe usefull other information :

starting BIND 9.6.1-P3-RedHat-9.6.1-10.P3.fc11 -u named
named[18629]: built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static' '--disable-openssl-version-check' '--with-dlz-ldap=yes' '--with-dlz-postgres=yes' '--with-dlz-mysql=yes' '--with-dlz-filesystem=yes' '--with-gssapi=yes' '--disable-isc-spnego' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS= -DDIG_SIGCHASE'


Hope this help solve this problem
regards,
Fabrice

Comment 19 Henrik Nordström 2010-03-21 21:46:19 UTC
Same fetchstate_done problem in F12 bind-9.6.2-1.fc12.x86_64

Last messages:

Mar 21 17:38:47 localhost named[6046]: not insecure resolving '130.in-addr.arpa/DNSKEY/IN': 208.79.99.8#53
Mar 21 17:38:50 localhost named[6046]: network unreachable resolving '130.in-addr.arpa/DNSKEY/IN': 2001:500:14:6050:ad::1#53
Mar 21 17:38:50 localhost named[6046]: success resolving '130.in-addr.arpa/NS' (in '130.in-addr.arpa'?) after reducing the advertised EDNS UDP packet size to 512 octets
Mar 21 17:38:50 localhost named[6046]: resolver.c:3059: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed

Comment 20 Henrik Nordström 2010-03-21 22:34:54 UTC
Addendum to the instructions on enabling core dumps. On F12 selinux by default blocks the core dump even if directory made writeable.

so in addition to chmod yu also need to run

  setsebool named_write_master_zones 1

Backtrace: (bind-9.6.2-1.fc12.x86_64, Fedora 12 x86_64 fully up to date)

#0  0x00007f9813c126c5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = 0
        pid = <value optimized out>
        selftid = <value optimized out>
#1  0x00007f9813c13ea5 in abort () at abort.c:92
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7f98163c52a0 <ns_g_lctx>, sa_sigaction = 0x7f98163c52a0 <ns_g_lctx>}, sa_mask = {__val = {140291175560674, 140291184808624, 3059, 
              140291175554704, 140291159349292, 18446744069414584320, 0, 140291182626544, 140291057735952, 206158430248, 140291057736176, 140291057735984, 140291057735984, 
              206158430248, 140291057736208, 140291057736016}}, sa_flags = 347592748, sa_restorer = 0x7f9800000000}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f9816177fdc in assertion_failed (file=0x7f9815af35e2 "resolver.c", line=3059, type=<value optimized out>, 
    cond=0x7f9815af1e90 "fctx->state == fetchstate_done || fctx->state == fetchstate_init") at ./main.c:162
No locals.
#3  0x00007f9815a7aa94 in fctx_destroy (fctx=0x7f9807136010) at resolver.c:3058
        res = <value optimized out>
        bucketnum = <value optimized out>
        sa = <value optimized out>
        next_sa = <value optimized out>
#4  0x00007f9815a85f48 in resume_dslookup (task=<value optimized out>, event=0x0) at resolver.c:5851
        fevent = <value optimized out>
        res = 0x7f9816104180
        fctx = 0x7f9807136010
        result = <value optimized out>
        bucket_empty = isc_boolean_false
        locked = isc_boolean_false
        bucketnum = <value optimized out>
        nameservers = {magic = 1145983826, methods = 0x0, link = {prev = 0xffffffffffffffff, next = 0xffffffffffffffff}, rdclass = 0, type = 0, ttl = 0, trust = 0, covers = 0, 
          attributes = 0, count = 4294967295, resign = 0, private1 = 0x0, private2 = 0x0, private3 = 0x0, privateuint4 = 0, private5 = 0x0, private6 = 0x0, private7 = 0x7f980ea95c70}
        fixed = {name = {magic = 0, ndata = 0x7f9813c75a0a "=\001\360\377\377s\005H\203\304\b\303H\213\r\203\325-", length = 245980064, labels = 32664, attributes = 347757119, 
            offsets = 0x4ba69d25 <Address 0x4ba69d25 out of bounds>, buffer = 0x7703e, link = {prev = 0x7f9807f23b50, next = 0x7f9814b92690}, list = {head = 0x7f9815af365b, 
              tail = 0x7f9807956410}}, 
          offsets = "\001\000\000\000\000\000\000\000\224E\267\024\230\177\000\000\004\000\000\000\002\000\000\000\020p\020\026\230\177\000\000\002\000\000\000\000\000\000\000\020`\020\026\230\177\000\000\002\000\000\000\000\000\000\000\021\017\271\024\230\177\000\000\370E\360\a\230\177\000\000\220\\\251\016\230\177\000\000\000\000\000\000\000\000\000\000\033\021\271\024\230\177\000\000\020d\225\a\230\177", '\000' <repeats 18 times>, "\001\000\000\000\000\000\000", buffer = {magic = 245980312, base = 0x7f9815a7d1af, length = 127238168, 
            used = 32664, current = 363804733, active = 32664, link = {prev = 0x1d0e72304ba69d25, next = 0x7f9807956410}, mctx = 0x1}, 
          data = "<\377\267\024\230\177\000\000\000\000\000\000\000\000\000\000P\342x\027\230\177\000\000H\234\360\a\230\177\000\000P\342x\027\230\177\000\000H\234\360\a\230\177\000\000\270\000\000\000\000\000\000\000\255i\272\024\230\177\000\000\214\017\270\024\230\177\000\000h]\251\016\230\177\000\000H\234\360\a\230\177\000\000\300\355\321\025\230\177\000\000H\234\360\a\230\177\000\000\060\275\362\a\230\177\000\000\354\065\267\024\230\177\000\000\000\000\000\000\000\000\000\000\020d\225\a\230\177\000\000\000\000\000\000\000\000\000\000KB\250\025\230\177\000\000\000\000\000\000\000\000\000\000m;\257\025\230\177\000\000\200>\360\a\230\177\000\000\210]\251\016\230\177\000\000\200>\360\a\230\177\000\000\240g\350\a\230\177\000\000p]\251\016\230\177\000\000D\367\377\324\377\177\000\000\260]\251\016\230\177\000\000\000\000\000\000\000\000\000\000\220]\251\016\230\177\000\000\354\367\377\324\377\177\000\000`F\360\a\230\177"}
        domain = <value optimized out>
#5  0x00007f9814b9086c in dispatch (uap=0x7f9816105010) at task.c:862
        dispatch_count = 0
        done = isc_boolean_false
        finished = isc_boolean_false
        requeue = isc_boolean_false
        event = 0x7f9807f2bd30
        task = <value optimized out>
#6  run (uap=0x7f9816105010) at task.c:1005
        manager = 0x7f9816105010
#7  0x00007f9814749a3a in start_thread (arg=<value optimized out>) at pthread_create.c:297
        __res = <value optimized out>
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140291057739536, 928468285281373604, 140736765441968, 0, 0, 3, -877487593660772956, -877432177704727132}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        freesize = <value optimized out>
#8  0x00007f9813cbe67d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
No locals.

Comment 21 Henrik Nordström 2010-03-21 23:41:53 UTC
Created attachment 401612 [details]
bind-9.6.2-1.fc12.x86_64 backtrace

Full backtrace of all threads attached.

Comment 22 Adam Tkac 2010-03-22 15:16:29 UTC
(In reply to comment #21)
> Created an attachment (id=401612) [details]
> bind-9.6.2-1.fc12.x86_64 backtrace
> 
> Full backtrace of all threads attached.    

Thanks for it. Unfortunately I'm still not able to find circumstaces when assertion is triggered. Would it be possible to get more debug information this way, please?

1. stop named
2. rm -f /var/named/data/named.debug*
3. add following to your named.conf (to logging statement):

logging {
        channel debug_channel {
                file "/var/named/data/named.debug" versions 1 size 2M;
                severity debug 3;
        };
        category resolver { debug_channel; };
        category dnssec { debug_channel; };
};

4. start named and wait till it crashes
5. attach named.debug and named.debug.0 files.

Thank you.

Comment 23 Henrik Nordström 2010-03-22 20:20:26 UTC
Created attachment 401863 [details]
bind-9.6.2-1.fc12.x86_64 debug log

Comment 24 Henrik Nordström 2010-03-22 20:20:56 UTC
Created attachment 401865 [details]
bind-9.6.2-1.fc12.x86_64 messages

Comment 25 Henrik Nordström 2010-03-22 20:21:35 UTC
Created attachment 401866 [details]
bind-9.6.2-1.fc12.x86_64 named.run

Comment 26 Henrik Nordström 2010-03-22 20:25:22 UTC
There was no named.debug.0 file. But I attached messages and named.run instead which may help..

I can usually reproduce the error in some minutes. But not always. 

Other thing observed is that it's quite slow responding. Not sure if that's due to the DNSSec overhead or simply because the internet link is very congested at the moment.

Comment 27 Adam Tkac 2010-03-23 11:46:28 UTC
(In reply to comment #26)
> There was no named.debug.0 file. But I attached messages and named.run instead
> which may help..

Thanks for your feedback, I think I finally figured where is the problem. Would it be possible to test "bind-9.6.2-1.fc12.2.rh507429" from http://kojiweb.fedoraproject.org/koji/taskinfo?taskID=2069489, please? Use same debug configuration as before.

Test package should crash like unpatched one when it hits this issue but will write a message to log that indicates the bug was triggered. On the other hand, if my theory about this bug is not correct, named crashes as well and appropriate message is written to the log. After you get information from the test package please remove it and install the distribution one.

> Other thing observed is that it's quite slow responding. Not sure if that's due
> to the DNSSec overhead or simply because the internet link is very congested at
> the moment.    

Packets with DNSSEC related records are bigger than plain DNS so it might be a reason. But generally only the first lookups are slow, before top level domains records are validated & cached, then next lookups are quite fast.

Comment 28 Henrik Nordström 2010-03-23 20:15:42 UTC
Looks promising. End of named.debug with that version says:

resquery 0x7f0ee573b010 (fctx 0x7f0ee5735410(99.79.208.in-addr.arpa/DS)): sent
Query has been resent but now is destroyed. I got you, bug #507429 (hopefully)

Comment 29 Adam Tkac 2010-03-26 13:52:54 UTC
(In reply to comment #28)
> Looks promising. End of named.debug with that version says:
> 
> resquery 0x7f0ee573b010 (fctx 0x7f0ee5735410(99.79.208.in-addr.arpa/DS)): sent
> Query has been resent but now is destroyed. I got you, bug #507429 (hopefully)    

Thanks for your participation. I will prepare a patch and submit an updated package to Fedora.

Comment 30 Fedora Update System 2010-03-26 15:11:28 UTC
bind-9.6.2-3.P1.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc11

Comment 31 Fedora Update System 2010-03-26 15:11:38 UTC
bind-9.6.2-3.P1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc12

Comment 32 Fedora Update System 2010-03-30 02:14:31 UTC
bind-9.6.2-3.P1.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update bind'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc12

Comment 33 Fedora Update System 2010-03-30 02:18:12 UTC
bind-9.6.2-3.P1.fc11 has been pushed to the Fedora 11 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update bind'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc11

Comment 34 Fedora Update System 2010-04-03 04:35:19 UTC
bind-9.6.2-3.P1.fc11 has been pushed to the Fedora 11 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 35 Fedora Update System 2010-04-03 04:46:00 UTC
bind-9.6.2-3.P1.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.