Bug 507429
Summary:            named crashes with an assertion failure (fctx->queries->head is NULL)
Product:            Fedora
Component:          bind
Status:             CLOSED ERRATA
Severity:           high
Priority:           low
Version:            11
Hardware:           All
OS:                 Linux
Reporter:           Andrew McNabb <amcnabb>
Assignee:           Adam Tkac <atkac>
QA Contact:         Fedora Extras Quality Assurance <extras-qa>
CC:                 atkac, carl.robben, fabrice, gkramm, hcamp, henrik, marcus, neil, nick, ovasik, p.mayers, pwouters, steve
Keywords:           Reopened
Fixed In Version:   bind-9.6.2-3.P1.fc12
Doc Type:           Bug Fix
Cloned As:          576906 (view as bug list)
Bug Blocks:         516998, 576906
Last Closed:        2010-04-03 04:35:33 UTC
Description    Andrew McNabb    2009-06-22 18:14:00 UTC
Would it be possible to specify which version of bind crashed, please? (You can find it in /var/log/messages, or from `named -v` or `rpm -q bind` output.)

bind-9.6.1-0.4.rc1.fc11.x86_64

I am also getting the same problem (crashes regularly, every few minutes after a bind restart) after a yum update to bind-9.6.1-0.4.rc1.fc11.i586:

Jun 24 13:37:58 pooh named[3345]: not insecure resolving 'org/NS/IN': 87.194.0.66#53
Jun 24 13:37:58 pooh named[3345]: success resolving 'tor1.ns.ahbl.org/AAAA' (in 'ahbl.org'?) after disabling EDNS
Jun 24 13:37:58 pooh named[3345]: success resolving 'org/NS' (in 'org'?) after disabling EDNS
Jun 24 13:37:58 pooh named[3345]: resolver.c:2927: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jun 24 13:37:58 pooh named[3345]: exiting (due to assertion failure)

Update: rolling back to the original F11 RPMs didn't help. So I updated again and set:

dnssec-enable no;
dnssec-validation no;

in named.conf. No crashes for a couple of days now, so it looks much more stable.

Would it be possible to attach a backtrace, please? You can obtain it this way:

1. $ chmod g+w /var/named/  # temporarily allow /var/named/ to be writable by the named user
2. Wait for a crash.
3. $ debuginfo-install bind
4. $ gdb /usr/sbin/named /var/named/core.*
5. In the gdb command line, run "t a a bt full".
6. Attach the output here.
7. $ chmod g-w /var/named/

Thanks.

I had another crash this morning before I had a chance to do the chmod you recommended. This time, there was a different assertion failure:

Jun 29 09:03:01 guru named[8529]: resolver.c:2925: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed

Should I open a separate ticket for this one?

(In reply to comment #6)
> I had another crash this morning before I had a chance to do the chmod you
> recommended. This time, there was a different assertion failure:
>
> Jun 29 09:03:01 guru named[8529]: resolver.c:2925: REQUIRE(fctx->state ==
> fetchstate_done || fctx->state == fetchstate_init) failed
>
> Should I open a separate ticket for this one?

I don't think so; in my opinion both crashes are related.

Created attachment 349854 [details]
backtrace from gdb
I am attaching a backtrace corresponding to the following assertion error:
resolver.c:2925: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed
(In reply to comment #8)
> Created an attachment (id=349854) [details]
> backtrace from gdb
>
> I am attaching a backtrace corresponding to the following assertion error:
>
> resolver.c:2925: REQUIRE(fctx->state == fetchstate_done || fctx->state ==
> fetchstate_init) failed

Thanks, reported to upstream developers.

I have just hit an error that I believe is the same. These are from fully patched and current CentOS 5.3 servers, both running bind version 9.3.4-P1, specifically the package bind.i386 30:9.3.4-10.P1.el5:

Jul 19 14:11:10 ns101 named[11594]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jul 19 14:08:50 ns201 named[22082]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed

Assuming this isn't a CentOS-introduced bug, this seems likely to be impacting RHEL 5.3 as well.

Additional crashes:

Jul 20 02:32:28 ns101 named[26185]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jul 20 01:16:47 ns201 named[28201]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jul 20 02:05:15 ns201 named[23728]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed
Jul 20 03:36:56 ns201 named[25225]: resolver.c:2441: REQUIRE((((fctx->queries).head == ((void *)0)) ? isc_boolean_true : isc_boolean_false)) failed

Is there a poisoned query in the wild that can cause this version to crash out? These installations have been rock solid for quite some time. I haven't trapped a core file yet; I will look at doing so.

It seems this issue is already fixed in the latest version of bind (bind-9.6.1-7.P2.fc11). If it isn't, please reopen this bug.
I'm getting "resolver.c:3032: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed" on a heavily loaded system recently updated to bind-9.6.1-9.P3.fc11.x86_64.

Encountered this bug this morning; bind version is bind-9.6.1-16.P3.fc12.x86_64, running on kernel 2.6.31.12-174.2.3.fc12.x86_64.

I have also encountered this bug this morning running:

kernel: 2.6.31.12-174.2.3.fc12.x86_64
bind: bind-9.6.1-16.P3.fc12.x86_64

We locally re-compile the FC12 RPM for RHEL5, and I just saw this bug. The RPM was the F12 upstream bind-9.6.1-15.P3 (.el5, obviously). We have 3 other identical servers that didn't crash, and these servers have been running unchanged under this configuration for weeks, under heavy load. Is there an upstream bug?

Sorry, I should add that I saw the following error:

17-Feb-2010 00:39:44.880 general: resolver.c:3032: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed
17-Feb-2010 00:39:44.881 general: exiting (due to assertion failure)

Hi all,

Same for me:

Mar 12 00:25:34 named[1125]: resolver.c:3032: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed
Mar 12 00:25:34 named[1125]: exiting (due to assertion failure)

Other information that may be useful:

starting BIND 9.6.1-P3-RedHat-9.6.1-10.P3.fc11 -u named
named[18629]: built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static' '--disable-openssl-version-check' '--with-dlz-ldap=yes' '--with-dlz-postgres=yes' '--with-dlz-mysql=yes' '--with-dlz-filesystem=yes' '--with-gssapi=yes' '--disable-isc-spnego' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS= -DDIG_SIGCHASE'

Hope this helps solve the problem.

Regards,
Fabrice

Same fetchstate_done problem in F12, bind-9.6.2-1.fc12.x86_64. Last messages:

Mar 21 17:38:47 localhost named[6046]: not insecure resolving '130.in-addr.arpa/DNSKEY/IN': 208.79.99.8#53
Mar 21 17:38:50 localhost named[6046]: network unreachable resolving '130.in-addr.arpa/DNSKEY/IN': 2001:500:14:6050:ad::1#53
Mar 21 17:38:50 localhost named[6046]: success resolving '130.in-addr.arpa/NS' (in '130.in-addr.arpa'?) after reducing the advertised EDNS UDP packet size to 512 octets
Mar 21 17:38:50 localhost named[6046]: resolver.c:3059: REQUIRE(fctx->state == fetchstate_done || fctx->state == fetchstate_init) failed

Addendum to the instructions on enabling core dumps: on F12, SELinux by default blocks the core dump even if the directory is made writable, so in addition to the chmod you also need to run:

setsebool named_write_master_zones 1

Backtrace (bind-9.6.2-1.fc12.x86_64, Fedora 12 x86_64, fully up to date):

#0 0x00007f9813c126c5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
 resultvar = 0 pid = <value optimized out> selftid = <value optimized out>
#1 0x00007f9813c13ea5 in abort () at abort.c:92
 save_stage = 2 act = {__sigaction_handler = {sa_handler = 0x7f98163c52a0 <ns_g_lctx>, sa_sigaction = 0x7f98163c52a0 <ns_g_lctx>}, sa_mask = {__val = {140291175560674, 140291184808624, 3059, 140291175554704, 140291159349292, 18446744069414584320, 0, 140291182626544, 140291057735952, 206158430248, 140291057736176, 140291057735984, 140291057735984, 206158430248, 140291057736208, 140291057736016}}, sa_flags = 347592748, sa_restorer = 0x7f9800000000} sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007f9816177fdc in assertion_failed (file=0x7f9815af35e2 "resolver.c", line=3059, type=<value optimized out>, cond=0x7f9815af1e90 "fctx->state == fetchstate_done || fctx->state == fetchstate_init") at ./main.c:162
 No locals.
#3 0x00007f9815a7aa94 in fctx_destroy (fctx=0x7f9807136010) at resolver.c:3058 res = <value optimized out> bucketnum = <value optimized out> sa = <value optimized out> next_sa = <value optimized out> #4 0x00007f9815a85f48 in resume_dslookup (task=<value optimized out>, event=0x0) at resolver.c:5851 fevent = <value optimized out> res = 0x7f9816104180 fctx = 0x7f9807136010 result = <value optimized out> bucket_empty = isc_boolean_false locked = isc_boolean_false bucketnum = <value optimized out> nameservers = {magic = 1145983826, methods = 0x0, link = {prev = 0xffffffffffffffff, next = 0xffffffffffffffff}, rdclass = 0, type = 0, ttl = 0, trust = 0, covers = 0, attributes = 0, count = 4294967295, resign = 0, private1 = 0x0, private2 = 0x0, private3 = 0x0, privateuint4 = 0, private5 = 0x0, private6 = 0x0, private7 = 0x7f980ea95c70} fixed = {name = {magic = 0, ndata = 0x7f9813c75a0a "=\001\360\377\377s\005H\203\304\b\303H\213\r\203\325-", length = 245980064, labels = 32664, attributes = 347757119, offsets = 0x4ba69d25 <Address 0x4ba69d25 out of bounds>, buffer = 0x7703e, link = {prev = 0x7f9807f23b50, next = 0x7f9814b92690}, list = {head = 0x7f9815af365b, tail = 0x7f9807956410}}, offsets = "\001\000\000\000\000\000\000\000\224E\267\024\230\177\000\000\004\000\000\000\002\000\000\000\020p\020\026\230\177\000\000\002\000\000\000\000\000\000\000\020`\020\026\230\177\000\000\002\000\000\000\000\000\000\000\021\017\271\024\230\177\000\000\370E\360\a\230\177\000\000\220\\\251\016\230\177\000\000\000\000\000\000\000\000\000\000\033\021\271\024\230\177\000\000\020d\225\a\230\177", '\000' <repeats 18 times>, "\001\000\000\000\000\000\000", buffer = {magic = 245980312, base = 0x7f9815a7d1af, length = 127238168, used = 32664, current = 363804733, active = 32664, link = {prev = 0x1d0e72304ba69d25, next = 0x7f9807956410}, mctx = 0x1}, data = 
"<\377\267\024\230\177\000\000\000\000\000\000\000\000\000\000P\342x\027\230\177\000\000H\234\360\a\230\177\000\000P\342x\027\230\177\000\000H\234\360\a\230\177\000\000\270\000\000\000\000\000\000\000\255i\272\024\230\177\000\000\214\017\270\024\230\177\000\000h]\251\016\230\177\000\000H\234\360\a\230\177\000\000\300\355\321\025\230\177\000\000H\234\360\a\230\177\000\000\060\275\362\a\230\177\000\000\354\065\267\024\230\177\000\000\000\000\000\000\000\000\000\000\020d\225\a\230\177\000\000\000\000\000\000\000\000\000\000KB\250\025\230\177\000\000\000\000\000\000\000\000\000\000m;\257\025\230\177\000\000\200>\360\a\230\177\000\000\210]\251\016\230\177\000\000\200>\360\a\230\177\000\000\240g\350\a\230\177\000\000p]\251\016\230\177\000\000D\367\377\324\377\177\000\000\260]\251\016\230\177\000\000\000\000\000\000\000\000\000\000\220]\251\016\230\177\000\000\354\367\377\324\377\177\000\000`F\360\a\230\177"} domain = <value optimized out> #5 0x00007f9814b9086c in dispatch (uap=0x7f9816105010) at task.c:862 dispatch_count = 0 done = isc_boolean_false finished = isc_boolean_false requeue = isc_boolean_false event = 0x7f9807f2bd30 task = <value optimized out> #6 run (uap=0x7f9816105010) at task.c:1005 manager = 0x7f9816105010 #7 0x00007f9814749a3a in start_thread (arg=<value optimized out>) at pthread_create.c:297 __res = <value optimized out> now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140291057739536, 928468285281373604, 140736765441968, 0, 0, 3, -877487593660772956, -877432177704727132}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> freesize = <value optimized out> #8 0x00007f9813cbe67d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 No locals. Created attachment 401612 [details]
bind-9.6.2-1.fc12.x86_64 backtrace
Full backtrace of all threads attached.
(In reply to comment #21)
> Created an attachment (id=401612) [details]
> bind-9.6.2-1.fc12.x86_64 backtrace
>
> Full backtrace of all threads attached.

Thanks for it. Unfortunately, I'm still not able to find the circumstances under which the assertion is triggered. Would it be possible to get more debug information this way, please?

1. Stop named.
2. rm -f /var/named/data/named.debug*
3. Add the following to your named.conf (to the logging statement):

   logging {
       channel debug_channel {
           file "/var/named/data/named.debug" versions 1 size 2M;
           severity debug 3;
       };
       category resolver { debug_channel; };
       category dnssec { debug_channel; };
   };

4. Start named and wait till it crashes.
5. Attach the named.debug and named.debug.0 files.

Thank you.

Created attachment 401863 [details]
bind-9.6.2-1.fc12.x86_64 debug log
Created attachment 401865 [details]
bind-9.6.2-1.fc12.x86_64 messages
Created attachment 401866 [details]
bind-9.6.2-1.fc12.x86_64 named.run
There was no named.debug.0 file, but I attached messages and named.run instead, which may help. I can usually reproduce the error within a few minutes, but not always. Another thing I observed is that it's quite slow to respond; I'm not sure if that's due to the DNSSEC overhead or simply because the internet link is very congested at the moment.

(In reply to comment #26)
> There was no named.debug.0 file. But I attached messages and named.run instead
> which may help..

Thanks for your feedback; I think I finally figured out where the problem is. Would it be possible to test "bind-9.6.2-1.fc12.2.rh507429" from http://kojiweb.fedoraproject.org/koji/taskinfo?taskID=2069489, please? Use the same debug configuration as before. The test package should crash like the unpatched one when it hits this issue, but it will write a message to the log indicating that the bug was triggered. On the other hand, if my theory about this bug is not correct, named crashes as well and an appropriate message is written to the log. After you get information from the test package, please remove it and install the distribution one.

> Other thing observed is that it's quite slow responding. Not sure if that's due
> to the DNSSec overhead or simply because the internet link is very congested at
> the moment.

Packets with DNSSEC-related records are bigger than plain DNS, so that might be a reason. But generally only the first lookups are slow, before the top-level domains' records are validated and cached; after that, lookups are quite fast.

Looks promising. The end of named.debug with that version says:

resquery 0x7f0ee573b010 (fctx 0x7f0ee5735410(99.79.208.in-addr.arpa/DS)): sent
Query has been resent but now is destroyed. I got you, bug #507429 (hopefully)

(In reply to comment #28)
> Looks promising. End of named.debug with that version says:
>
> resquery 0x7f0ee573b010 (fctx 0x7f0ee5735410(99.79.208.in-addr.arpa/DS)): sent
> Query has been resent but now is destroyed. I got you, bug #507429 (hopefully)

Thanks for your participation.
I will prepare a patch and submit an updated package to Fedora.

bind-9.6.2-3.P1.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc11

bind-9.6.2-3.P1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc12

bind-9.6.2-3.P1.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update bind'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc12

bind-9.6.2-3.P1.fc11 has been pushed to the Fedora 11 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update bind'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/bind-9.6.2-3.P1.fc11

bind-9.6.2-3.P1.fc11 has been pushed to the Fedora 11 stable repository. If problems still persist, please make note of it in this bug report.

bind-9.6.2-3.P1.fc12 has been pushed to the Fedora 12 stable repository. If problems still persist, please make note of it in this bug report.