426613 – named stops responding

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	bind
Sub Component:
Version:	8
Hardware:	powerpc
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Adam Tkac
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	426615

Reported:	2007-12-23 00:05 UTC by David Woodhouse
Modified:	2013-04-30 23:37 UTC
CC List:	2 users
Fixed In Version:	6.6-39.fc8
Clone Of:
Environment:
Last Closed:	2009-01-09 07:32:50 UTC
Type:	---
Embargoed:
Dependent Products:

Description David Woodhouse 2007-12-23 00:05:28 UTC

A number of times now, named has simply stopped responding to requests.

(gdb) t a a bt

Thread 7 (Thread 4160156784 (LWP 7686)):
#0  0x1faf9964 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x1faf2ab0 in pthread_mutex_lock () from /lib/libpthread.so.0
#2  0x1fe72544 in fetch_callback (task=<value optimized out>, ev=0xf3dfc888)
    at adb.c:3322
#3  0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#4  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#5  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 6 (Thread 4149671024 (LWP 7687)):
#0  0x1faf9964 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x1faf2ab0 in pthread_mutex_lock () from /lib/libpthread.so.0
#2  0x1ff1aa84 in dns_resolver_createfetch2 (res=0xf3f6d008, name=0xf2c44098, 
    type=28, domain=0xf2c44188, nameservers=0xf2c46308, 
    forwarders=<value optimized out>, client=0xf2c3d228, id=0, options=32, 
    task=0xf3dcbf10, action=0x20026b40 <query_resume>, arg=0xf2c3d008, 
    rdataset=0xf2c46408, sigrdataset=0x0, fetchp=0xf2c3d1c4) at resolver.c:6831
#3  0x20020bb4 in query_recurse (client=0xf2c3d008, qtype=28, 
    qdomain=0xf2c44188, nameservers=<value optimized out>) at query.c:3018
#4  0x20025c68 in query_find (client=0xf2c3d008, event=0x0, qtype=28)
    at query.c:3779
#5  0x20026910 in ns_query_start (client=0xf2c3d008) at query.c:4607
#6  0x20016468 in client_request (task=<value optimized out>, 
    event=<value optimized out>) at client.c:1783
#7  0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#8  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#9  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 5 (Thread 4139185264 (LWP 7688)):
#0  0x1faf53b0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x1fb97ccc in isc_rwlock_lock (rwl=0xf3f13b68, type=<value optimized out>)
    at rwlock.c:257
#2  0x1fec10a8 in detachnode (db=0xf3f21008, targetp=0xf6b6e874)
    at rbtdb.c:4335
#3  0x1fe7890c in dns_db_detachnode (db=0xf3f21008, nodep=0xf6b6e874)
    at db.c:525
#4  0x1ff3a434 in dns_view_find (view=0xf7fad008, name=0xf2a8b21c, type=1, 
    now=1198355445, options=1, use_hints=isc_boolean_true, dbp=0x0, nodep=0x0, 
    foundname=0xf6b6e92c, rdataset=0xf6b6e8f0, sigrdataset=0x0) at view.c:885
#5  0x1fe70b98 in dbfind_name (adbname=0xf2a8b218, now=1198355445, rdtype=1)
    at adb.c:3188
#6  0x1fe71f10 in dns_adb_createfind (adb=0x2020ed10, task=0xf3f6e6f0, 
    action=0x1ff17490 <fctx_finddone>, arg=0xf3cc7208, name=0xf6b6ec30, 
    qname=0xf3cc7210, qtype=15, options=<value optimized out>, now=1198355445, 
    target=0x0, port=53, findp=0xf6b6ebdc) at adb.c:2605
#7  0x1ff1476c in findname (fctx=0xf3cc7208, name=<value optimized out>, 
    port=0, options=15, flags=0, now=1198355445, need_alternate=0xf6b6eca4)
    at resolver.c:2139
#8  0x1ff167c4 in fctx_try (fctx=<value optimized out>) at resolver.c:2342
#9  0x1ff17758 in fctx_finddone (task=<value optimized out>, event=0x0)
    at resolver.c:1882
#10 0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#11 0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#12 0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 4 (Thread 4128699504 (LWP 7689)):
#0  0x1faf53b0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x1fb97e04 in isc_rwlock_lock (rwl=0xf3f13b68, type=<value optimized out>)
    at rwlock.c:333
#2  0x1febe80c in addrdataset (db=0xf3f21008, node=0xf2ad05f0, version=0x0, 
    now=1198355445, rdataset=0xf616caa4, options=0, addedrdataset=0xf349bbf8)
    at rbtdb.c:5464
#3  0x1fe79d5c in dns_db_addrdataset (db=0xf3f21008, node=0xf2ad05f0, 
    version=0x0, now=1198355445, rdataset=0xf616caa4, options=0, 
    addedrdataset=0xf349bbf8) at db.c:667
#4  0x1feaa9b0 in dns_ncache_add (message=0xf35b2110, cache=0xf3f21008, 
    node=0xf2ad05f0, covers=28, now=1198355445, maxttl=0, 
    addedrdataset=0xf349bbf8) at ncache.c:258
#5  0x1ff187fc in ncache_adderesult (message=0xf35b2110, cache=0xf3f21008, 
    node=0xf2ad05f0, covers=28, now=1198355445, maxttl=10800, 
    ardataset=0xf349bbf8, eresultp=0xf616dd7c) at resolver.c:4199
#6  0x1ff1d314 in resquery_response (task=0xf3f6e828, 
    event=<value optimized out>) at resolver.c:4363
#7  0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#8  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#9  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 4118213744 (LWP 7690)):
#0  0x1faf59b0 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1  0x1fbb6c40 in isc_condition_waituntil (c=0xf7f74040, m=0xf7f74010, 
    t=0xf7f74038) at condition.c:59
#2  0x1fba2d58 in run (uap=<value optimized out>) at timer.c:719
#3  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#4  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 2 (Thread 4107727984 (LWP 7691)):
#0  0x1fa305f8 in select () from /lib/libc.so.6
#1  0x1fbb3e00 in watcher (uap=0x2007c220) at socket.c:2524
#2  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#3  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 4160425984 (LWP 7685)):
#0  0x1f97d00c in __sigsuspend (set=0xfff2fa64)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  0x1fba4494 in isc_app_run () at app.c:533
#2  0x20046f00 in main (argc=<value optimized out>, argv=<value optimized out>)
    at ./main.c:878
#3  0x1f96456c in generic_start_main (main=0x20046720 <main>, argc=3, 
    ubp_av=0xfff2fe64, auxvec=0xfff2fe94, init=<value optimized out>, 
    fini=<value optimized out>, rtld_fini=<value optimized out>, 
    stack_end=<value optimized out>) at ../csu/libc-start.c:220
#4  0x1f96473c in __libc_start_main (argc=<value optimized out>, 
    ubp_av=<value optimized out>, ubp_ev=<value optimized out>, 
    auxvec=<value optimized out>, rtld_fini=<value optimized out>, 
    stinfo=<value optimized out>, stack_on_entry=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:92
#5  0x00000000 in ?? ()
0x1f97d00c      63        return INLINE_SYSCALL (rt_sigsuspend, 2, CHECK_SIGSET
(set), _NSIG / 8);
(gdb) 
(gdb) c
Continuing.
Warning:
Cannot insert breakpoint -3.
Error accessing memory address 0xfffffffff7fdd630: Input/output error.

Comment 1 Fedora Update System 2008-01-03 01:34:32 UTC

gdb-6.6-39.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 2 David Woodhouse 2008-01-03 01:56:31 UTC

I think #426615 should have been closed, not this.

Comment 3 Adam Tkac 2008-01-03 09:09:54 UTC

(In reply to comment #2)
> I think #426615 should have been closed, not this.

Definitely. Would it be possible to upload a core file somewhere? Also, please tell
me whether named consumes a lot of CPU and memory when it stops responding. Log
messages and named.conf would also be useful.

Thanks

Comment 4 David Woodhouse 2008-01-03 11:34:53 UTC

I don't have a core file -- it was still running. Now that I have a working GDB
installed, how about I just let you poke at it directly? If you mail me a ssh
public key, I'll set you up with an account on the machine.

Comment 5 Fajar A. Nugraha 2008-02-05 12:35:06 UTC

The same thing happened to me. The ppc64 build gives better information, as shown at
http://marc.info/?l=bind-users&m=120217756009100&w=2
Temporary workaround: rebuild with --disable-atomic and start named with -n 2.
The server works fine after that.

Comment 6 Adam Tkac 2008-02-05 14:53:03 UTC

Hm, this might be a problem in ISC's implementation of atomic "exchange and add"
on PPC. Would it be possible to test
http://people.redhat.com/atkac/test_srpms/bind-9.5.0-24.1.b1.fc8.src.rpm (source
rpm)? The address of the ppc F8 koji build will be added soon. Please tell me if you
still have problems.

Comment 7 Adam Tkac 2008-02-05 15:01:52 UTC

http://koji.fedoraproject.org/koji/taskinfo?taskID=396508 - build for F8. Please
test the ppc package, not ppc64. If you need a specific build for some other OS
(a different Fedora release, etc.), let me know.

Comment 8 Fajar A. Nugraha 2008-02-06 02:45:46 UTC

Testing it right now on a not-so-busy server. I've rebuilt the SRPM for RHEL
5.1. Please provide a RHEL 5 build for further fixes :)
Why the ppc package and not ppc64? We'd like to use the ppc64 binary, as it should
be able to use more memory for cache.

Comment 9 Adam Tkac 2008-02-06 10:39:07 UTC

If you really need that much memory, a ppc64 binary makes sense for you. But ppc64
binaries are significantly slower than ppc32 binaries. Also, I'm not sure whether the
patch for this issue will work on ppc64, because it's ppc assembler :) I need
to know whether the problem is really in bind's atomic operations; if so, I will
discuss a proper fix with upstream.

Comment 10 David Woodhouse 2008-02-06 20:32:36 UTC

Why do you say that ppc64 binaries are "significantly slower" than ppc32
binaries? That shouldn't be the case in general -- if it is, please file a bug
so that we can investigate. They're _slightly_ slower, because of the use of
function descriptors and the 'bloat' of 64-bit pointers and 'long' types. But
for the case where a single userspace process actually _wants_ more than 4GiB of
memory, that isn't entirely bloat. 

The same atomic functions should work for 32-bit and 64-bit code as long as they
get built in, although obviously they operate on 32-bit 'atomic' type only.
Since the function prototype indicates that the type is 'isc_int32_t', that
seems OK.

It looks like the original version of xadd can be given input values in r6, and
it will scribble on them before using them. The new version imported from glibc
won't do that -- but don't you have to fix isc_atomic_cmpxchg() too, which
has the same problem?
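
For readers following the discussion, here is a rough model of what the two primitives are expected to return, as such operations are conventionally defined: "exchange and add" yields the value held before the addition, and compare-and-exchange stores the new value only when the current one matches, returning the prior value either way. This is a lock-based illustration only, not BIND's PPC assembler; the class name AtomicInt32Model and the helper _wrap_int32 are hypothetical.

```python
import threading


def _wrap_int32(value):
    """Wrap an integer into the signed 32-bit range, as a 32-bit add would."""
    return ((value + 2 ** 31) % 2 ** 32) - 2 ** 31


class AtomicInt32Model:
    """Lock-based model of a 32-bit atomic cell (illustrative, not BIND's code)."""

    def __init__(self, value=0):
        self._value = _wrap_int32(value)
        self._lock = threading.Lock()

    def xadd(self, delta):
        """Add delta and return the value held before the addition."""
        with self._lock:
            previous = self._value
            self._value = _wrap_int32(previous + delta)
            return previous

    def cmpxchg(self, expected, new):
        """Store new only if the cell equals expected; return the prior value."""
        with self._lock:
            previous = self._value
            if previous == expected:
                self._value = _wrap_int32(new)
            return previous
```

A routine that corrupted its inputs before using them would break exactly this contract, for example by returning something other than the prior value, which is why the register constraints in the inline assembler were worth checking.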

Comment 11 David Woodhouse 2008-02-07 18:31:51 UTC

(In reply to comment #10)
> It looks like the original version of xadd can be given input values in r6, and
> it will scribble on them before using them. The new version imported from glibc
> won't do that -- but don't you have to fix isc_atomic_cmpxchg() too, which
> has the same problem?

I lie. GCC documentation promises that the clobbered register won't be used as
an input. So I don't see any problems with the existing routines.

Comment 12 Fajar A. Nugraha 2008-02-11 01:29:45 UTC

The patch doesn't work as intended, though. Named, ppc binary, simply stops
responding. On the other hand, the binary compiled with --disable-atomic is
still running correctly, so I believe the problem IS because of the atomic
operation.

Comment 13 Adam Tkac 2008-02-13 15:22:27 UTC

(In reply to comment #12)
> The patch doesn't work as intended, though. Named, ppc binary, simply stops
> responding. On the other hand, the binary compiled with --disable-atomic is
> still running correctly, so I believe the problem IS because of the atomic
> operation.

We checked the disassembled code with David Woodhouse, and it looks like the problem
is not directly in the atomic operations. That's weird, because this problem hits only
ppc. Please report if the binary without atomic operations starts to have the same
problems. I'm going to discuss this with ISC.

Comment 14 Fajar A. Nugraha 2008-02-15 06:04:03 UTC

Actually, we DO have a problem. We have two servers running bind 9.5.0b1, ppc,
without atomic. 

On one server, it dies, and on /var/log/messages I get this log :

Feb 13 04:01:15 <hostname here> out of memory [13668]

which SHOULD be normal, because max memory for 32bit userspace app is limited
(what is the limit BTW? 2G? 4G?). What makes it abnormal is that :
- named was consuming over 2G of memory (RES from htop)
- max-cache-size was set to 1G
- max-acache-size was set to 256M

It MIGHT be related to http://marc.info/?l=bind-users&m=119715064818829&w=2 , so
not really a ppc-specific problem

On the other server, however, named simply dies (after about one week), with no
message about a memory error in /var/log/messages. How can I set named to log why
it dies, or to leave a core dump if it crashes?

On a side note, these two cases are actually better than the one with atomic,
because in this case I can simply use some kind of supervised system to (re)start
named when it crashes.

Comment 15 Adam Tkac 2008-02-15 11:14:35 UTC

(In reply to comment #14)
> which SHOULD be normal, because max memory for 32bit userspace app is limited
> (what is the limit BTW? 2G? 4G?). What makes it abnormal is that :
> - named was consuming over 2G of memory (RES from htop)
> - max-cache-size was set to 1G
> - max-acache-size was set to 256M

Pointers are 32-bit, so the limit should be 4G. Btw, as David Woodhouse wrote above, you
can use the ppc64 binary, because the atomic operations are the same on both ppc and
ppc64. (Or, if you disable atomics, there's also no problem with the ppc64 binary.)
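
The 4G figure follows from pointer width alone. A small sketch of the arithmetic, treating the result as an upper bound on addressable bytes (the usable amount is lower, since the stack, libraries and, depending on the kernel, part of the address space are reserved). It also sums the two cache limits quoted above, assuming the G and M suffixes mean powers of 1024; the names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def address_space_bytes(pointer_bits):
    """Number of distinct byte addresses a pointer of this width can name."""
    return 2 ** pointer_bits


# Upper bound for a process with 32-bit pointers.
limit_32bit = address_space_bytes(32)

# max-cache-size 1G plus max-acache-size 256M, as reported in comment 14.
configured_caches = 1 * GIB + 256 * MIB
```

Here limit_32bit equals 4 GiB and configured_caches equals 1.25 GiB, so a resident size above 2G is well beyond what the two cache limits alone would account for.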

> 
> It MIGHT be related to http://marc.info/?l=bind-users&m=119715064818829&w=2 , so
> not really a ppc-specific problem
> 
> On the other server, however, named simply dies (after about one week), no
> message about memory error on /var/log/messages. How can I set named to log why
> it dies, or to leave core dump if it crashed?

This is quite problematic. BIND's working directory has to be writable. The best
way is to set named's working directory to /var/named/data (the directory option in
the options statement in named.conf) and modify all zone paths (add '../' to the start
of each zone path). If you have many zones, you can simply 'chmod 775 /var/named' and
set the SELinux policy appropriately. After a segfault you will find the core file in
named's working directory.
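
Two things have to hold for a core file to appear: the working directory must be writable by the process, and its core-size resource limit must not be zero. A minimal sketch of checking both, with the caveat that it inspects the process running the check, so it would need to run as the same user and with the same limits as named; the function name is hypothetical.

```python
import os
import resource


def core_dump_preconditions(workdir):
    """Report whether this process could leave a core file in workdir."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_CORE)
    return {
        # The kernel writes the core file into the current working directory.
        "workdir_writable": os.path.isdir(workdir) and os.access(workdir, os.W_OK),
        # A soft limit of 0 suppresses core files entirely.
        "core_limit_nonzero": soft_limit != 0,
    }
```

Calling core_dump_preconditions("/var/named/data") as the named user would show which of the two conditions still needs attention. SELinux policy is a separate layer that this check does not cover.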

Comment 16 Adam Tkac 2008-09-24 09:19:30 UTC

Could someone verify whether this is still reproducible with the latest BIND packages, please?

Comment 17 Bug Zapper 2008-11-26 09:07:52 UTC

This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 18 Bug Zapper 2009-01-09 07:32:50 UTC

Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
