A number of times now, named has simply stopped responding to requests.

(gdb) t a a bt

Thread 7 (Thread 4160156784 (LWP 7686)):
#0  0x1faf9964 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x1faf2ab0 in pthread_mutex_lock () from /lib/libpthread.so.0
#2  0x1fe72544 in fetch_callback (task=<value optimized out>, ev=0xf3dfc888) at adb.c:3322
#3  0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#4  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#5  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 6 (Thread 4149671024 (LWP 7687)):
#0  0x1faf9964 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x1faf2ab0 in pthread_mutex_lock () from /lib/libpthread.so.0
#2  0x1ff1aa84 in dns_resolver_createfetch2 (res=0xf3f6d008, name=0xf2c44098, type=28, domain=0xf2c44188, nameservers=0xf2c46308, forwarders=<value optimized out>, client=0xf2c3d228, id=0, options=32, task=0xf3dcbf10, action=0x20026b40 <query_resume>, arg=0xf2c3d008, rdataset=0xf2c46408, sigrdataset=0x0, fetchp=0xf2c3d1c4) at resolver.c:6831
#3  0x20020bb4 in query_recurse (client=0xf2c3d008, qtype=28, qdomain=0xf2c44188, nameservers=<value optimized out>) at query.c:3018
#4  0x20025c68 in query_find (client=0xf2c3d008, event=0x0, qtype=28) at query.c:3779
#5  0x20026910 in ns_query_start (client=0xf2c3d008) at query.c:4607
#6  0x20016468 in client_request (task=<value optimized out>, event=<value optimized out>) at client.c:1783
#7  0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#8  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#9  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 5 (Thread 4139185264 (LWP 7688)):
#0  0x1faf53b0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x1fb97ccc in isc_rwlock_lock (rwl=0xf3f13b68, type=<value optimized out>) at rwlock.c:257
#2  0x1fec10a8 in detachnode (db=0xf3f21008, targetp=0xf6b6e874) at rbtdb.c:4335
#3  0x1fe7890c in dns_db_detachnode (db=0xf3f21008, nodep=0xf6b6e874) at db.c:525
#4  0x1ff3a434 in dns_view_find (view=0xf7fad008, name=0xf2a8b21c, type=1, now=1198355445, options=1, use_hints=isc_boolean_true, dbp=0x0, nodep=0x0, foundname=0xf6b6e92c, rdataset=0xf6b6e8f0, sigrdataset=0x0) at view.c:885
#5  0x1fe70b98 in dbfind_name (adbname=0xf2a8b218, now=1198355445, rdtype=1) at adb.c:3188
#6  0x1fe71f10 in dns_adb_createfind (adb=0x2020ed10, task=0xf3f6e6f0, action=0x1ff17490 <fctx_finddone>, arg=0xf3cc7208, name=0xf6b6ec30, qname=0xf3cc7210, qtype=15, options=<value optimized out>, now=1198355445, target=0x0, port=53, findp=0xf6b6ebdc) at adb.c:2605
#7  0x1ff1476c in findname (fctx=0xf3cc7208, name=<value optimized out>, port=0, options=15, flags=0, now=1198355445, need_alternate=0xf6b6eca4) at resolver.c:2139
#8  0x1ff167c4 in fctx_try (fctx=<value optimized out>) at resolver.c:2342
#9  0x1ff17758 in fctx_finddone (task=<value optimized out>, event=0x0) at resolver.c:1882
#10 0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#11 0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#12 0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 4 (Thread 4128699504 (LWP 7689)):
#0  0x1faf53b0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x1fb97e04 in isc_rwlock_lock (rwl=0xf3f13b68, type=<value optimized out>) at rwlock.c:333
#2  0x1febe80c in addrdataset (db=0xf3f21008, node=0xf2ad05f0, version=0x0, now=1198355445, rdataset=0xf616caa4, options=0, addedrdataset=0xf349bbf8) at rbtdb.c:5464
#3  0x1fe79d5c in dns_db_addrdataset (db=0xf3f21008, node=0xf2ad05f0, version=0x0, now=1198355445, rdataset=0xf616caa4, options=0, addedrdataset=0xf349bbf8) at db.c:667
#4  0x1feaa9b0 in dns_ncache_add (message=0xf35b2110, cache=0xf3f21008, node=0xf2ad05f0, covers=28, now=1198355445, maxttl=0, addedrdataset=0xf349bbf8) at ncache.c:258
#5  0x1ff187fc in ncache_adderesult (message=0xf35b2110, cache=0xf3f21008, node=0xf2ad05f0, covers=28, now=1198355445, maxttl=10800, ardataset=0xf349bbf8, eresultp=0xf616dd7c) at resolver.c:4199
#6  0x1ff1d314 in resquery_response (task=0xf3f6e828, event=<value optimized out>) at resolver.c:4363
#7  0x1fba0000 in run (uap=<value optimized out>) at task.c:874
#8  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#9  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 4118213744 (LWP 7690)):
#0  0x1faf59b0 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x1fbb6c40 in isc_condition_waituntil (c=0xf7f74040, m=0xf7f74010, t=0xf7f74038) at condition.c:59
#2  0x1fba2d58 in run (uap=<value optimized out>) at timer.c:719
#3  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#4  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 2 (Thread 4107727984 (LWP 7691)):
#0  0x1fa305f8 in select () from /lib/libc.so.6
#1  0x1fbb3e00 in watcher (uap=0x2007c220) at socket.c:2524
#2  0x1faf0bd4 in start_thread () from /lib/libpthread.so.0
#3  0x1fa393d4 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 4160425984 (LWP 7685)):
#0  0x1f97d00c in __sigsuspend (set=0xfff2fa64) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  0x1fba4494 in isc_app_run () at app.c:533
#2  0x20046f00 in main (argc=<value optimized out>, argv=<value optimized out>) at ./main.c:878
#3  0x1f96456c in generic_start_main (main=0x20046720 <main>, argc=3, ubp_av=0xfff2fe64, auxvec=0xfff2fe94, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=<value optimized out>) at ../csu/libc-start.c:220
#4  0x1f96473c in __libc_start_main (argc=<value optimized out>, ubp_av=<value optimized out>, ubp_ev=<value optimized out>, auxvec=<value optimized out>, rtld_fini=<value optimized out>, stinfo=<value optimized out>, stack_on_entry=<value optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:92
#5  0x00000000 in ?? ()

0x1f97d00c  63  return INLINE_SYSCALL (rt_sigsuspend, 2, CHECK_SIGSET (set), _NSIG / 8);
(gdb) c
Continuing.
Warning: Cannot insert breakpoint -3.
Error accessing memory address 0xfffffffff7fdd630: Input/output error.
gdb-6.6-39.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.
I think #426615 should have been closed, not this.
(In reply to comment #2)
> I think #426615 should have been closed, not this.

Definitely. Would it be possible to upload the core file somewhere? Also, please tell me whether named consumes much CPU and memory when it stops responding. Log messages and named.conf will also be useful. Thanks
I don't have a core file -- it was still running. Now that I have a working GDB installed, how about I just let you poke at it directly? If you mail me a ssh public key, I'll set you up with an account on the machine.
Same thing happened to me. The ppc64 build gives better info, as shown at http://marc.info/?l=bind-users&m=120217756009100&w=2

Temporary workaround: rebuild with --disable-atomic, and start named with -n 2. The server works fine after that.
Hm, this might be a problem in ISC's implementation of atomic "exchange and add" on PPC. Would it be possible to test http://people.redhat.com/atkac/test_srpms/bind-9.5.0-24.1.b1.fc8.src.rpm (source rpm)? The address of the ppc F8 koji build will be added soon. Please tell me if you still have problems.
http://koji.fedoraproject.org/koji/taskinfo?taskID=396508 - build for F8. Please test the ppc package, not ppc64. If you need a specific build against some other OS (a different Fedora, etc.), tell me.
Testing it right now on a not-so-busy server. I've rebuilt the SRPM for RHEL 5.1. Please provide a RHEL 5 build for further fixes :)

Why the ppc package, and not ppc64? We'd like to use the ppc64 binary as it should be able to use more memory for cache.
If you really need that much memory, the ppc64 binary makes sense for you. But ppc64 binaries are significantly slower than ppc32 binaries. Also, I'm not sure the patch for this issue will work on ppc64, because it's ppc assembler :) I need to know whether the problem really is in BIND's atomic operations, and if so, I will discuss a proper fix upstream.
Why do you say that ppc64 binaries are "significantly slower" than ppc32 binaries? That shouldn't be the case in general -- if it is, please file a bug so that we can investigate. They're _slightly_ slower, because of the use of function descriptors and the 'bloat' of 64-bit pointers and 'long' types. But for the case where a single userspace process actually _wants_ more than 4GiB of memory, that isn't entirely bloat.

The same atomic functions should work for 32-bit and 64-bit code as long as they get built in, although obviously they operate on the 32-bit 'atomic' type only. Since the function prototype indicates that the type is 'isc_int32_t', that seems OK.

It looks like the original version of xadd can be given input values in r6, and it will scribble on them before using them. The new version imported from glibc won't do that -- but don't you also have to fix isc_atomic_cmpxchg() too, which has the same problem?
(In reply to comment #10)
> It looks like the original version of xadd can be given input values in r6,
> and it will scribble on them before using them. The new version imported from
> glibc won't do that -- but don't you also have to fix isc_atomic_cmpxchg()
> too, which has the same problem?

I lie. GCC documentation promises that a clobbered register won't be used as an input. So I don't see any problems with the existing routines.
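The GCC guarantee referred to above -- operands are never allocated to a register named in the clobber list -- can be illustrated with a small hypothetical extended-asm routine. This sketch assumes an x86-64 host purely for illustration (the BIND routine under discussion is PPC assembly, where r6 plays the role %ecx plays here):

```c
/* add_with_scratch: add two ints via an explicit scratch register.
 * Because "ecx" is in the clobber list, GCC promises never to place
 * a, b, or out in %ecx, so scribbling on it in the template is safe
 * -- the same reasoning that makes the PPC xadd's use of r6 safe. */
static int add_with_scratch(int a, int b)
{
	int out;

	__asm__ ("movl %1, %%ecx\n\t"
		 "addl %2, %%ecx\n\t"
		 "movl %%ecx, %0"
		 : "=r" (out)
		 : "r" (a), "r" (b)
		 : "ecx");
	return out;
}
```

Had %ecx not been declared clobbered, GCC would have been free to pass an input in it, and the first movl would silently destroy that input before it was read.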
The patch doesn't work as intended, though. Named, the ppc binary, simply stops responding. On the other hand, the binary compiled with --disable-atomic is still running correctly, so I believe the problem IS because of the atomic operations.
(In reply to comment #12)
> The patch doesn't work as intended, though. Named, the ppc binary, simply
> stops responding. On the other hand, the binary compiled with --disable-atomic
> is still running correctly, so I believe the problem IS because of the atomic
> operations.

We checked the disassembled code with David Woodhouse, and it looks like the problem is not directly in the atomic operations. That's weird, because this problem hits only ppc. Please report if the binary without atomic operations starts having the same problems. I'm going to discuss this with ISC.
Actually, we DO have a problem. We have two servers running bind 9.5.0b1, ppc, without atomics.

On one server, named dies, and in /var/log/messages I get this log:

Feb 13 04:01:15 <hostname here> out of memory [13668]

which SHOULD be normal, because max memory for a 32-bit userspace app is limited (what is the limit, BTW? 2G? 4G?). What makes it abnormal is that:
- named was consuming over 2G of memory (RES from htop)
- max-cache-size was set to 1G
- max-acache-size was set to 256M

It MIGHT be related to http://marc.info/?l=bind-users&m=119715064818829&w=2, so not really a ppc-specific problem.

On the other server, however, named simply dies (after about one week), with no message about a memory error in /var/log/messages. How can I set named to log why it dies, or to leave a core dump if it crashed?

On a side note, these two cases are actually better than the one with atomics, because in this case I can simply use some kind of supervisor to (re)start named when it crashes.
(In reply to comment #14)
> which SHOULD be normal, because max memory for a 32-bit userspace app is
> limited (what is the limit, BTW? 2G? 4G?). What makes it abnormal is that:
> - named was consuming over 2G of memory (RES from htop)
> - max-cache-size was set to 1G
> - max-acache-size was set to 256M

Pointers are 32-bit, so the limit should be 4G. Btw, as David Woodhouse wrote above, you can use the ppc64 binary, because the atomic operations are the same on both ppc and ppc64. (Or, if you disabled atomics, there's also no problem with the ppc64 binary.)

> On the other server, however, named simply dies (after about one week), with
> no message about a memory error in /var/log/messages. How can I set named to
> log why it dies, or to leave a core dump if it crashed?

This is quite problematic. BIND's working directory has to be writable. The best way is to set named's working directory to /var/named/data (the directory option in the options statement in named.conf) and modify all zone paths (add '../' to the start of each zone). If you have many zones, you can simply 'chmod 775 /var/named' and set the SELinux policy appropriately. After a segfault you will find the core file in named's working directory.
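A minimal named.conf sketch of the working-directory arrangement described above. The zone name and file path are illustrative placeholders, not from this bug:

```
options {
        // writable by named, so a core file can be dumped here
        directory "/var/named/data";
};

zone "example.com" IN {
        type master;
        // zone files still live in /var/named, hence the "../"
        file "../example.com.zone";
};
```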
Could someone verify this is still reproducible with the latest BIND packages, please?
This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.