Description of problem:
Bind does not shut down properly.

Version-Release number of selected component (if applicable):
Phoebe beta5 bind version 9.2.1-16

How reproducible:
Happens every time

Steps to Reproduce:
1. Install bind and caching-nameserver
2. Start bind
3. Stop bind (look in /var/log/messages)

Actual results:
Bind does not stop

Expected results:
Bind should stop

Additional info:
I've added Jakub and Foo to the cc list because the error message in /var/log/messages would seem to indicate a failed pthread_kill()
Same error with bind 9.2.2-2
This looks like a thread signalling problem. I have sent this message out to try to verify it as a thread problem.

We have a major bug in named thread signal handling. After you try to stop the named application using "service named stop", the service never stops and the application reports:

Mar 10 18:25:30 danlaptop lt-named[28582]: app.c:568: unexpected error:
Mar 10 18:25:30 danlaptop lt-named[28582]: isc_app_shutdown() pthread_kill: No such process

The code causing the problem is the following:

#ifdef HAVE_LINUXTHREADS
        int result;

        result = pthread_kill(main_thread, SIGTERM);
        if (result != 0) {
                isc__strerror(result, strbuf, sizeof(strbuf));
                UNEXPECTED_ERROR(__FILE__, __LINE__,
                                 "isc_app_shutdown() pthread_kill: %s",
                                 strbuf);
                return (ISC_R_UNEXPECTED);
        }
#else

I have checked main_thread and it looks correct. Another code path uses the SIGHUP signal, and this works with main_thread. The other thing I found is that if I run this application without forking, it seems to work correctly.

Any ideas?

Dan
I haven't looked at the sources (yet) but seeing this #ifdef makes the code suspicious. It is probably a work-around for the brokenness of LinuxThreads. NPTL isn't broken when it comes to signal handling and therefore this patch might have negative effects. Try a bind with HAVE_LINUXTHREADS *not* defined.
Undefining HAVE_LINUXTHREADS and rebuilding results in the following error message:

Mar 9 21:59:49 user-2ivf402 named[7712]: Mar 09 21:59:49.873general: error: app.c:569: unexpected error:
Mar 9 21:59:49 user-2ivf402 named[7712]: Mar 09 21:59:49.873general: error: isc_app_shutdown() kill: Operation not permitted
This makes the whole thing even more suspicious. Is there, for platforms like Solaris, another #define to be used? Something like "real POSIX threads"?
There is such an option. I am recompiling bind with the configure patch below, which should enable Solaris-style POSIX threads.

--- bind-9.2.2/configure.orig 2003-03-11 13:46:33.000000000 -0500
+++ bind-9.2.2/configure 2003-03-11 13:46:54.000000000 -0500
@@ -5478,7 +5478,7 @@
 #
 *-linux*)
 cat >>confdefs.h <<\_ACEOF
-#define HAVE_LINUXTHREADS 1
+#define _POSIX_PTHREAD_SEMANTICS 1
 _ACEOF
HA! Defining _POSIX_PTHREAD_SEMANTICS and patching out HAVE_LINUXTHREADS seems to work properly.

Mar 11 14:13:07 razor named[26422]: starting BIND 9.2.2 -u named
Mar 11 14:13:07 razor named[26422]: using 2 CPUs
Mar 11 14:13:07 razor named[26422]: loading configuration from '/etc/named.conf'
Mar 11 14:13:07 razor named: named startup succeeded
Mar 11 14:13:07 razor named[26422]: no IPv6 interfaces found
Mar 11 14:13:07 razor named[26422]: listening on IPv4 interface lo, 127.0.0.1#53
Mar 11 14:13:07 razor named[26422]: listening on IPv4 interface eth0, 172.16.52.73#53
Mar 11 14:13:07 razor named[26422]: command channel listening on 127.0.0.1#953
Mar 11 14:13:07 razor named[26422]: zone 0.0.127.in-addr.arpa/IN: loaded serial 1997022700
Mar 11 14:13:07 razor named[26422]: zone localhost/IN: loaded serial 42
Mar 11 14:13:07 razor named[26422]: running
Mar 11 14:13:12 razor named[26422]: shutting down: flushing changes
Mar 11 14:13:12 razor named[26422]: stopping command channel on 127.0.0.1#953
Mar 11 14:13:12 razor named[26422]: no longer listening on 127.0.0.1#53
Mar 11 14:13:12 razor named[26422]: no longer listening on 172.16.52.73#53
Mar 11 14:13:12 razor named[26422]: exiting
It depends on the particular use. For most cases it is probably possible to write code that works well enough with either library, without explicit checks for which one you have. I would have to see all of the code affected by these conditionals to suggest something, but it seems likely that it is pretty easy to do.
Looking at that source code, both sides of the #ifdef are valid under POSIX and ought to work with NPTL. I am making some assumptions about the rest of the named code that I have not read yet. mgalgoci said this breaks only on UP and not on SMP. I confirmed that using kernel-smp-2.4.20-6 there is no bug, and with kernel-2.4.20-6 it fails as reported. I am investigating.
I have ascertained that the named code in question is valid AFAICT. I am still investigating the failure. It does not happen when named is started with -f, only when it is running in its normal daemon mode.
Oh, I am using bind-9.2.2-2
Created attachment 90566
test case to demonstrate fork vs pthread_kill bug

The bind shutdown failure boils down to the bug demonstrated here. Bind calls pthread_self in the main thread, then forks, and the child then uses that saved value to call pthread_kill. pthread_kill becomes a tkill system call. Due to a kernel bug, the PID in the data structure is the fork parent's PID rather than the fork child's PID.
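The attachment itself is not reproduced in this thread. The following is a rough sketch of the same sequence (pthread_self, fork, pthread_kill in the child), not the attached test case. The function name is hypothetical, and it sends signal 0 instead of SIGTERM so that the probe only performs the error checks and delivers nothing. Build with -pthread.

```c
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Save pthread_self() before fork(), then have the child probe that saved
 * handle with pthread_kill(). Returns the child's pthread_kill() result
 * (0 on success, an error number such as ESRCH on failure), or -1 if
 * fork() or waitpid() itself failed.
 */
int pthread_kill_result_in_fork_child(void)
{
    pthread_t main_thread = pthread_self();  /* saved in the parent */
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the saved handle should still name the calling thread. */
        int result = pthread_kill(main_thread, 0);
        /* Report the result to the parent through the exit status. */
        _exit(result & 0xff);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

On a kernel without the bug the function should return 0. On an affected kernel a nonzero error number would be the expected symptom, matching the "No such process" message reported earlier.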
Tests were on kernel-2.4.20-6, and I looked at the current sources and found the bug still there and easy to see. The work of CLONE_CHILD_SETTID is done in schedule_tail (kernel/sched.c). In kernel/entry.S, ret_from_fork calls schedule_tail only #if CONFIG_SMP. Ergo, on a uniprocessor kernel, nothing happens. NPTL's fork uses CLONE_CHILD_SETTID to update the tid field in the pthread_t data structure in the new process's address space. It's the failure to update this field that makes pthread_kill call tkill on the wrong PID in the test case.
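To illustrate which value the tid field is supposed to hold in the child, here is a small user-space sketch. It does not read NPTL's cached field. It only asks the kernel for the child's thread ID and compares it with the child's PID, which is the ID a correct tkill would target. The function name is hypothetical, and the gettid system call is Linux-specific.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork and, in the child, compare the kernel thread ID with the process ID.
 * Returns 1 if they match, 0 if they differ, -1 if fork() or waitpid()
 * failed.
 */
int child_tid_matches_child_pid(void)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* The child has a single thread, so its tid should equal its pid. */
        pid_t tid = (pid_t)syscall(SYS_gettid);
        _exit(tid == getpid() ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

A stale tid field would instead hold the parent's ID, which is why the signal in the test case goes to the wrong place.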
I checked in this fix; it should be in the next kernel.
Turn off Red Hat Beta flag so I can reference this bug from bind bugs.