Bug 85896 - uniprocessor kernel ignores CLONE_CHILD_SETTID
Alias: None
Product: Red Hat Linux Beta
Classification: Retired
Component: kernel
Version: beta5
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Ingo Molnar
QA Contact: Ben Levenson
Depends On:
Blocks: 87976
Reported: 2003-03-10 15:31 UTC by Matthew Galgoci
Modified: 2007-04-18 16:51 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2003-03-14 07:51:50 UTC

Attachments
test case to demonstrate fork vs pthread_kill bug (1.56 KB, text/plain)
2003-03-12 00:19 UTC, Roland McGrath

Description Matthew Galgoci 2003-03-10 15:31:48 UTC
Description of problem:

Bind does not shut down properly.

Version-Release number of selected component (if applicable):

Phoebe beta5
bind version 9.2.1-16

How reproducible:

Happens every time

Steps to Reproduce:
1. Install bind and caching-nameserver
2. Start bind
3. Stop bind (look in /var/log/messages)
Actual results:

Bind does not stop

Expected results:

Bind should stop

Additional info:

Comment 1 Matthew Galgoci 2003-03-10 15:38:16 UTC
I've added Jakub and Foo to the cc list because the error message in 
/var/log/messages would seem to indicate a failed pthread_kill()

Comment 2 Matthew Galgoci 2003-03-10 15:47:30 UTC
Same error with bind 9.2.2-2

Comment 3 Daniel Walsh 2003-03-11 15:02:53 UTC
This looks like a thread signalling problem.  I have sent the following message
out to try to verify that it is a thread problem.

We have a major bug in named thread signal handling.  After you try to stop
named using "service named stop", the service never stops and the application
reports:

Mar 10 18:25:30 danlaptop lt-named[28582]: app.c:568: unexpected error:
Mar 10 18:25:30 danlaptop lt-named[28582]: isc_app_shutdown() pthread_kill: No such process

The code causing the problem is the following.

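       int result;

       /* Ask the main thread to shut down by sending it SIGTERM. */
       result = pthread_kill(main_thread, SIGTERM);
       if (result != 0) {
           /* pthread_kill returns an errno value rather than setting errno. */
           isc__strerror(result, strbuf, sizeof(strbuf));
           UNEXPECTED_ERROR(__FILE__, __LINE__,
                    "isc_app_shutdown() pthread_kill: %s",
                    strbuf);
           return (ISC_R_UNEXPECTED);
       }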

I have checked main_thread and it looks correct.  Another code path uses the
SIGHUP signal, and that works with main_thread.  The other thing I found is
that if I run this application without forking, it seems to work correctly.
Any ideas?


Comment 4 Ulrich Drepper 2003-03-11 17:50:48 UTC
I haven't looked at the sources (yet) but seeing this #ifdef makes the code
suspicious.  It is probably a work-around for the brokenness of LinuxThreads.
NPTL isn't broken when it comes to signal handling and therefore this patch
might have negative effects.

Try a bind with HAVE_LINUXTHREADS *not* defined.

Comment 5 Matthew Galgoci 2003-03-11 18:02:55 UTC
Undefining HAVE_LINUXTHREADS and rebuilding results in the following error:

Mar  9 21:59:49 user-2ivf402 named[7712]: Mar 09 21:59:49.873general: error: app.c:569: unexpected error:
Mar  9 21:59:49 user-2ivf402 named[7712]: Mar 09 21:59:49.873general: error: isc_app_shutdown() kill: Operation not permitted

Comment 6 Ulrich Drepper 2003-03-11 18:41:56 UTC
This makes the whole thing even more suspicious.

Is there, for platforms like Solaris, another #define to be used?  Something
like "real POSIX threads"?

Comment 7 Matthew Galgoci 2003-03-11 18:51:12 UTC
There is such an option. I am recompiling bind with the configure patch
below, which should enable Solaris-style POSIX threads.

--- bind-9.2.2/configure.orig   2003-03-11 13:46:33.000000000 -0500
+++ bind-9.2.2/configure        2003-03-11 13:46:54.000000000 -0500
@@ -5478,7 +5478,7 @@
                        cat >>confdefs.h <<\_ACEOF

Comment 8 Matthew Galgoci 2003-03-11 19:16:58 UTC
seems to work properly.

Mar 11 14:13:07 razor named[26422]: starting BIND 9.2.2 -u named
Mar 11 14:13:07 razor named[26422]: using 2 CPUs
Mar 11 14:13:07 razor named[26422]: loading configuration from '/etc/named.conf'
Mar 11 14:13:07 razor named: named startup succeeded
Mar 11 14:13:07 razor named[26422]: no IPv6 interfaces found
Mar 11 14:13:07 razor named[26422]: listening on IPv4 interface lo,
Mar 11 14:13:07 razor named[26422]: listening on IPv4 interface eth0,
Mar 11 14:13:07 razor named[26422]: command channel listening on
Mar 11 14:13:07 razor named[26422]: zone 0.0.127.in-addr.arpa/IN: loaded serial
Mar 11 14:13:07 razor named[26422]: zone localhost/IN: loaded serial 42
Mar 11 14:13:07 razor named[26422]: running
Mar 11 14:13:12 razor named[26422]: shutting down: flushing changes
Mar 11 14:13:12 razor named[26422]: stopping command channel on
Mar 11 14:13:12 razor named[26422]: no longer listening on
Mar 11 14:13:12 razor named[26422]: no longer listening on
Mar 11 14:13:12 razor named[26422]: exiting

Comment 10 Roland McGrath 2003-03-11 20:48:10 UTC
It depends on the particular use.  For most cases it is probably possible to
write code that works well enough with either library, without explicit checks
for which one you have.  I would have to see all of the code affected by these
conditionals to suggest something, but it seems likely that it is pretty easy to do.

Comment 11 Roland McGrath 2003-03-11 21:37:35 UTC
Looking at that source code, both sides of the #ifdef are valid under POSIX
and ought to work with NPTL.  I am making some assumptions about the rest of
the named code that I have not read yet.

mgalgoci said this breaks only on UP and not on SMP.
I confirmed that using kernel-smp-2.4.20-6 there is no bug,
and with kernel-2.4.20-6 it fails as reported.  I am investigating.

Comment 12 Roland McGrath 2003-03-11 23:22:21 UTC
I have ascertained that the named code in question is valid AFAICT.
I am still investigating the failure.  It does not happen when named is
started with -f, only when it is running in its normal daemon mode.

Comment 13 Roland McGrath 2003-03-11 23:23:33 UTC
Oh, I am using bind-9.2.2-2

Comment 14 Roland McGrath 2003-03-12 00:19:28 UTC
Created attachment 90566
test case to demonstrate fork vs pthread_kill bug

The bind shutdown failure boils down to the bug demonstrated here.
What bind does is pthread_self in the main thread, then fork, then
the child uses that saved value to call pthread_kill.

pthread_kill becomes a tkill system call on the ID stored in the thread's
data structure.  Due to a kernel bug, that stored ID is the fork parent's
PID rather than the fork child's PID.

Comment 15 Roland McGrath 2003-03-12 00:25:48 UTC
Tests were on kernel-2.4.20-6, and I looked at the current sources and found the
bug still there and easy to see.

The work of CLONE_CHILD_SETTID is done in schedule_tail (kernel/sched.c).
In kernel/entry.S, ret_from_fork calls schedule_tail only #if CONFIG_SMP.
Ergo, on a uniprocessor kernel, nothing happens.

NPTL's fork uses CLONE_CHILD_SETTID to update the tid field in the pthread_t
data structure in the new process's address space.
It's the failure to update this field that makes pthread_kill call tkill
on the wrong PID in the test case.
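
As a rough user-space illustration of what CLONE_CHILD_SETTID is expected to do, the sketch below issues a fork-like raw clone and lets the child check that the kernel stored the child's own ID in its copy of a variable. This is not NPTL's fork code. The names tid_slot and child_settid_updates_slot are hypothetical, and the raw clone argument order is architecture-dependent: the sketch assumes a layout where the child-TID pointer is the fourth or fifth argument (as on x86 and ARM) and passes it in both positions, which should be harmless because the TLS argument is ignored without CLONE_SETTLS.

```c
#include <linux/sched.h>    /* CLONE_CHILD_SETTID */
#include <signal.h>         /* SIGCHLD */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stands in for the tid field that NPTL keeps in the thread's data structure. */
static volatile pid_t tid_slot;

/* Hypothetical helper.
 * Returns 0 if the child's copy of tid_slot holds the child's own ID,
 * 1 if it still holds a stale value, and -1 on a setup failure. */
int child_settid_updates_slot(void)
{
    /* Start with the parent's ID, i.e. the value that would be stale. */
    tid_slot = (pid_t)syscall(SYS_gettid);

    /* Fork-like clone: no CLONE_VM, stack 0 means "keep using a copy of
     * the current stack".  The pointer is passed as both the 4th and 5th
     * argument because the two trailing arguments swap between
     * architectures (assumption stated in the text above). */
    long pid = syscall(SYS_clone,
                       (unsigned long)(CLONE_CHILD_SETTID | SIGCHLD),
                       0UL, (void *)0, &tid_slot, &tid_slot);
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: only raw system calls here, then leave immediately. */
        pid_t me = (pid_t)syscall(SYS_gettid);
        _exit(tid_slot == me ? 0 : 1);
    }

    int status;
    if (waitpid((pid_t)pid, &status, 0) != (pid_t)pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

On a kernel that honors the flag the helper should return 0; a kernel that skipped the update would leave the parent's ID in the child's copy, and the helper would return 1.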

Comment 16 Ingo Molnar 2003-03-14 07:51:50 UTC
I checked in a fix for this; it should be in the next kernel.

Comment 17 Daniel Walsh 2003-04-07 20:45:47 UTC
Turn off Red Hat Beta flag so I can reference this bug from bind bugs.
