Bug 85896 - uniprocessor kernel ignores CLONE_CHILD_SETTID

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Linux Beta
Classification:	Retired
Component:	kernel
Sub Component:
Version:	beta5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	Ben Levenson
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	87976

Reported:	2003-03-10 15:31 UTC by Matthew Galgoci
Modified:	2007-04-18 16:51 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-03-14 07:51:50 UTC
Embargoed:

Attachments
test case to demonstrate fork vs pthread_kill bug (1.56 KB, text/plain) 2003-03-12 00:19 UTC, Roland McGrath

Description Matthew Galgoci 2003-03-10 15:31:48 UTC

Description of problem:

Bind does not shut down properly.

Version-Release number of selected component (if applicable):

Phoebe beta5
bind version 9.2.1-16

How reproducible:

Happens every time

Steps to Reproduce:
1. Install bind and caching-nameserver
2. Start bind
3. Stop bind (look in /var/log/messages)
    
Actual results:

Bind does not stop

Expected results:

Bind should stop

Additional info:

Comment 1 Matthew Galgoci 2003-03-10 15:38:16 UTC

I've added Jakub and Foo to the CC list because the error message in
/var/log/messages seems to indicate a failed pthread_kill().

Comment 2 Matthew Galgoci 2003-03-10 15:47:30 UTC

Same error with bind 9.2.2-2

Comment 3 Daniel Walsh 2003-03-11 15:02:53 UTC

This looks like a thread signalling problem.  I have sent the following message
out to try to verify it as a thread problem.

We have a major bug in named thread signal handling.  After you try to stop the
named application using "service named stop", the service never stops and the
application reports:

Mar 10 18:25:30 danlaptop lt-named[28582]: app.c:568: unexpected error:
Mar 10 18:25:30 danlaptop lt-named[28582]: isc_app_shutdown() pthread_kill: No such process

The code causing the problem is the following.

#ifdef HAVE_LINUXTHREADS
       int result;

       result = pthread_kill(main_thread, SIGTERM);
       if (result != 0) {
           isc__strerror(result, strbuf, sizeof(strbuf));
           UNEXPECTED_ERROR(__FILE__, __LINE__,
                    "isc_app_shutdown() pthread_kill: %s",
                    strbuf);
           return (ISC_R_UNEXPECTED);
       }
#else

I have checked main_thread and it looks correct.  Another path through the code
sends SIGHUP to main_thread, and that works.  I also found that if I run the
application without forking, it seems to work correctly.  Any ideas?

Dan

Comment 4 Ulrich Drepper 2003-03-11 17:50:48 UTC

I haven't looked at the sources (yet) but seeing this #ifdef makes the code
suspicious.  It is probably a work-around for the brokenness of LinuxThreads.
NPTL isn't broken when it comes to signal handling and therefore this patch
might have negative effects.

Try a bind with HAVE_LINUXTHREADS *not* defined.

Comment 5 Matthew Galgoci 2003-03-11 18:02:55 UTC

Undefining HAVE_LINUXTHREADS and rebuilding results in the following error
message:

Mar  9 21:59:49 user-2ivf402 named[7712]: Mar 09 21:59:49.873general: error: app.c:569: unexpected error:
Mar  9 21:59:49 user-2ivf402 named[7712]: Mar 09 21:59:49.873general: error: isc_app_shutdown() kill: Operation not permitted

Comment 6 Ulrich Drepper 2003-03-11 18:41:56 UTC

This makes the whole thing even more suspicious.

Is there, for platforms like Solaris, another #define to be used?  Something
like "real POSIX threads"?

Comment 7 Matthew Galgoci 2003-03-11 18:51:12 UTC

There is such an option. I am recompiling bind with the configure patch
below, which should enable Solaris-style POSIX threads.

--- bind-9.2.2/configure.orig   2003-03-11 13:46:33.000000000 -0500
+++ bind-9.2.2/configure        2003-03-11 13:46:54.000000000 -0500
@@ -5478,7 +5478,7 @@
                #
                *-linux*)
                        cat >>confdefs.h <<\_ACEOF
-#define HAVE_LINUXTHREADS 1
+#define _POSIX_PTHREAD_SEMANTICS 1
 _ACEOF

Comment 8 Matthew Galgoci 2003-03-11 19:16:58 UTC

HA! Defining _POSIX_PTHREAD_SEMANTICS and patching out HAVE_LINUXTHREADS
seems to work properly.

Mar 11 14:13:07 razor named[26422]: starting BIND 9.2.2 -u named
Mar 11 14:13:07 razor named[26422]: using 2 CPUs
Mar 11 14:13:07 razor named[26422]: loading configuration from '/etc/named.conf'
Mar 11 14:13:07 razor named: named startup succeeded
Mar 11 14:13:07 razor named[26422]: no IPv6 interfaces found
Mar 11 14:13:07 razor named[26422]: listening on IPv4 interface lo, 127.0.0.1#53
Mar 11 14:13:07 razor named[26422]: listening on IPv4 interface eth0, 172.16.52.73#53
Mar 11 14:13:07 razor named[26422]: command channel listening on 127.0.0.1#953
Mar 11 14:13:07 razor named[26422]: zone 0.0.127.in-addr.arpa/IN: loaded serial 1997022700
Mar 11 14:13:07 razor named[26422]: zone localhost/IN: loaded serial 42
Mar 11 14:13:07 razor named[26422]: running
Mar 11 14:13:12 razor named[26422]: shutting down: flushing changes
Mar 11 14:13:12 razor named[26422]: stopping command channel on 127.0.0.1#953
Mar 11 14:13:12 razor named[26422]: no longer listening on 127.0.0.1#53
Mar 11 14:13:12 razor named[26422]: no longer listening on 172.16.52.73#53
Mar 11 14:13:12 razor named[26422]: exiting

Comment 10 Roland McGrath 2003-03-11 20:48:10 UTC

It depends on the particular use.  For most cases it is probably possible to
write code that works well enough with either library, without explicit checks
for which one you have.  I would have to see all of the code affected by these
conditionals to suggest something, but it seems likely that it is pretty easy to do.

Comment 11 Roland McGrath 2003-03-11 21:37:35 UTC

Looking at that source code, both sides of the #ifdef are valid under POSIX
and ought to work with NPTL.  I am making some assumptions about the rest of
the named code that I have not read yet.

mgalgoci said this breaks only on UP and not on SMP.
I confirmed that using kernel-smp-2.4.20-6 there is no bug,
and with kernel-2.4.20-6 it fails as reported.  I am investigating.

Comment 12 Roland McGrath 2003-03-11 23:22:21 UTC

I have ascertained that the named code in question is valid AFAICT.
I am still investigating the failure.  It does not happen when named is
started with -f, only when it is running in its normal daemon mode.

Comment 13 Roland McGrath 2003-03-11 23:23:33 UTC

Oh, I am using bind-9.2.2-2

Comment 14 Roland McGrath 2003-03-12 00:19:28 UTC

Created attachment 90566
test case to demonstrate fork vs pthread_kill bug

The bind shutdown failure boils down to the bug demonstrated here.
What bind does is pthread_self in the main thread, then fork, then
the child uses that saved value to call pthread_kill.

pthread_kill becomes a tkill system call.
Due to a kernel bug, the PID in the data structure is the fork parent's
PID rather than the fork child's PID.
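
A minimal sketch of that pattern follows. It is not the attached test case; the helper name pthread_kill_in_fork_child is hypothetical, and signal 0 is used in place of SIGTERM (an assumption made here so that nothing is actually delivered and only the lookup is exercised). On a kernel without the bug the helper is expected to return 0; on an affected uniprocessor kernel the child would be expected to report an error such as ESRCH.

```c
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the attached test case.
 * Saves pthread_self() in the parent, forks, and has the child call
 * pthread_kill() on the saved value, which is the sequence described above.
 * Returns the error number pthread_kill() reported in the child
 * (0 on success), or -1 if fork() or waitpid() failed.
 */
int pthread_kill_in_fork_child(void)
{
    pthread_t saved = pthread_self();   /* taken before fork, as bind does */

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Signal 0 only checks that the target exists; nothing is sent. */
        int rc = pthread_kill(saved, 0);
        _exit(rc);                      /* errno values such as ESRCH fit in an exit status */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling the helper from main and printing strerror() of a non-zero result would give output comparable to the "No such process" message in the log. Build with -pthread.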

Comment 15 Roland McGrath 2003-03-12 00:25:48 UTC

Tests were on kernel-2.4.20-6. I also looked at the current sources and the
bug is still there and easy to see.

The work of CLONE_CHILD_SETTID is done in schedule_tail (kernel/sched.c).
In kernel/entry.S, ret_from_fork calls schedule_tail only #if CONFIG_SMP.
Ergo, on a uniprocessor kernel, nothing happens.

NPTL's fork uses CLONE_CHILD_SETTID to update the tid field in the pthread_t
data structure in the new process's address space.
It's the failure to update this field that makes pthread_kill call tkill
on the wrong PID in the test case.
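
The effect of CLONE_CHILD_SETTID can be checked directly from user space. The sketch below is an illustration only, not NPTL's fork and not the kernel fix: the names tid_slot, child_fn and child_settid_took_effect are hypothetical, and it assumes glibc's clone() wrapper and a downward-growing stack. The static variable stands in for the tid field; it is preloaded with the parent's PID, and the child checks whether the kernel overwrote it with the child's own PID.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for clone() and CLONE_CHILD_SETTID */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the tid field in the thread descriptor. */
static pid_t tid_slot;

/* Runs in the child: exit status 0 if the kernel stored our PID in the slot. */
static int child_fn(void *arg)
{
    (void)arg;
    pid_t self = (pid_t)syscall(SYS_getpid);   /* ask the kernel directly */
    return tid_slot == self ? 0 : 1;
}

/*
 * Hypothetical helper.
 * Returns 1 if CLONE_CHILD_SETTID updated the slot in the child,
 * 0 if the child still saw the parent's PID, -1 on setup failure.
 */
int child_settid_took_effect(void)
{
    enum { STACK_SIZE = 64 * 1024 };
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    tid_slot = getpid();       /* stale value: the parent's PID */

    /* No CLONE_VM: the child gets its own copy of memory, as after fork.
       The stack pointer is the top of the buffer (stack grows down). */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_CHILD_SETTID | SIGCHLD,
                      NULL, NULL, NULL, &tid_slot);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

A result of 0 would correspond to the behaviour described above, where the field keeps the parent's PID and a later tkill goes to the wrong process.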

Comment 16 Ingo Molnar 2003-03-14 07:51:50 UTC

I checked in this fix; it should be in the next kernel.

Comment 17 Daniel Walsh 2003-04-07 20:45:47 UTC

Turn off Red Hat Beta flag so I can reference this bug from bind bugs.
