Bug 408321 - Process control signals cause pthread_cond_timedwait to return ETIMEDOUT
Summary: Process control signals cause pthread_cond_timedwait to return ETIMEDOUT
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: i386
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
: ---
Assignee: Luis Claudio R. Goncalves
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-12-03 06:20 UTC by David Holmes
Modified: 2008-02-27 19:57 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-01-09 13:43:22 UTC
Target Upstream Version:
Embargoed:


Attachments
Test program for pthread_cond_timedwait usage when signal arrives (4.42 KB, text/plain)
2007-12-03 06:20 UTC, David Holmes
patch to fix stack corruption in futex code (2.65 KB, patch)
2007-12-05 14:24 UTC, Steven Rostedt

Description David Holmes 2007-12-03 06:20:32 UTC
Description of problem:

If a thread is executing pthread_cond_timedwait and the process is "suspended"
using ctrl-Z, then moved into the background using "bg", the
pthread_cond_timedwait immediately returns with a status ETIMEDOUT. This is
incorrect behaviour.

How reproducible:

Always reproducible.

Steps to Reproduce:
1. Compile attached program
2. Execute program
3. ctrl-Z to suspend program
4. 'bg' to background program
  
Actual results:

rt-x4200-dev-2 /net/altair-ha1-nfs.east/export/ds01/d107/u2/dh198349 >
./pthread_cond_timedwait_test
Thread about to do 60 sec pthread_cond_timedwait - send signals
<ctrl-Z>
[1]+  Stopped                 ./pthread_cond_timedwait_test
rt-x4200-dev-2 /net/altair-ha1-nfs.east/export/ds01/d107/u2/dh198349 > bg
[1]+ ./pthread_cond_timedwait_test &
Error: pthread_cond_wait returned early: 5 secs, 752436309 nsecs


Expected results:

Application should pause for 60 seconds then report "Returned ok"

The signal should either have no effect on pthread_cond_timedwait, or it is
allowed to cause a 'spurious wakeup' (return 0). No effect is the most desirable
of course.

Comment 1 David Holmes 2007-12-03 06:20:32 UTC
Created attachment 275431
Test program for pthread_cond_timedwait usage when signal arrives

Comment 2 Luis Claudio R. Goncalves 2007-12-03 17:28:57 UTC
After reading the man page for pthread_cond_timedwait twice, I am under the
impression that something is missing from the example. The man page strongly
suggests using a boolean predicate along with pthread_cond_timedwait to
eliminate any ambiguity (it gives some examples).

Would you mind adding this predicate to the test and retrying it?

Excerpts from the man page:

DESCRIPTION:
...
       When using condition variables there is always a Boolean predicate
       involving shared variables associated with each condition wait that is
       true if the thread should proceed. Spurious wakeups from the
       pthread_cond_timedwait() or pthread_cond_wait() functions may occur.
       Since the return from pthread_cond_timedwait() or pthread_cond_wait()
       does not imply anything about the value of this predicate, the
       predicate should be re-evaluated upon such return.

...
RATIONALE
   Condition Wait Semantics
       It is important to note that when pthread_cond_wait() and
       pthread_cond_timedwait() return without error, the associated
       predicate may still be false. Similarly, when pthread_cond_timedwait()
       returns with the timeout error, the associated predicate may be true
       due to an unavoidable race between the expiration of the timeout and
       the predicate state change.

       Some implementations, particularly on a multi-processor, may sometimes
       cause multiple threads to wake up when the condition variable is
       signaled simultaneously on different processors.

       In general, whenever a condition wait returns, the thread has to
       re-evaluate the predicate associated with the condition wait to
       determine whether it can safely proceed, should wait again, or should
       declare a timeout. A return from the wait does not imply that the
       associated predicate is either true or false.

       It is thus recommended that a condition wait be enclosed in the
       equivalent of a "while loop" that checks the predicate.

...
   Timed Condition Wait
       The pthread_cond_timedwait() function allows an application to give up
       waiting for a particular condition after a given amount of time. An
       example of its use follows:

              (void) pthread_mutex_lock(&t.mn);
              t.waiters++;
              clock_gettime(CLOCK_REALTIME, &ts);
              ts.tv_sec += 5;
              rc = 0;
              while (! mypredicate(&t) && rc == 0)
                  rc = pthread_cond_timedwait(&t.cond, &t.mn, &ts);
              t.waiters--;
              if (rc == 0) setmystate(&t);
              (void) pthread_mutex_unlock(&t.mn);

       By making the timeout parameter absolute, it does not need to be
       recomputed each time the program checks its blocking predicate. If the
       timeout was relative, it would have to be recomputed before each call.
       This would be especially difficult since such code would need to take
       into account the possibility of extra wakeups that result from extra
       broadcasts or signals on the condition variable that occur before
       either the predicate is true or the timeout is due.



Comment 3 David Holmes 2007-12-03 22:58:06 UTC
Please read the example more carefully, there is a predicate - the wait is done
in a "while(!done)" loop.

Further, even without a predicate the example would still demonstrate the
problem. The predicate guards against conditions changing after signalling, or a
signal that wasn't indicating the particular change in condition that was being
waited for. In this case there is never a signal so the point is moot. If there
were a spurious wakeup then the return code would be zero - and the example
watches for spurious wakeups anyhow.



Comment 4 Luis Claudio R. Goncalves 2007-12-04 18:19:22 UTC
Oops, my bad.

I did some extra tests and have more information regarding the problem:

--[ STRACE LOG:
...
write(1, "Thread about to do 60 sec pthrea"..., 64Thread about to do 60 sec
pthread_cond_timedwait - send signals
) = 64
clock_gettime(CLOCK_MONOTONIC, {7412, 327460850}) = 0
futex(0x7fff3d7ab034, FUTEX_WAIT, 1, {59, 999738720}
[1]+  Stopped                 strace ./pthread_cond_timedwait_test
[lclaudio@lab tmp]$ bg
[1]+ strace ./pthread_cond_timedwait_test &
[lclaudio@lab tmp]$ ) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGCONT (Continued) @ 0 (0) ---
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection
timed out)
clock_gettime(CLOCK_MONOTONIC, {7418, 2348371}) = 0
write(1, "Error: pthread_cond_wait returne"..., 65Error: pthread_cond_wait
returned early: 6 secs, 675148801 nsecs
) = 65
exit_group(-1)                          = ?

---[ STAP LOG

BEGIN[3633]  op:0  lock:0x00007ffff90b6934 
END[3633]    op:0  lock:0x00007ffff90b6934  return:-516 

BEGIN[3803]  op:0  lock:0x00007fff2692bdf4  
END[3803]    op:0  lock:0x00007fff2692bdf4  return:-516

From linux-2.6.21.x86_64/include/linux/errno.h, line 13:
             #define ERESTART_RESTARTBLOCK 516 /* restart by calling sys_restart_syscall */



Comment 5 Steven Rostedt 2007-12-05 14:24:19 UTC
Created attachment 278331
patch to fix stack corruption in futex code

Using David's test I've discovered that the bug is also in current mainline. So
I started to debug there. I discovered that the restart block for signal
handling in the futex code was using a pointer to a variable on the stack,
which would be discarded on return of the function, and the retry would then
have corrupted data.

The full thread on that discussion and final patches are here:

  http://lkml.org/lkml/2007/12/4/332

This is a port of that patch to the RHEL-RT kernel.

Comment 6 Clark Williams 2007-12-14 15:24:46 UTC
Patch added to kernel-rt-2.6.21-58.el5rt

Comment 7 Luis Claudio R. Goncalves 2008-01-09 13:43:22 UTC
David, did the fix work for your tests?
I will close this bug as "NEXTRELEASE" as we will soon release a new kernel,
with this fix, in the rt partners repository.

Please, feel free to reopen this if necessary.

Comment 8 David Holmes 2008-01-11 00:22:17 UTC
Luis, I haven't had access to updated bits to test the fix. I'm waiting for
kernel-rt-2.6.21-58.el5rt to be available.

David

