Bug 465156 - sys_futex(FUTEX_CMP_REQUEUE,...) support for normal to PI futexes (aka requeue pi)
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: x86_64
OS: All
Priority: medium
Severity: low
Assigned To: John Kacur
Depends On:
Blocks:
Reported: 2008-10-01 16:32 EDT by IBM Bug Proxy
Modified: 2010-08-25 14:53 EDT
CC List: 2 users

Doc Type: Bug Fix
Last Closed: 2010-08-25 14:53:17 EDT


Attachments
Example of priority inversion deadlock caused by condvars (3.78 KB, text/plain) - 2008-10-01 16:32 EDT, IBM Bug Proxy
Partial Patch - Not for inclusion (10.18 KB, text/plain) - 2008-12-20 11:40 EST, IBM Bug Proxy
Partial Patch - Not for inclusion (V2) (30.44 KB, text/plain) - 2009-01-01 15:20 EST, IBM Bug Proxy
test_requeue_pi.c (6.11 KB, text/plain) - 2009-02-13 18:11 EST, IBM Bug Proxy
requeue_pi test case (7.31 KB, text/plain) - 2009-02-27 00:51 EST, IBM Bug Proxy
Ftrace debugging patch for requeue_pi (8.32 KB, text/plain) - 2009-03-06 19:51 EST, IBM Bug Proxy
Wait state transition diagram (127.83 KB, image/png) - 2009-03-18 19:01 EDT, IBM Bug Proxy
requeue-pi testcase (7.44 KB, text/plain) - 2009-03-24 04:51 EDT, IBM Bug Proxy
requeue-pi testcase w/ 3rd party blockers and timeouts (9.19 KB, text/plain) - 2009-03-25 19:51 EDT, IBM Bug Proxy
requeue_pi testcase w/ absolute timeouts (9.30 KB, text/x-csrc) - 2009-03-26 19:41 EDT, IBM Bug Proxy
kernel crash in machine_kexec (36.88 KB, image/png) - 2009-06-11 12:11 EDT, IBM Bug Proxy

Description IBM Bug Proxy 2008-10-01 16:32:02 EDT
=Comment: #0=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
The current implementation of sys_futex(FUTEX_CMP_REQUEUE,...) doesn't support
moving tasks from normal futexes to PI futexes.  This path is required for
proper pthread condition variable support in glibc.  Current glibc (2.7 cvs)
bypasses the requeue code and simply does a wake_all on all the tasks queued on
the condition variable.

First we will define a design document to ensure we have the glibc and kernel
interactions thought out and done right.  With that, this piece (the kernel
side) should be fairly straightforward.

Testing can be accomplished via a simple C test that makes some direct
sys_futex() calls.  This will enable us to ensure the kernel side works while we
then work towards updating glibc to use it.
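For reference, a minimal sketch of such a direct sys_futex() call (not the actual
test; glibc has no futex() wrapper, so syscall(2) is used, and only the existing
non-PI FUTEX_CMP_REQUEUE op is shown, since the PI-aware ops proposed here do not
exist yet):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>
#include <stdio.h>

/* Thin wrapper: the 4th argument is either a timeout pointer or a count,
   depending on the op, so it is passed as an unsigned long here. */
static long futex(unsigned int *uaddr, int op, unsigned int val,
                  unsigned long val2, unsigned int *uaddr2, unsigned int val3)
{
        return syscall(SYS_futex, uaddr, op, val, val2, uaddr2, val3);
}

int main(void)
{
        static unsigned int wait_futex;   /* stands in for cond->__data.__futex */
        static unsigned int target_futex; /* stands in for the requeue target   */

        /* Wake one waiter on wait_futex and requeue up to INT_MAX of the
           remaining waiters onto target_futex, provided wait_futex still
           contains 0 (the FUTEX_CMP_REQUEUE comparison value). */
        long n = futex(&wait_futex, FUTEX_CMP_REQUEUE, 1, INT_MAX,
                       &target_futex, 0);
        if (n < 0)
                perror("futex");
        else
                printf("woke/requeued %ld waiters\n", n);
        return 0;
}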
=Comment: #2=================================================
John G. Stultz <jstultz@us.ibm.com> - 

Example of priority inversion deadlock caused by condvars

From mail sent to lkml about this topic:

 So we've been seeing application hangs with a very threaded (~8k
threads) realtime java test. After a fair amount of debugging we found
most of the SCHED_FIFO threads are blocked in futex_wait(). This raised
some alarm, since futex_wait isn't priority-inheritance aware.

After seeing what was going on, Dino came up with a possible deadlock
case in the pthread_cond_wait() code.

The problem, as I understand it, assuming there is only one cpu, is if a
low priority thread is going to call pthread_cond_wait(), it takes the
associated PI mutex, and calls the function. The glibc implementation
acquires the condvar's internal non-PI lock, releases the PI mutex and
tries to block on futex_wait().

However, if a medium priority cpu hog and a high priority thread start up while
the low priority thread holds the mutex, the low priority thread will be
boosted until it releases that mutex, but not for long enough for it to
release the condvar's internal lock (since the internal lock is not
priority inherited).

Then the high priority thread will acquire the mutex, and try to acquire
the condvar's internal lock (which is still held). However, since we
also have a medium prio cpu hog, it will block the low priority thread
from running, and thus block it from releasing the lock.

And then we're deadlocked.

Thomas mentioned this is a known problem, but I wanted to send this
example out so maybe others might become aware.

The attached test illustrates this hang as described above when bound to
a single cpu. I believe it's correct, but these sorts of tests often have
their own bugs that create false positives, so please forgive me and let
me know if you see any problems. :)

Many thanks to Dino, Ankita and Sripathi for helping to sort out this
issue.

To run:
	./pthread_cond_hang		  => will PASS (on SMP)
	taskset -c 0 ./pthread_cond_hang  => will HANG
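
For reference, a minimal sketch (not the attached pthread_cond_hang test; the
priority values and thread bodies are placeholders) of how the single-CPU binding
and the three SCHED_FIFO threads described above can be set up programmatically
instead of via taskset:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Create a SCHED_FIFO thread at the given priority (needs privileges). */
static pthread_t spawn_fifo(void *(*fn)(void *), int prio)
{
        pthread_t t;
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = prio };

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);
        if (pthread_create(&t, &attr, fn, NULL) != 0) {
                fprintf(stderr, "pthread_create failed (SCHED_FIFO needs privileges)\n");
                exit(1);
        }
        pthread_attr_destroy(&attr);
        return t;
}

static void *thread_fn(void *arg)
{
        /* In the real test the low thread calls pthread_cond_wait, the medium
           thread spins, and the high thread also waits; here they just exit. */
        return arg;
}

int main(void)
{
        cpu_set_t cpus;

        CPU_ZERO(&cpus);
        CPU_SET(0, &cpus);
        sched_setaffinity(0, sizeof(cpus), &cpus);  /* inherited by the threads */

        pthread_t low  = spawn_fifo(thread_fn, 10);
        pthread_t med  = spawn_fifo(thread_fn, 20);
        pthread_t high = spawn_fifo(thread_fn, 30);

        pthread_join(low, NULL);
        pthread_join(med, NULL);
        pthread_join(high, NULL);
        return 0;
}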

=Comment: #3=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 
I think we are mixing two issues here. The problem described in the opening
comment of this bug is different from the one described in comment #2. Mixing
them is not necessarily a mistake, maybe it is even desirable, but I want to make
sure everyone tries to understand these two problems clearly. Hence I am
elaborating on them here.

pthread_cond_t has two 'futexes' in it:

typedef union
{
  struct
  {
    int __lock;   <===== Used for serialization. 
    unsigned int __futex;  <==== Used for sleeping in the kernel.
    __extension__ unsigned long long int __total_seq;
<snip>
  } __data;
<snip>
} pthread_cond_t;

A call to pthread_cond_wait results in the following actions:

pthread_cond_wait (cond, mutex)
{
	Hold (cond->data.__lock)
	Release (mutex)
	Do some house keeping activities
	Release (cond->data.__lock)
	futex_wait (cond->data.__futex) <=== Makes sys_futex call to sleep in kernel
	Hold (cond->data.__lock)
	Do some house keeping activities
	Release (cond->data.__lock)
	Hold (mutex)
	return
}

The requeue operation refers to requeuing waiters from cond->data.__futex to
another futex. This could be the external mutex or it could be
cond->data.__lock. Currently requeue works only if the mutex is non-PI. Gilles
was trying to add the functionality of requeuing to PI mutexes.

The scenario we saw with the OAK hang was slightly different. In this case, a low
prio thread holds the cond->data.__lock (before futex_wait), but it gets
preempted by the medium priority thread before it releases this lock. The high
prio thread now wants to do cond_wait on this condvar, but it cannot get
cond->data.__lock because the low prio thread holds this non-PI lock and
cannot run. This is a much simpler, more straightforward priority inversion
scenario. Technically, this has nothing to do with requeue.

One of the ideas is to solve both these problems at once with an alternative
design. Dinakar, Ankita and I had a discussion about this, but we have not
reached a conclusion yet.

Meanwhile, we will be very interested in knowing about the discussions Darren is
having with tglx. Are they happening over mail or on some IRC channel?
=Comment: #4=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
So I know everyone is very interested in this bug.  I haven't been holding
information back, just trying to process it before I posted it here.  I spoke
with tglx, ghaskins, and rostedt at LPC about this.  Following that discussion
tglx sent me an old mail he sent to Ulrich regarding an approach.

As Sripathi pointed out, the problem described in Comment #2 could be solved by
making the cond.__data.__lock a PI mutex, avoiding the priority inversion.
There is a larger issue, however: glibc can't requeue to a PI mutex because
the kernel doesn't support that, and on a broadcast it instead just wakes every
task and lets them race for the mutex.  By fixing the latter, we should be able
to fix the former "for free".

According to tglx, the core of the requeue problem is that the implementation of
the rtmutex in the kernel and the associated PI boosting mechanism require the
mutex to have an owner iff there are waiters.  The current implementation of
requeue_futex places all but one task as waiters on the outer futex, but returns
to userspace without assigning the one woken task as the owner.  When it returns
to userspace, the pthread_cond_broadcast() call returns for the requeue syscall
and releases the outer mutex.  The outer mutex now has waiters but no owner.  At
which point we have two scenarios, either the woken thread is still in flight or
is already enqueued as a waiter on the outer futex.

Quoting from tglx's mail:

"
In flight case:
The first requeued waiter is woken and is not allowed to acquire the
outer (pi-futex) before returning to user space. This includes that it
also is not allowed to take the ownership of the kernel side rtmutex,
which would leave the rtmutex without an owner. That's not bad when
there are no more waiters, but when there are more (maybe requeued)
waiters, then we run into a real problem. In fact we would need to wake
all the requeued waiters in one go in order not to let the rtmutex
without owner.

enqueued case:
We assume that the woken thread, which is now really blocked on the
outer futex, is the highest priority thread and can take the ownership
of the rtmutex and the outer futex, but we run into the same problem as
above once this thread unlocks the outer futex and one of the requeued
waiters is woken.
"

The BULL folks prepared patches that set an "artificial owner", but the concern
is that it would be fragile and adds complexity to an already very complex bit
of code, likely breaking the robust semantics.

So, 3 pieces to fix:

1) Implement futex_requeue_pi in the kernel (rather than adding more logic to
futex_requeue; we want to keep the paths separate).

2) Update the glibc routines to allow requeue of PI mutexes (we currently check
if PI and just wake all since requeue is known not to work).

3) Update the pthread_cond_attr structure and initializers to allow for a PI
flag.  This will be used at the construction of the condition variable to make
it PI aware from the beginning (see the sketch after this list).
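
A purely hypothetical sketch of what piece 3 might look like from the application
side; the attribute function below and the use of PTHREAD_PRIO_INHERIT for a
condvar do not exist in glibc and are only illustrative of the intent:

pthread_condattr_t attr;
pthread_cond_t cond;

pthread_condattr_init(&attr);
/* hypothetical, non-existent API: mark the condvar PI aware at construction */
pthread_condattr_setprotocol_np(&attr, PTHREAD_PRIO_INHERIT);
pthread_cond_init(&cond, &attr);
pthread_condattr_destroy(&attr);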

Thomas suggests exploring the possibility of requeuing to cond.__data.__lock,
rather than the outer mutex, at the same time initializing the __lock to be PI
aware if the attr specifies it (this eliminates the prio inversion from comment
#2).  In the PI case the __lock would need to then be held across the requeue
syscall (to ensure the underlying rtmutex always has an owner).

The next bit I have some questions on, but include for completeness:

"In the PI case we would always have an assigned owner for the futex and
the kernel side rtmutex. The only difference in the PI case would be,
that cond->__data.__lock is taken in kernel space on unlock, which is
quite simple to keep apart on the user space side.

In the cond_wait code I think something like:

        if (!pi || !pi_owner(cond->__data.__futex))
                lock(cond->__data.__futex)

would be sufficient.

We could probably optimize requeue_pi so that we requeue _ALL_ waiters
to cond->__data.__lock and avoid the contention of cond->__data.__lock
of the first woken thread completely. It would be woken on unlock of
cond->__data.__lock and get cond->__data.__lock right away.
"

I'm concerned that by requeuing to the __lock instead of the mutex, we will
delay future cond_wait() callers from getting enqueued on the cond var, and
therefore cause them to miss a cond_signal or a cond_broadcast.  I'm running this
by tglx now to see if I'm blowing smoke or not.
=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 
I think one of the issues with requeuing to __data.__lock (besides the delay
that you mentioned) would be the double semantics associated with the lock.
Currently, as Sripathi mentioned, this lock is meant to be used as a
synchronization primitive for accessing glibc internal cond var fields. So one
of the problem scenarios I could think of is this:

Like in OAK testcase scenario, we have seen a lot of contention for
__data.__lock. Considering this, there could be a case where we have a number of
tasks queued up on it (assuming it would be a priority-ordered list in the pi
case). So the tasks queued up here have not yet gone into the kernel to
_actually_ sleep on the cond_var.. they are simply queued due to serialization
in accessing the glibc variables. Now, parallely, if another thread does a
pthread_cond_broadcast (assuming the priority of the thread doing this is higher
than all the others already waiting on the __lock), it will get hold of this
lock and requeue the waiters on the internal futex (__data.__futex) to this
lock. Now, two types of tasks will end up being queued on the __data.__lock - 1)
Task still waiting to access the glibc variable and then sleep 2) Tasks already
sleeping on the cond_var, ie. on __futex. Now, if among all these tasks, the
highest priority one happens to be of type (1), it would get woken up on a
broadcast... is it not ? If yes, wouldn't that be a violation of the semantics ?
So the task of type 1 will wake only to go back to sleeping or waiting on the
cond var? I could be missing something really subtle here, but still thought it
might be apt to point it out (it will be a good learning experience at least!).
=Comment: #6=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
So some corrections to comment and answers to my and Ankita's questions.

Glibc pthread_cond_broadcast() does *not* currently requeue from futex to mutex
as I thought.  It already requeues from __futex to __lock.  The key aspect we
have to change is that for PI, __lock needs to be held when requeue and wait
return from the kernel.

As to contention on __lock, turns out this isn't really an issue after all (and
not any different than how glibc works today in the non-pi case).  __lock will
be a PI lock, so the highest priority task will get the lock in all cases.  The
double usage Ankita mentioned (for access to cond data and to queue up for
__mutex) is a bit odd, but doesn't pose any practical problems.  Note that the
only time that new tasks will be blocked from getting __lock inside a cond_wait
call is the period of time following a broadcast while the requeued tasks
complete their critical sections.  And then only if the new task in cond_wait is
of lower priority than all the requeued tasks waiting for __mutex.

=Comment: #7=================================================
Ankita Garg <ankigarg@in.ibm.com> - 
(In reply to comment #6)
> So some corrections to comment and answers to my and Ankita's questions.
> 
> Glibc pthread_cond_broadcast() does *not* currently requeue from futex to mutex
> as I thought.  It already requeues from __futex to __lock.  The key aspect we
> have to change is that for PI, __lock needs to be held when requeue and wait
> return from the kernel.
> 

I think the requeue is being done to the external mutex... if I read the generic broadcast code
correctly, i.e.:

int
__pthread_cond_broadcast (cond)
     pthread_cond_t *cond;
{
...

      /* Wake everybody.  */
      pthread_mutex_t *mut = (pthread_mutex_t *) cond->__data.__mutex;
....
      /* lll_futex_requeue returns 0 for success and non-zero
     for errors.  */
      if (__builtin_expect (lll_futex_requeue (&cond->__data.__futex, 1,
                           INT_MAX, &mut->__data.__lock,
                           futex_val), 0))                          <== uaddr2 points to mut, the
external mutex of cond var
    {
      /* The requeue functionality is not available.  */
    wake_all:
      lll_futex_wake (&cond->__data.__futex, INT_MAX);
    }
...
}
=Comment: #8=================================================
Ankita Garg <ankigarg@in.ibm.com> - 
(In reply to comment #7)
> (In reply to comment #6)
> > So some corrections to comment and answers to my and Ankita's questions.
> > 
> > Glibc pthread_cond_broadcast() does *not* currently requeue from futex to mutex
> > as I thought.  It already requeues from __futex to __lock.  The key aspect we
> > have to change is that for PI, __lock needs to be held when requeue and wait
> > return from the kernel.
> > 
> 
> I think the requeue is being done to the external mutex... if I read the
> generic broadcast code correctly ie.:
> 
>       /* lll_futex_requeue returns 0 for success and non-zero
>      for errors.  */
>       if (__builtin_expect (lll_futex_requeue (&cond->__data.__futex, 1,
>                            INT_MAX, &mut->__data.__lock,
>                            futex_val), 0))                          <== uaddr2
> points to mut, the external mutex of cond var

And if the above is true, then the double semantics is something we should probably consider a
little more in depth. 
=Comment: #9=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > So some corrections to comment and answers to my and Ankita's questions.
> > > 
> > > Glibc pthread_cond_broadcast() does *not* currently requeue from futex to mutex
> > > as I thought.  It already requeues from __futex to __lock.  The key aspect we
> > > have to change is that for PI, __lock needs to be held when requeue and wait
> > > return from the kernel.
> > > 
> > 
> > I think the requeue is being done to the external mutex... if I read the
> > generic broadcast code correctly ie.:
> > 
> >       /* lll_futex_requeue returns 0 for success and non-zero
> >      for errors.  */
> >       if (__builtin_expect (lll_futex_requeue (&cond->__data.__futex, 1,
> >                            INT_MAX, &mut->__data.__lock,
> >                            futex_val), 0))                          <== uaddr2
> > points to mut, the external mutex of cond var
> 
> And if the above is true, then the double semantics is something we should
> probably consider a little more in depth. 
> 

Ankita, haha.  Yeah I read "__data.__lock" and thought cond.__data.__lock and missed the "mut"
prefix.  Thanks for catching that.  Sigh.  OK, so I don't believe the double semantics are an issue
for the same reasons as above, but as it is different from today, some careful analysis wouldn't
hurt.  Also, we need to investigate if new syscalls for the PI wait and requeue that come back to
userspace holding cond.__data.__lock would present any functional issues.

Discussion with Thomas resulted in the idea to create new glibc code paths:

cond_wait_pi()

cond_signal_pi()

cond_broadcast_pi()

I'll attach his latest mail here rather than trying to summarize it.

=Comment: #10=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
Oops, forgot to mention two additional kernel code paths as well:

wait_requeue_pi()
requeue_pi_unlock()

Also detailed in the mail in the subsequent comment.
=Comment: #11=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
Mail from tglx regarding proposed new code paths follows:

The current glibc simplified code does:

cond_wait(cond, mutex) <- call side mutex locked
{
	lock(cond->__data.__lock);

	unlock(mutex);

	do {
	   unlock(cond->__data.__lock);

	   wait(cond->__data.__futex);

	   lock(cond->__data.__lock);
	} while(...)

	unlock(cond->__data.__lock);

	lock(mutex);
}

cond_wait_timed() is same as above with timeout.

cond_signal(cond)
{
	lock(cond->__data.__lock);

	wake(cond->data.__futex);

	unlock(cond->__data.__lock);
}

cond_broadcast(cond)
{
	lock(cond->__data.__lock);

	unlock(cond->__data.__lock);

	requeue(cond->data.__futex, cond->__data.__lock);
}

My idea is to do the following for cond_pi:

cond_wait_pi(cond, mutex)
{
	lock_pi(cond->__data.__lock);

	unlock_pi(mutex);

	do {
	   unlock_pi(cond->__data.__lock);

	   res = wait_requeue_pi(cond->__data.__futex, cond->__data.__lock, ....);

	} while(...)

	unlock_pi(cond->__data.__lock);

	lock_pi(mutex);
}

same for cond_wait_timed_pi()

cond_signal_pi(cond)
{
	lock_pi(cond->__data.__lock);

	requeue_pi_unlock(cond->__data.__futex, cond->__data.__lock, 1, 1, ..);
}

cond_broadcast_pi(cond)
{
	lock_pi(cond->__data.__lock);

	requeue_pi_unlock(cond->__data.__futex, cond->__data.__lock, 1, ALL, ..);
}

That needs separate code paths in glibc, but that's fine once we
define that condvars initialized with the pi attr behave this
way. We cannot make that completely type transparent, as it simply does
not work. The two things do not mix at all.

What we need on the kernel side is two new futex op codes:

     wait_requeue_pi() and requeue_pi_unlock()

wait_requeue_pi()

waits on the cond->data.__futex, which is the
waitqueue and returns with cond->__data.__lock locked

Return values:

       0:        requeued cond->__data.__lock
       -ETIME:   timeout, no requeue happened
       others:   no requeue happened (on termination ....)

requeue_pi_unlock()

requeues 1 .. N waiters on the cond->__data.__futex to
cond->__data.__lock and unlocks the cond->__data.__lock in the kernel.
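
For reference, a sketch in the same pseudocode style of how a cond_wait_timed_pi()
loop might act on those return values (not real code; the op and helper names are
the proposal's, and abstime is a placeholder):

	do {
		res = wait_requeue_pi(cond->__data.__futex,
				      cond->__data.__lock, abstime);

		if (res == 0)
			break;		/* requeued and woken; __lock is held */

		if (res == -ETIME) {
			/* no requeue happened; take __lock ourselves and
			   report ETIMEDOUT after the usual bookkeeping */
			lock_pi(cond->__data.__lock);
			break;
		}

		/* any other failure: no requeue happened either, so retake
		   __lock and re-check the wakeup sequence numbers */
		lock_pi(cond->__data.__lock);
	} while (...);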

=Comment: #13=================================================
Ankita Garg <ankigarg@in.ibm.com> - 
(In reply to comment #4)

> 
> According to tglx, the core of the requeue problem is that the implementation of
> the rtmutex in the kernel and the associated PI boosting mechanism require the
> mutex to have an owner iff there are waiters.  The current implementation of
> requeue_futex places all but one task as waiters on the outer futex, but returns
> to userspace without assigning the one woken task as the owner.  When it returns
> to userspace, the pthread_cond_broadcast() call returns for the requeue syscall
> and releases the outer mutex.  The outer mutex now has waiters but no owner.  At
> which point we have two scenarios, either the woken thread is still in flight or
> is already enqueued as a waiter on the outer futex.
> 

Darren, could you please elaborate on the above paragraph a little more? If I understand it correctly,
currently, across the pthread_cond_broadcast call, the external mutex is held. So, to work around
the limitation of the rtmutex implementation, if we define requeue_futex_pi to assign the one woken
up task as the owner of the external mutex and continue to requeue all the waiters onto it, wouldn't
that be sufficient? So, on returning to userspace from the broadcast call, the owner or the woken
up task would do a pthread_mutex_unlock - causing the next waiter to be woken up and so on... I am
sure I am missing something very serious here... it would greatly help if you could point it out for me.

=Comment: #14=================================================
Ankita Garg <ankigarg@in.ibm.com> - 
(In reply to comment #11)
> Mail from tglx regarding proposed new code paths follows:
> 
> The current glibc simplified code does:
> 
> cond_wait(cond, mutex) <- call side mutex locked
> {
>         lock(cond->__data.__lock);
> 
>         unlock(mutex);
> 
>         do {
>            unlock(cond->__data.__lock);
> 
>            wait(cond->__data.__futex);
> 
>            lock(cond->__data.__lock);
>         } while(...)
> 
>         unlock(cond->__data.__lock);
> 
>         lock(mutex);
> }
> 
> cond_wait_timed() is same as above with timeout.
> 
> cond_signal(cond)
> {
>         lock(cond->__data.__lock);
> 
>         wake(cond->data.__futex);
> 
>         unlock(cond->__data.__lock);
> }
> 
> cond_broadcast(cond)
> {
>         lock(cond->__data.__lock);
> 
>         unlock(cond->__data.__lock);
> 
>         requeue(cond->data.__futex, cond->__data.__lock);

Would just like to point out that the above needs to be: 

requeue(cond->data.__futex, cond->__data.__mutex.__lock);

Come to think of it, the issue of double semantics of cond->__data.__lock occurs even in the present
implementation - there could be tasks of different types queued up on pthread_mutex_lock. And as Darren
pointed out, on looking a little deeper I think there are no practical issues associated with it. The
contention for the cond lock happens even today.. but with the lock being pi, we would ensure that
only the top priority task will ever hold the lock.

However, there is one issue I thought I would bring up. It is about associating pi characteristics
with the internal cond lock. There are two ways this could be achieved:

1) Basing the decision depending on the type of the external mutex
2) Create new APIs to enable the user to specify the type of the cond var (PI or non-PI) [This
Darren mentioned in comment #4].

While approach (2) sounds technically sound and less complicated, we have to consider that all the
existing applications will still be susceptible to priority inversion. All the existing app code
will need to be changed to work around this issue. If we don't care about old code, then yes, I would
also vote for (2). There is no drawback to using (1)... since POSIX mentions that if condition
variables are used without an external mutex, the behavior is unpredictable. Thoughts?
Comment 1 IBM Bug Proxy 2008-10-01 16:32:08 EDT
Created attachment 319162 [details]
Example of priority inversion deadlock caused by condvars
Comment 2 IBM Bug Proxy 2008-10-01 20:15:56 EDT
(In reply to comment #13)
> (In reply to comment #4)
>
> >
> > According to tglx, the core of the requeue problem is that the implementation of
> > the rtmutex in the kernel and the associated PI boosting mechanism require the
> > mutex to have an owner iff there are waiters.  The current implementation of
> > requeue_futex places all but one task as waiters on the outer futex, but returns
> > to userspace without assigning the one woken task as the owner.  When it returns
> > to userspace, the pthread_cond_broadcast() call returns for the requeue syscall
> > and releases the outer mutex.  The outer mutex now has waiters but no owner.  At
> > which point we have two scenarios, either the woken thread is still in flight or
> > is already enqueued as a waiter on the outer futex.
> >
>
> Darren, could you pl elaborate on the above para a little more ? If I
> understand it correctly, currently, across the pthread_cond_broadcast call, the
> external mutex is held.

Hrm... I don't think that is necessarily the case.  From pthread_cond_broadcast(3):

The pthread_cond_broadcast() or pthread_cond_signal() functions may  be
called  by  a  thread  whether  or not it currently owns the mutex that
threads calling pthread_cond_wait()  or  pthread_cond_timedwait()  have
associated  with the condition variable during their waits; ...

So we can't rely on being able to assign an owner to the outer mutex.

> So, to work around the limitation of the rtmutex
> implementation, if we define requeue_futex_pi to assign the one woken up task
> as the owner of the external mutex and continue to requeue all the waiters onto
> it, wouldn't that be sufficient ? So, on returning to userspace from the
> broadcast call, the owner or the woken up task would do a pthread_mutex_unlock
> - causing the next waiter to be woken up and so on... I am sure I am missing
> something very serious here... would greatly help if you could point it out for
> me..

It's a good thought, but I think the POSIX spec allowing the broadcast call to be made without the mutex held is what gets us here.


(In reply to comment #14)

> However, there is one issue thought I would bring up. It is about associating
> pi characteristics with the internal cond lock. There are two ways this could
> be achieved:
>
> 1) Basing the decision depending on the type of the external mutex
> 2) Create new APIs to enable the user to specify the type of the cond var (PI
> or non-PI) [This Darren mentioned in comment #4].
>
> While approach (2) sounds technically sound and less complicated, we have to
> consider that all the existing applications will still be susceptible to
> priority inversion. All the existing app codes will need to be changed to code
> around this issue. If we dont care for old code, then yes, I would also vote
> for (2).
> There is no drawback with using (1)... since POSIX mentions that if
> condition variables are used without external mutex, the behavior is
> unpredictable. Thoughts ?

Well, I think that we *do care* about old code, and for that reason we must opt for option 2.  The reason being that even though the behavior may be undefined, that doesn't mean applications aren't relying on it.  It's unfortunate, but apps rely on undefined behavior all the time.  sched_yield() anyone?

So to avoid changing behavior for existing applications, I think option 2 is our only real option here.
Comment 3 IBM Bug Proxy 2008-10-01 20:30:48 EDT
So our next steps here need to be to evaluate the userspace (glibc) side of things and see if there is some existing detail of pthread_cond_(timed)wait and pthread_cond_(signal|broadcast) that would make it infeasible to lock/unlock the cond.__data.__lock inside the kernel.

From the IBM side, I'd like to have a few folks take a look at this and then we can get together and discuss our findings either on IRC or possibly on the phone, depending on what seems appropriate.  I'll follow up in email to set up a time.
Comment 4 IBM Bug Proxy 2008-10-03 08:33:03 EDT
(In reply to comment #14)
<snip>
> While approach (2) sounds technically sound and less complicated, we have to
> consider that all the existing applications will still be susceptible to
> priority inversion. All the existing app codes will need to be changed to code
> around this issue. If we dont care for old code, then yes, I would also vote
> for (2). There is no drawback with using (1)... since POSIX mentions that if
> condition variables are used without external mutex, the behavior is
> unpredictable. Thoughts ?

I didn't understand this. "condition variables are used without external mutex" Does this refer to cond_wait or cond_signal/broadcast? While it is okay to call cond_signal/broadcast without holding the external mutex, it is not true for cond_wait.


I am confused by comment #19.  It says "Well, I think that we *do care* about old code, and for that reason we must opt for option 2.", but I think it should be exactly the opposite. To help clear the confusion, I am summarizing the pros and cons of the two approaches here. This only considers the view from application/glibc, not the kernel parts.

1) Basing the decision depending on the type of the external mutex

Pros:
Old apps will automatically get the benefit of requeue_pi.
No user-visible API changes, no new non-POSIX APIs.
Cons:
User could use a condvar with 2 different mutexes, one PI and one non-PI at the same time. AFAIK this is not illegal, but I need to confirm this from the docs.
User could use a condvar with a PI mutex for a while and a non-PI mutex later on (not in parallel). This is most likely legal.
We will have to see whether anything goes wrong if the first call made to the cond is cond_signal or cond_broadcast, rather than cond_wait.

2) Create new APIs to enable the user to specify the type of the cond var (PI or non-PI) [This Darren mentioned in comment #4].

Pros:
No guess work or heuristics in deciding the type of the cond->data.__lock.
The problem of using a condvar with 2 different mutexes (PI and non-PI) can still occur, but this time the blame is on the user.

Cons:
Old apps will need re-coding to take advantage of requeue_pi.
The new API will be non-POSIX for the time being, probably forever. App writers will be wary of using such an API.
Comment 5 IBM Bug Proxy 2008-10-03 14:17:45 EDT
(In reply to comment #21)
> (In reply to comment #14)
> <snip>
> > While approach (2) sounds technically sound and less complicated, we have to
> > consider that all the existing applications will still be susceptible to
> > priority inversion. All the existing app codes will need to be changed to code
> > around this issue. If we dont care for old code, then yes, I would also vote
> > for (2). There is no drawback with using (1)... since POSIX mentions that if
> > condition variables are used without external mutex, the behavior is
> > unpredictable. Thoughts ?
>
> I didn't understand this. "condition variables are used without external mutex"
> Does this refer to cond_wait or cond_signal/broadcast? While it is okay to call
> cond_signal/broadcast without holding the external mutex, it is not true for
> cond_wait.
>

Just referring to pthread_cond_signal() and pthread_cond_broadcast() here.  From the man page:

The pthread_cond_broadcast() or pthread_cond_signal() functions may  be
called  by  a  thread  whether  or not it currently owns the mutex that
threads calling pthread_cond_wait()  or  pthread_cond_timedwait()  have
associated  with the condition variable during their waits; however, if
predictable scheduling behavior is required, then that mutex  shall  be
locked    by    the    thread   calling   pthread_cond_broadcast()   or
pthread_cond_signal().

Note this isn't as strong as saying "Undefined" as in other places in the POSIX spec.  Here we're just saying that the wakeups won't be synchronized on the caller holding the mutex (since they don't).  So this IS a valid scenario.  Unlike having two threads wait on a cond var with different outer mutexes, which is "Undefined" behavior, so we don't have to worry about what happens with our new implementations there.  In the case of a signal or broadcast without a locked mutex, however, we still have to ensure the implementation is functional.
Comment 6 IBM Bug Proxy 2008-10-06 20:45:45 EDT
(In reply to comment #22)
> I am confused by comment #19.  It says "Well, I think that we *do care* about
> old code, and for that reason we must opt for option 2.", but I think it should
> be exactly the opposite. To help clear the confusion, I am summarizing the pros
> and cons of the two approaches here. This only considers the view from
> application/glibc, not the kernel parts.
>
> 1) Basing the decision depending on the type of the external mutex
>
> Pros:
>         Old apps will automatically get the benefit of requeue_pi.
>         No user-visible API changes, no new non-POSIX APIs.
> Cons:
>         User could use a convar with 2 different mutexes, one PI and one non-PI
> at the same time. AFAIK this is not illegal, but I need to confirm this from
> the docs.

By saying "not illegal" .. you are saying it is acceptable?
I am Not sure about the use case though.
One external lock to wait on the condvar .. and other external lock to release/wake them up?

>         User could use a condvar with a PI mutex for a while and a non-PI mutex
> later on (not in parallel). This is most likely legal.

Do we have any examples of this usage? AFAIK usually condvars are associated with mutexes. Interested in learning when we end up using them this way.
Comment 7 IBM Bug Proxy 2008-10-06 23:45:36 EDT
(In reply to comment #24)
> (In reply to comment #22)
> > I am confused by comment #19.  It says "Well, I think that we *do care* about
> > old code, and for that reason we must opt for option 2.", but I think it should
> > be exactly the opposite. To help clear the confusion, I am summarizing the pros
> > and cons of the two approaches here. This only considers the view from
> > application/glibc, not the kernel parts.
> >
> > 1) Basing the decision depending on the type of the external mutex
> >
> > Pros:
> >         Old apps will automatically get the benefit of requeue_pi.
> >         No user-visible API changes, no new non-POSIX APIs.
> > Cons:
> >         User could use a convar with 2 different mutexes, one PI and one non-PI
> > at the same time. AFAIK this is not illegal, but I need to confirm this from
> > the docs.
>
> By saying "not illegal" .. you are saying it is acceptable?
> I am Not sure about the use case though.
> One external lock to wait on the condvar .. and other external lock to
> release/wake them up?
>

POSIX does not call the above use case illegal. Please note that we are not using two mutexes with the cond var at the same time. The use case (whether acceptable or not) could be, for example:

declare cv1, mut1, mut2;

init(mut1);
init(mut2);
init(cv1);

pthread_cond_wait(cv1, mut1);
....

pthread_cond_wait(cv1, mut2);

And as Darren pointed out from the man page,
"that is, a condition variable becomes
bound to a unique mutex when a thread waits on the condition  variable,
and this (dynamic) binding shall end when the wait returns."

So since the binding ends at the end of pthread_cond_wait call, the user is free to re-use the same cond var with another mutex with different attributes (or same).

> >         User could use a condvar with a PI mutex for a while and a non-PI mutex
> > later on (not in parallel). This is most likely legal.
>
> Do we have any examples of this usage? AFAIK usually condvars are associated
> with mutexes. Interested in learning when we end up using this way.
>

I am not sure if we have existing examples for this case, but you could think of it as a variable being re-used. Instead of having cv1 & cv2 with two mutexes mut1 & mut2, you use just cv1 for both mut1 & mut2, at different points in time in the code. Like using the same variable "i" for two consecutive 'for' loops.
Comment 8 IBM Bug Proxy 2008-10-07 01:01:00 EDT
On the call, there was some discussion regarding cancellation processing interfering with the proposed implementation. Looked a bit into the cancellation processing being done inside pthread_cond_[timed]wait. As we know, pthread_cancel is the API that can be used by the user to cancel/terminate a thread. As per the man page

"Depending on its settings, the target thread can then either ignore the request, honor it immediately, or defer it until it reaches a cancellation point."

Also,

"Cancellation points are those points in the program execution where a test for pending cancellation requests is performed and cancellation is executed if positive. The following POSIX threads functions are cancellation points:

pthread_join(3)
pthread_cond_wait(3)
pthread_cond_timedwait(3)
pthread_testcancel(3)
sem_wait(3)
sem_timedwait(3)
sigwait(3)"

In pthread_cond_wait, the call to __pthread_cleanup_push installs __condvar_cleanup as the cancellation handler. However, the actual test for any pending cancellation requests is done here:

__pthread_cond_wait(..)
{

...
do
{
...
/* Prepare to wait.  Release the condvar futex.  */
lll_mutex_unlock (cond->__data.__lock);

/* Enable asynchronous cancellation.  Required by the standard.  */
cbuffer.oldtype = __pthread_enable_asynccancel ();

<<<<< __pthread_enable_asynccancel checks if there is a pending
cancellation request. If there is, the thread is cancelled, and a cancel is
equivalent to pthread_exit. So most cleanup should be taken care of
(provided the locks are robust in nature). Note here that the
cancellation handler is being called after the __lock has been unlocked.
>>>>>

/* Wait until woken by signal or broadcast.  */
lll_futex_wait (&cond->__data.__futex, futex_val);

/* Disable asynchronous cancellation.  */
__pthread_disable_asynccancel (cbuffer.oldtype);

/* We are going to look at shared data again, so get the lock.  */
lll_mutex_lock (cond->__data.__lock);
....
}
while (val == seq || cond->__data.__woken_seq == val);
...
}

So it looks like it will not be an issue with the proposed
implementation... I could be missing something :-)
Comment 9 IBM Bug Proxy 2008-10-07 02:00:43 EDT
Some findings with regards to the other concerns we had:

1) pshared mutexes: From the man page, the process-shared attribute lets a mutex be operated upon by any thread that has access to the memory where the mutex is allocated. So for our current problem, not much of a concern. And also for documentation purposes, from the broadcast code:

int
__pthread_cond_broadcast (cond)
pthread_cond_t *cond;
{
.....
/* Do not use requeue for pshared condvars.  */
if (cond->__data.__mutex == (void *) ~0l)
goto wake_all;

....
}

So, if the external mutex is pshared, then requeue is not performed.

2) Robust condvar: Of all the attributes available for mutexes, pi and robust are the two most important ones for us. Support for robust __lock would enable the pi wait_list of the waiters to be cleaned up at the time of exit.
Comment 10 IBM Bug Proxy 2008-12-12 13:20:37 EST
Apologies for the long delay in updating.  I have had several discussions with Thomas G. on this and am currently working on implementing the latest of those with feedback from Ulrich.  Brief summary (to be followed with patches next week):

o glibc will continue to requeue to the outer mutex
o for the first pass, requeue_pi will only occur if the broadcaster holds the mutex associated with the cond var, otherwise the current wake all behavior will be preserved.
o two new kernel futex ops will be added:
FUTEX_REQUEUE_PI (requeue from one futex to another and set up the top waiter to wake owning the latter)
FUTEX_WAIT_REQUEUE_PI (works in conjunction with FUTEX_REQUEUE_PI to lock the outer mutex prior to returning to userspace); see the sketch below
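
For reference, a sketch of how the two ops might pair up from userspace, using a
futex() syscall wrapper like the one sketched earlier in this bug; the op names
come from this proposal and the in-development patchset, so the final mainline
names, values, and argument conventions may differ, and cond, mut, cond_val, and
timeout are placeholders:

/* Waiter side (cond_wait path): sleep on the condvar futex and, once
   requeued and woken, return to userspace owning the PI mutex futex. */
ret = futex(&cond->__data.__futex, FUTEX_WAIT_REQUEUE_PI, cond_val,
            (unsigned long) timeout, &mut->__data.__lock, 0);

/* Broadcaster side (holding the mutex, per the restriction above): wake one
   waiter as the new owner and requeue the rest onto the PI mutex futex. */
ret = futex(&cond->__data.__futex, FUTEX_REQUEUE_PI, 1,
            INT_MAX, &mut->__data.__lock, cond_val);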
Comment 11 IBM Bug Proxy 2008-12-20 11:40:50 EST
Created attachment 327542 [details]
Partial Patch - Not for inclusion



I've been iterating on this patch for a while and thought I should share where I am so far.  This patch applies against linux-tip core/futexes after a couple of other general fixes which I've sent out to LKML.  The basic idea is to leverage the existing futex_requeue code and add in a futex_requeue_pi_init call to handle the takeover or FUTEX_WAITERS update of the target pifutex.  This call is basically the first half of futex_lock_pi, but performed on behalf of the top_waiter.  Left to do is to confirm the proper fault handling for r/w access to the pifutex (uaddr2) and to write the futex_wait_requeue_pi code which will perform the second half of futex_lock_pi().  I'll post the next revision once I have something more or less complete.
Comment 12 IBM Bug Proxy 2009-01-01 15:20:48 EST
Created attachment 328045 [details]
Partial Patch - Not for inclusion (V2)



So I couldn't stay away and put a few more hours into this this
morning.  I've attached the current patch (still incomplete; doesn't build, etc.) for your
reference.  Note that this patch applies against linux-tip git, branch core/futexes

o futex_requeue goes so far as to queue the top_waiter on uaddr2 and set up the pi state.
o futex_wait_requeue_pi starts off where futex_requeue left off in the event of a broadcast, or it starts with effectively the beginning of futex_lock_pi in the event of a signal (where requeue wouldn't have been run).

TODO: 
o review fault handling, especially EFAULT 
o cleanup return paths
o refactor the atomic (first half) bit of futex_lock_pi and reuse the new function in futex_lock_pi, futex_wait_requeue_pi (when signaled), and possibly in futex_requeue_pi_init.

Concerns: 
o This has horribly complicated the fault handling and return paths, and that is something I will definitely have to clean up before I present this to anyone else :-) 
o the futex_key gets/puts in
futex_wait_requeue_pi get kinda hairy.  I think I have some flaws in there still where we drop the hb lock and don't do a get_futex_key() again - which will complicate the return paths even more.
Comment 13 IBM Bug Proxy 2009-01-16 20:30:48 EST
To keep everyone up to speed.  I've been out sick for a little over a day.  I've received some feedback on V1 of the patch and have been reworking it to incorporate those changes.  I currently believe I have all the major pieces in place, and am now going through cleanup, refinement, and some refactoring.  I'll try to get it to a compilable state on Monday and share the patchset here (likely still in developer format and not all broken out into functional chunks).
Comment 14 IBM Bug Proxy 2009-02-09 19:41:04 EST
I'm currently looking at implementing rt_mutex_enqueue_task().  Reading up on rt_mutex details and analyzing the code to see how to go about it "safely" :-)  Splitting up the rt_mutex_slowlock() routine across rt_mutex_enqueue_task() and rt_mutex_handle_wakeup() is tricky, especially with regards to the storage of the rt_mutex_waiter structure, which I don't think can be stored on the task's stack.  Considering hanging it off the task_struct for now, but may even look at sticking it on the futex_q - trouble with that is it limits the use of the calls to the futex code, which I don't think is acceptable.  Continuing to dig, hope to include a preliminary version in V4 of the patchset for review.
Comment 15 IBM Bug Proxy 2009-02-10 20:40:54 EST
I've got a preliminary version of rt_mutex_enqueue_task() and rt_mutex_handle_wakeup() which are basically the two halves of rt_mutex_slowlock().  One tough spot was figuring out where to store the rt_mutex_waiter as it is now needed by two different tasks.  I've made the rt_mutex API use task and waiter, and then stored the waiter in the futex_q (see the sketch below).  If other areas want to use this, they will need to find their own storage location, but I think this is reasonable, and it avoids having to add more clutter to the task struct.  I need to review the storage for safe access and verify that signals/wakeups don't break things.  After that I'll send out V4 of the patch set for peer review.
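
For reference, a sketch of the storage idea described above; the surrounding
futex_q fields are elided and the field name reflects the patchset's working
shape, so treat it as illustrative rather than the final structure:

struct futex_q {
	/* ... existing fields: hash bucket list entry, key, pi_state, etc. ... */

	/* For a requeue_pi waiter: the rt_mutex_waiter lives here, rather than
	   on the waiter's stack or in task_struct, so that both the sleeping
	   task and the task performing the requeue can reach it. */
	struct rt_mutex_waiter *rt_waiter;
};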
Comment 16 IBM Bug Proxy 2009-02-13 18:11:00 EST
Created attachment 331883 [details]
test_requeue_pi.c



This is an early (and primitive) test case that passes with V4pre4 of the patchset.  To build:

$ gcc -D_GNU_SOURCE -lrt -lpthread test_requeue_pi.c -o test_requeue_pi
Comment 17 IBM Bug Proxy 2009-02-16 20:31:27 EST
I've been working to incorporate some feedback from tglx.  I've also fixed a minor bug which resulted in a warning while running the test case.  The test case now runs without complaint from the kernel.  I should note that the test case is very crude and only covers a very simplistic version of requeue_pi.  A much more exhaustive test case is needed, and I do have something in the works there.  Once I have completed incorporating tglx's feedback I'll release V4 of the patch-set.  If anyone wants to see the current development tree, please just ping me.
Comment 18 IBM Bug Proxy 2009-02-25 19:31:11 EST
I've updated the latest kernel patchset with a fix to futex_requeue() to allow for cond_signal using nr_requeue=0.  The latest patchset is available here:

http://git.kernel.org/?p=linux/kernel/git/dvhart/linux-2.6-tip-hacks.git;a=shortlog;h=requeue-pi-dev

I'll attach the updated testcase here next.
Comment 19 IBM Bug Proxy 2009-02-27 00:51:22 EST
There's a bug that seems to appear only during the cond_signal style wakeup (one at a time with no requeue).  __rt_mutex_slowlock will return with ret=0 and waiter->task != NULL, which it should never do (ret=0 means it got the lock, in which case waiter->task has to be NULL).  I started looking into this, but didn't get very far today.
Comment 20 IBM Bug Proxy 2009-02-27 00:51:30 EST
Created attachment 333437 [details]
requeue_pi test case


------- Comment (attachment) From dvhltc@us.ibm.com 2009-02-25 19:31 EDT-------


I haven't quite got the public git tree up for this testcase, so I've attached it here.  This test case tests both cond_broadcast and cond_signal type futex_requeue usage.  With the latest kernel patches from the git tree as of today, this test is passing both tests.
Comment 21 IBM Bug Proxy 2009-02-27 20:21:03 EST
Some status on my debugging.  Turns out that __rt_mutex_slowlock() starts running with waiter->task being non-null, but for some reason try_to_take_rt_mutex() is still succeeding and it really shouldn't.  So we need to find out why waiter->task is not null and also why try_to_take_rt_mutex() succeeds.  Oddly, this only occurs during the cond_signal style wakeup from the attached testcase. I'll continue looking into this on Monday.

I thought this might be a race between the wake we do in futex_requeue() and the wake the unlock() does once the waker returns to userspace.  Some hacks to test for this and try to avoid it in the kernel and the testcase didn't prevent the bug (or did, but then hung the thread until a signal woke it and hit the bug).
Comment 22 IBM Bug Proxy 2009-03-04 03:01:09 EST
Discovered that the race results in the slow unlock path being skipped by the signaling thread!!! This is at least progress.  Now just need to see WHY it is able to take the fastpath since that is what rt_mutex_start_proxy_lock is supposed to prevent.
Comment 23 IBM Bug Proxy 2009-03-05 01:32:01 EST
Spent a half day learning ftrace.  Got some interesting data.  Found that the unsynced TSCs of the opteron systems make the output less useful.  I'm looking at using an Intel system to see if I can get better data.
Comment 24 IBM Bug Proxy 2009-03-05 02:21:24 EST
I've set up on elm3b160 (a Zpro with Intel procs).  I was easily able to reproduce there, with the signaling thread returning to userspace and calling the fast unlock path (instead of the slow path, which it should be forced to take since the rtmutex should have waiters).  Here's the trace:

------------[ cut here ]------------
kernel BUG at kernel/rtmutex.c:646!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:
0/end_device-0:0/target0:0:0/0:0:0:0/vendor
Dumping ftrace buffer:
---------------------------------
<...>-3783    3d..3 594121606us : lookup_pi_state: lookup_pi_state: allocating a new pi state
<...>-3783    3...2 594121609us : futex_requeue: futex_requeue_pi_init: futex_lock_pi_atomic returned: 0
<...>-3783    3...2 594121610us : futex_requeue: futex_requeue: futex_requeue_pi_init returned: 0
<...>-3783    3...1 594121620us : rt_mutex_unlock: rt_mutex_unlock: unlocking ffff88023b98d250
<...>-3783    3...1 594121621us : rt_mutex_unlock: rt_mutex_fastunlock: unlocked ffff88023b98d250
<...>-3773    0...1 594121627us : __rt_mutex_slowlock: __rt_mutex_slowlock: waiter->task is ffff88023d4720c0
<...>-3773    0...1 594121630us : try_to_take_rt_mutex: try_to_take_rt_mutex: assigned rt_mutex (ffff88023b98d250) owner to current ffff88023d4720c0
<...>-3773    0...1 594121631us : __rt_mutex_slowlock: __rt_mutex_slowlock: got the lock
---------------------------------
CPU 0
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac parport_pc lp parport floppy snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss e1000 snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss sg snd_mixer_oss snd_pcm sr_mod cdrom tg3 snd_timer serio_raw snd button libphy pata_acpi soundcore i5000_edac i2c_i801 ata_generic iTCO_wdt edac_core i2c_core pcspkr iTCO_vendor_support shpchp snd_page_alloc ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 3773, comm: requeue_pi Not tainted 2.6.29-rc6-ftrace01 #8 9228MC1
RIP: 0010:[<ffffffff812c8bc8>]  [<ffffffff812c8bc8>] __rt_mutex_slowlock+0x80/0x1ac
RSP: 0018:ffff88023b80fbb8  EFLAGS: 00010286
RAX: 0000000000000022 RBX: ffff88023b80fd58 RCX: ffff88023f246580
RDX: ffff88023b80fa78 RSI: 0000000000000202 RDI: ffffffff815df520
RBP: ffff88023b80fbe8 R08: 0000000000000202 R09: 0000000000000000
R10: 0000000000000010 R11: 0000000000000000 R12: ffff88023b98d250
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000040ab0940(0063) GS:ffffffff81575080(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000168a6d0 CR3: 000000023adc2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process requeue_pi (pid: 3773, threadinfo ffff88023b80e000, task ffff88023d4720c
0)
Stack:
0000000100000001 ffff88023b98d250 ffff88023b80fd58 0000000000000001
0000000000000000 ffff88023b80fc58 ffff88023b80fc18 ffffffff81060435
0000000000000001 0000000000000000 ffff88023b80fca0 00000000006017c0
Call Trace:
[<ffffffff81060435>] rt_mutex_finish_proxy_lock+0x57/0x102
[<ffffffff8105e0d5>] futex_wait_requeue_pi+0x34b/0x52c
[<ffffffff811c5da8>] ? n_tty_receive_buf+0xdae/0xe45
[<ffffffff81035563>] ? finish_task_switch+0x4a/0x8e
[<ffffffff8105f0f5>] do_futex+0x7fa/0x880
[<ffffffff81030160>] ? __wake_up+0x48/0x55
[<ffffffff811c7c04>] ? tty_ldisc_deref+0x6e/0x73
[<ffffffff811c2907>] ? tty_write+0x211/0x22c
[<ffffffff8105f296>] sys_futex+0x11b/0x139
[<ffffffff8100c2cb>] system_call_fastpath+0x16/0x1b
Code: 16 6f d9 ff 85 c0 74 31 31 c0 48 c7 c2 c0 f7 2d 81 48 c7 c6 36 5a 3b 81 48
c7 c7 a5 8b 2c 81 e8 49 57 dc ff 48 83 7b 50 00 74 04 <0f> 0b eb fe e8 84 19 dc
ff e9 dd 00 00 00 41 ff c6 31 c0 48 c7
RIP  [<ffffffff812c8bc8>] __rt_mutex_slowlock+0x80/0x1ac
RSP <ffff88023b80fbb8>
---[ end trace e7b2f4c69298dca7 ]---
note: requeue_pi[3773] exited with preempt_count 1
Comment 25 IBM Bug Proxy 2009-03-05 12:01:40 EST
I've updated my ftrace patch; the new output on the Intel box shows that the start_proxy_lock code is not setting RT_MUTEX_HAS_WAITERS like it should.  I'll start looking into why.

------------[ cut here ]------------
kernel BUG at kernel/rtmutex.c:646!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:
0/end_device-0:0/target0:0:0/0:0:0:0/vendor
Dumping ftrace buffer:
---------------------------------
<...>-3793    1d..3 558351872us : lookup_pi_state: allocating a new pi state
<...>-3793    1d..3 558351876us : lookup_pi_state: initial rt_mutex owner: ffff88023d9486c0
<...>-3793    1...2 558351877us : futex_requeue: futex_lock_pi_atomic returned: 0
<...>-3793    1...2 558351877us : futex_requeue: futex_requeue_pi_init returned: 0
<...>-3793    1...3 558351879us : rt_mutex_start_proxy_lock: task_blocks_on_rt_mutex returned 0
<...>-3793    1...3 558351880us : rt_mutex_start_proxy_lock: lock has waiterflag: 0
<...>-3793    1...1 558351888us : rt_mutex_unlock: unlocking ffff88023b5f6950
<...>-3793    1...1 558351888us : rt_mutex_unlock: lock waiter flag: 0
<...>-3793    1...1 558351889us : rt_mutex_unlock: unlocked ffff88023b5f6950
<...>-3783    0...1 558351893us : __rt_mutex_slowlock: waiter->task is ffff88023c872440
<...>-3783    0...1 558351897us : try_to_take_rt_mutex: assigned rt_mutex (ffff88023b5f6950) owner to current ffff88023c872440
<...>-3783    0...1 558351897us : __rt_mutex_slowlock: got the lock
---------------------------------
Comment 26 IBM Bug Proxy 2009-03-05 21:01:42 EST
Turns out I wasn't setting RT_MUTEX_HAS_WAITERS in rt_mutex_start_proxy_lock().  I've updated the patch on my git tree and sent it to LKML for review as well.  Sripathi, would you give your glibc tests another go today with the new code?  elm3b160 is running the new kernel if you would like to just hop on that.
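
For reference, a self-contained mock (plain userspace C, not kernel code; names
and values are illustrative) of why that missing bit matters: the rtmutex fast
unlock path is essentially a compare-and-exchange of the owner field against
"current task, no flag bits", so unless rt_mutex_start_proxy_lock() sets
RT_MUTEX_HAS_WAITERS the owner's unlock never falls into the slowpath that would
hand the lock to the requeued waiter:

#include <stdint.h>
#include <stdio.h>

#define HAS_WAITERS 1UL		/* stands in for RT_MUTEX_HAS_WAITERS */

static uintptr_t owner;		/* stands in for rt_mutex->owner */

/* Mock of the unlock fastpath: succeeds only when the owner field holds
   exactly the current task with no flag bits set. */
static int fastpath_unlock(uintptr_t self)
{
	if (owner == self) {
		owner = 0;
		return 1;	/* fast unlock taken; waiters never handled */
	}
	return 0;		/* flag bit set, so the slowpath must run */
}

int main(void)
{
	uintptr_t task = 0x1000;	/* fake task pointer */

	owner = task;			/* bug: proxy lock forgot the bit */
	printf("without HAS_WAITERS: fastpath taken = %d\n", fastpath_unlock(task));

	owner = task | HAS_WAITERS;	/* fix: mark the lock as having waiters */
	printf("with HAS_WAITERS:    fastpath taken = %d\n", fastpath_unlock(task));
	return 0;
}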
Comment 27 IBM Bug Proxy 2009-03-06 01:21:48 EST
(In reply to comment #50)
> Turns out I wasn't setting RT_MUTEX_HAS_WAITERS in rt_mutex_start_proxy_lock().
>  I've updated the patch on my git tree and sent it to LKML for review as well.
> Sripathi, would you give your glibc tests another go today with the new code?
> elm3b160 is running the new kernel if you would like to just hop on that.

I tried my glibc (libpthread.so, actually) and the simple cond_wait/cond_broadcast test case on elm3b160. Please note, this glibc is still completely untested and unverified, because I have not had a successful run with it yet. OTOH, an imperfect glibc may be good for now.. it might catch more kernel errors ;-)

When I ran the test I saw these messages in my terminal (not console):
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel: Oops: 0002 [#1] PREEMPT SMP
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel: Stack:
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  00000001de89dd88 0000000000601784 ffffffff8160c5c8 ffff8801de8c7c58
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  00000000de942340 0000000000000000 0000000000601000 ffff8801de94a100
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel: Call Trace:
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  [<ffffffff811b796a>] ? pty_write+0x3e/0x48
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  [<ffffffff8105cc25>] do_futex+0x82c/0x878
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  [<ffffffff8102f556>] ? __wake_up+0x43/0x4f
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  [<ffffffff811b6936>] ? tty_ldisc_deref+0x6e/0x73
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  [<ffffffff811b1d08>] ? tty_write+0x211/0x22c
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  [<ffffffff8105cd8c>] sys_futex+0x11b/0x139
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel:  [<ffffffff8100c0db>] system_call_fastpath+0x16/0x1b
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel: Code: 7d 10 00 74 59 49 8b 44 24 68 8b 5d cc 48 8b b8 a8 02 00 00 81 e3 ff ff ff 3f e8 83 0b ff ff 39 c3 74 3b 48 8b 45 c0 48 83 c0 48 <f0> ff 00 48 8b 45 c0 49 8b 54 24 68 b9 01 00 00 00 49 8b 74 24
Message from syslogd@ at Fri Mar  6 01:02:44 2009 ...
elm3b160 kernel: CR2: 0000000000000048

The system was still responding and I saw this in dmesg:

lookup_pi_state: allocating a new pi state
lookup_pi_state: allocating a new pi state
<lots and lots of the above message>

BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
IP: [<ffffffff8105c236>] futex_requeue+0x403/0x5c6
PGD 1de5bf067 PUD 1de5be067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
Dumping ftrace buffer:
(ftrace buffer empty)
CPU 3
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac parport_pc lp parport floppy snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device e1000 sg snd_pcm_oss snd_mixer_oss snd_pcm snd_timer tg3 sr_mod snd_page_alloc snd_hwdep cdrom libphy snd serio_raw iTCO_wdt pata_acpi i5000_edac i2c_i801 button shpchp iTCO_vendor_support edac_core i2c_core ata_generic soundcore pcspkr ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 21264, comm: test_requeue_pi Not tainted 2.6.28-rc6-dvh04 #12
RIP: 0010:[<ffffffff8105c236>]  [<ffffffff8105c236>] futex_requeue+0x403/0x5c6
RSP: 0018:ffff8801de89dd38  EFLAGS: 00010206
RAX: 0000000000000048 RBX: 000000003fff8802 RCX: 0000000000005306
RDX: ffff8801de94a100 RSI: ffffffff8142c540 RDI: ffff88023e40a640
RBP: ffff8801de89de08 R08: ffff8801de8b6440 R09: 00000000f7a1335f
R10: 0000000000000005 R11: ffff8801de89dce8 R12: ffff8801de529c58
R13: ffffffff8160c840 R14: ffffffff8160c5c0 R15: 0000000000000000
FS:  000000004669c940(0063) GS:ffff88023eb3cac0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 00000001de843000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process test_requeue_pi (pid: 21264, threadinfo ffff8801de89c000, task ffff8801de942340)
Stack:
ffff8800280687e0 ffff8801de89def8 000000017fffffff 00000000006017c0
00000001de89dd88 0000000000601784 ffffffff8160c5c8 ffff8801de8c7c58
00000000de942340 0000000000000000 0000000000601000 ffff8801de94a100
Call Trace:
[<ffffffff811b796a>] ? pty_write+0x3e/0x48
[<ffffffff8105cc25>] do_futex+0x82c/0x878
[<ffffffff8102f556>] ? __wake_up+0x43/0x4f
[<ffffffff811b6936>] ? tty_ldisc_deref+0x6e/0x73
[<ffffffff811b1d08>] ? tty_write+0x211/0x22c
[<ffffffff8105cd8c>] sys_futex+0x11b/0x139
[<ffffffff8100c0db>] system_call_fastpath+0x16/0x1b
Code: 7d 10 00 74 59 49 8b 44 24 68 8b 5d cc 48 8b b8 a8 02 00 00 81 e3 ff ff ff 3f e8 83 0b ff ff 39 c3 74 3b 48 8b 45 c0 48 83 c0 48 <f0> ff 00 48 8b 45 c0 49 8b 54 24 68 b9 01 00 00 00 49 8b 74 24
RIP  [<ffffffff8105c236>] futex_requeue+0x403/0x5c6
RSP <ffff8801de89dd38>
CR2: 0000000000000048
---[ end trace 2376de034cbbad6b ]---
note: test_requeue_pi[21264] exited with preempt_count 2
lookup_pi_state: allocating a new pi state
lookup_pi_state: allocating a new pi state
...
...

The system then stopped responding and there were various soft lockup messages on the console. I have copied one of them here:

BUG: soft lockup - CPU#1 stuck for 61s! [requeue_pi:21441]
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac parport_pc lp parport floppy snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device e1000 sg snd_pcm_oss snd_mixer_oss snd_pcm snd_timer tg3 sr_mod snd_page_alloc snd_hwdep cdrom libphy snd serio_raw iTCO_wdt pata_acpi i5000_edac i2c_i801 button shpchp iTCO_vendor_support edac_core i2c_core ata_generic soundcore pcspkr ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
CPU 1:
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac parport_pc lp parport floppy snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device e1000 sg snd_pcm_oss snd_mixer_oss snd_pcm snd_timer tg3 sr_mod snd_page_alloc snd_hwdep cdrom libphy snd serio_raw iTCO_wdt pata_acpi i5000_edac i2c_i801 button shpchp iTCO_vendor_support edac_core i2c_core ata_generic soundcore pcspkr ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 21441, comm: requeue_pi Tainted: G      D    2.6.28-rc6-dvh04 #12
RIP: 0010:[<ffffffff812b2670>]  [<ffffffff812b2670>] _spin_lock+0x28/0x2e
RSP: 0018:ffff8801de893d98  EFLAGS: 00000293
RAX: 000000000000cecc RBX: ffff8801de893d98 RCX: 0000000059d4d981
RDX: 0000000098c8b25c RSI: 00000000102f9562 RDI: ffffffff8160c840
RBP: ffff8801de893d98 R08: 00000000d0a4c4ae R09: 0000000000a96dff
R10: 0000000000000005 R11: ffff8801de893d58 R12: ffff8801de92e500
R13: 0000000000002100 R14: ffff8801de893d78 R15: 0000000000000000
FS:  00007f0695bc86e0(0000) GS:ffff88023eaadcc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000013bf6a0 CR3: 00000001dfd81000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff812b2651>] ? _spin_lock+0x9/0x2e
[<ffffffff8105ac3d>] futex_wake+0x79/0xf9
[<ffffffff8108a3c4>] ? cpupri_set+0xca/0xdd
[<ffffffff8105c4b4>] do_futex+0xbb/0x878
[<ffffffff8103285c>] ? enqueue_task_rt+0x45/0x61
[<ffffffff8102e652>] ? enqueue_task+0x50/0x5b
[<ffffffff8105da6d>] ? rt_mutex_adjust_pi+0x3c/0x60
[<ffffffff810356f5>] ? __sched_setscheduler+0x2ef/0x329
[<ffffffff8105cd8c>] sys_futex+0x11b/0x139
[<ffffffff8100c0db>] system_call_fastpath+0x16/0x1b
Comment 28 IBM Bug Proxy 2009-03-06 19:02:47 EST
Ftrace from my instrumented kernel:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
IP: [<ffffffff8105c2af>] futex_requeue+0x43e/0x5f6
PGD 23d94f067 PUD 23a5f0067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
Dumping ftrace buffer:
---------------------------------
<...>-3513    0...2 2943135757us : futex_requeue: futex_lock_pi_atomic returned: 1
<...>-3513    0...2 2943135759us : futex_requeue: futex_requeue_pi_init returned: 1
---------------------------------

# addr2line -e vmlinux ffffffff8105c2af
/test/dvhart/requeue_pi/linux-2.6-tip-hacks/arch/x86/include/asm/atomic_64.h:97

Not terribly helpful.  I'll instrument futex_requeue and the two functions caught in the trace a bit more and see what I can learn.
So I am assuming some bad arguments from the patched glibc library.  Like you said, it's good to have a less-than-perfect userspace lib in order to make sure the kernel handles bad stuff :-)


(In reply to comment #53)
> Dumping ftrace buffer:
> ---------------------------------
>    <...>-3513    0...2 2943135757us : futex_requeue: futex_lock_pi_atomic
> returned: 1

This indicates that we acquired the lock on behalf of the waiting thread.  This means the waking thread does not hold the mutex.  While this is "supported", it isn't something I have tested.  I'll work on why this is failing - but please take a look at the test case and update it to take the lock first (our final test case will have to try with and without to make sure both paths work of course).

>    <...>-3513    0...2 2943135759us : futex_requeue: futex_requeue_pi_init
> returned: 1

I don't know how that can happen - it _should_ return only 0 on success or a negative error code on failure... hrmmm...  time for some more code review, I think :-)
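
For illustration only (this is not the code in the attached test_requeue_pi.c): a minimal sketch of what "taking the lock first" on the waker side could look like, assuming the standard PI-futex value convention (0 == unowned, otherwise the owner's TID) and falling back to FUTEX_LOCK_PI on contention.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FUTEX_LOCK_PI
#define FUTEX_LOCK_PI 6
#endif

/* Acquire the PI mutex word before issuing the requeue.
 * Fast path: cmpxchg 0 -> tid in userspace.  On contention, fall back to
 * the kernel (FUTEX_LOCK_PI), which queues us and applies priority
 * inheritance to the current owner. */
static void pi_mutex_lock(int *futex_word)
{
        pid_t tid = syscall(SYS_gettid);

        if (__sync_bool_compare_and_swap(futex_word, 0, tid))
                return; /* uncontended: we now own the lock */

        syscall(SYS_futex, futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

With the lock taken this way before the requeue call, the broadcast/signal tests exercise the path where the waker actually owns the mutex.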
Comment 29 IBM Bug Proxy 2009-03-06 19:41:05 EST
OK, this was as far as I could get today:

---------------------------------
<...>-3460    0.... 499445852us : futex_requeue: Called with: uaddr1=0000000000601784 fshared=1 uaddr2=00000000006017c0 nr_wake=1 nr_requeue=2147483647 cmpval=ffff88023c91bef8 requeue_pi=1
<...>-3460    0...2 499445856us : futex_requeue: had cmpval, curval=20u
<...>-3460    0...2 499445857us : futex_requeue: top_waiter: ffff880237ce3c58
<...>-3460    0...2 499445858us : futex_lock_pi_atomic: curval: 0u
<...>-3460    0...2 499445859us : futex_lock_pi_atomic: Surprise, we got the lock
<...>-3460    0...2 499445859us : futex_requeue: futex_lock_pi_atomic returned: 1
<...>-3460    0...2 499445860us : futex_requeue: futex_requeue_pi_init returned: 1
---------------------------------

Note that futex_lock_pi_atomic() reports curval=0, which confirms that the target futex of the requeue has no owner.  I'll start looking at why we fail in this case on Monday.  I'll attach my ftrace debug patch next as well in case someone wants to work with it.
Comment 30 IBM Bug Proxy 2009-03-06 19:51:04 EST
Created attachment 334375 [details]
Ftrace debugging patch for requeue_pi


------- Comment (attachment) From dvhltc@us.ibm.com 2009-03-06 19:41 EDT-------


To use this patch, you need to enable ftrace in your config, then:

Add ftrace_dump_on_oops to your kernel boot command line

# mount -t debugfs nodev /sys/kernel/debug
# echo ftrace_printk > /sys/kernel/debug/tracing/iter_ctrl

Then run your test and monitor the console output; on an oops, the contents of /sys/kernel/debug/tracing/trace are dumped to the console.  If you hang but don't oops, you can cat the contents of the file at any time.
Comment 31 IBM Bug Proxy 2009-03-18 19:01:57 EDT
Created attachment 335780 [details]
Wait state transition diagram


------- Comment on attachment From dvhltc@us.ibm.com 2009-03-18 18:50 EDT-------


I've been hard at work trying to map out the state transition diagram per John Stultz's recommendation.  The attached image contains the states of the wait path.  I think this may be enough to get back to the code, but should I confuse myself I'll start looking at the requeue (waking) path as well.
Comment 32 IBM Bug Proxy 2009-03-23 21:11:23 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-23 21:03 EDT-------
I've updated the wakeup path and am now testing and fixing bugs.  Some of the refactoring is getting... hairy; once it's working we'll have to look at polishing it up a bit.  I caught a mishandling of the pi_state and am currently testing a fix for that.
Comment 33 IBM Bug Proxy 2009-03-24 04:51:17 EDT
Created attachment 336431 [details]
requeue-pi testcase


------- Comment on attachment From dvhltc@us.ibm.com 2009-03-24 04:41 EDT-------


This is an updated version of the requeue_pi testcase which now runs the broadcast and signal PI tests with the outer mutex held and then repeats them with it not held.  The signal test without the mutex held (the last one run) is still causing a NULL pointer dereference in the raw spinlock code.  Since the signal path holds the hb locks for a much shorter period of time, I think we're hitting a race with the hb locks around where the top_waiter is given the rt_mutex by the waker (I believe we are hitting this in the waiter thread).  The output follows:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff804d699f>] _spin_lock+0xe/0x1f
PGD 23b1a0067 PUD 23d937067 PMD 0 
Oops: 0002 [#6] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
CPU 2 
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth dm_mirror dm_region_hash dm_log dm_multipath dm_mod sbs sbshc battery ac parport_pc lp parport sg tg3 libphy serio_raw button shpchp pcspkr mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3894, comm: requeue_pi Tainted: G      D    2.6.29-rc7-dvh04 #6 9228MC1
RIP: 0010:[<ffffffff804d699f>]  [<ffffffff804d699f>] _spin_lock+0xe/0x1f
RSP: 0018:ffff88023d9edc18  EFLAGS: 00010206
RAX: 0000000000000100 RBX: ffff88023d9edca0 RCX: ffff88023d964e80
RDX: ffff88023d9edb08 RSI: 0000000000000202 RDI: 0000000000000000
RBP: ffff88023d9edc18 R08: ffff88023d9ec000 R09: 0000000000000000
R10: ffff88023d9edc58 R11: 0000000000000000 R12: ffffffff8088f868
R13: 0000000000000000 R14: ffff88023d9edc58 R15: 0000000000601960
FS:  000000004150d940(0063) GS:ffff88023eaed4c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000023c857000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process requeue_pi (pid: 3894, threadinfo ffff88023d9ec000, task ffff88023d584e40)
Stack:
 ffff88023d9edde8 ffffffff802599ce 0000000100000000 ffffffff00000000
 0000000000000000 0000000000000001 0000000000000001 0000000000000000
 ffff880200000062 ffff88023d9edc60 ffff88023d9edc60 ffff88023d9edc70
Call Trace:
 [<ffffffff802599ce>] futex_wait_requeue_pi+0x2ab/0x622
 [<ffffffff803b4221>] ? n_tty_receive_buf+0xdd6/0xe45
 [<ffffffff80280020>] ? cpupri_set+0xca/0xdd
 [<ffffffff8025ab43>] do_futex+0x788/0x800
 [<ffffffff8024f85c>] ? remove_wait_queue+0x34/0x40
 [<ffffffff8022fe76>] ? __wake_up+0x48/0x54
 [<ffffffff803b6116>] ? tty_ldisc_deref+0x6e/0x73
 [<ffffffff8020bed6>] ? ftrace_call+0x5/0x2b
 [<ffffffff8025acd6>] sys_futex+0x11b/0x139
 [<ffffffff8020c24b>] system_call_fastpath+0x16/0x1b
Code: 00 00 00 01 0f 94 c0 84 c0 ba 01 00 00 00 75 09 f0 81 07 00 00 00 01 30 d2 c9 89 d0 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 01 00 00 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c9 c3 55 48 89 e5 
RIP  [<ffffffff804d699f>] _spin_lock+0xe/0x1f
 RSP <ffff88023d9edc18>
CR2: 0000000000000000
---[ end trace 766e92b7fcb1d0b1 ]---
Comment 34 IBM Bug Proxy 2009-03-24 05:11:11 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-24 05:08 EDT-------
With the patches in my requeue-pi-dev branch as of right now, the following ftrace output can be observed when running the previously attached test case:

# tracer: function
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
requeue_pi-3405  [002]   302.797638: futex_requeue: start: nr_wake=1, nr_requeue=0, requeue_pi=1
requeue_pi-3405  [002]   302.797642: futex_requeue: futex_lock_pi_atomic returned: 1
requeue_pi-3405  [002]   302.797643: futex_requeue: waking the top waiter: ffff88023b559c58
requeue_pi-3405  [002]   302.797646: futex_requeue: futex_requeue_pi_init returned: 1
requeue_pi-3405  [002]   302.797648: lookup_pi_state: allocating a new pi state
requeue_pi-3405  [002]   302.797648: lookup_pi_state: initial rt_mutex owner: ffff88023c428080
requeue_pi-3405  [002]   302.797649: futex_requeue: got the lock, grab the pi_state: ffff88023ad90740
requeue_pi-3405  [002]   302.797650: rt_mutex_start_proxy_lock: lock(ffff88023ad90750)->owner: ffff88023c428080
requeue_pi-3405  [002]   302.797651: rt_mutex_start_proxy_lock: lock has waiter flag: 2
requeue_pi-3395  [003]   302.797653: futex_wait_requeue_pi: waiting for the first hb lock
requeue_pi-3405  [002]   302.797655: rt_mutex_start_proxy_lock: task_blocks_on_rt_mutex returned 0
requeue_pi-3405  [002]   302.797656: futex_requeue: freeing pi_state
requeue_pi-3395  [003]   302.797658: futex_wait_requeue_pi: We have been requeued, q.lock_ptr ((null))

^ We will crash immediately after this point.  Either it shouldn't be getting set to NULL, or my logic is faulty and I shouldn't be trying to use this lock here.  I'm too tired to be trusted to decide which tonight.  More tomorrow.

requeue_pi-3395  [003]   303.253386: rt_mutex_unlock: unlocking ffff88023ad90750
requeue_pi-3395  [003]   303.253386: rt_mutex_unlock: (fastunlock) lock waiter flag: 2
requeue_pi-3395  [003]   303.253387: rt_mutex_slowunlock: (slowunlock) rt_mutex (ffff88023ad90750) has waiters
requeue_pi-3395  [003]   303.253388: rt_mutex_slowunlock: lock (ffff88023ad90750) top waiter: ffff88023ad2dcd8   task: ffff88023ae88a40
requeue_pi-3395  [003]   303.253389: rt_mutex_slowunlock: setting waiter->task to NULL
requeue_pi-3397  [000]   303.253396: futex_wait_requeue_pi: waiting for the first hb lock
requeue_pi-3397  [000]   303.253397: futex_wait_requeue_pi: We have been requeued, q.lock_ptr (ffffffff8088eb20)
requeue_pi-3397  [000]   303.253398: futex_wait_requeue_pi: q.pi_state=ffff88023ad90740
owner=ffff88023ae88a40
requeue_pi-3397  [000]   303.253399: futex_wait_requeue_pi: don't have the lock, pi_state: ffff88023ad90740
requeue_pi-3397  [000]   303.253400: __rt_mutex_slowlock: waiter->task is (null)
requeue_pi-3397  [000]   303.253401: try_to_take_rt_mutex: assigned rt_mutex (ffff88023ad90750) owner to current ffff88023ae88a40
requeue_pi-3397  [000]   303.253401: __rt_mutex_slowlock: got the lock
Comment 35 IBM Bug Proxy 2009-03-24 20:01:35 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-24 19:52 EDT-------
I've fixed the bug from the last comment and one that followed it.  The current requeue-pi-dev branch contains the fixes and passes all the requeue_pi tests except for the broadcast-without-an-owner case, which hangs after waking the first waiter (that waiter takes the lock atomically and then blocks on the hb lock, which the waker never releases for some reason).  I'll take a stab at knocking that one out tomorrow.

If anyone is looking for a way to help, augmenting the requeue_pi.c test case attached earlier (Attachment 44217 [details]) to include tests with randomized timeouts and signal handling to exercise those paths in the kernel would be very helpful (a rough sketch of one possible signal stressor is included at the end of this comment).  We should also be adding unmapped shared futexes and invalid uaddr values to guard against DoS vulnerabilities.

------- Comment From dvhltc@us.ibm.com 2009-03-24 19:52 EDT-------
Oh, also need to add tests where the mutex is held by a thread other than the waker.
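
As a rough starting point for the signal-handling tests (purely a sketch, not part of the attached test case; NUM_WAITERS and the waiter_threads array are placeholders the real test would provide), a harness thread like the following could fire SIGUSR1 at random waiters at random intervals so futex_wait_requeue_pi() sees signal wakeups in a variety of states:

#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_WAITERS 8                          /* placeholder */
extern pthread_t waiter_threads[NUM_WAITERS];  /* assumed: created by the test */

static void sigusr1_handler(int sig)
{
        (void)sig;      /* nothing to do; the point is to interrupt the wait */
}

/* Run in its own thread: install the handler (process-wide, no SA_RESTART)
 * and then signal random waiters at random 1ms-100ms intervals. */
static void *signal_stressor(void *arg)
{
        struct sigaction sa = { .sa_handler = sigusr1_handler };

        (void)arg;
        sigaction(SIGUSR1, &sa, NULL);

        for (;;) {
                usleep(1000 + rand() % 99000);
                pthread_kill(waiter_threads[rand() % NUM_WAITERS], SIGUSR1);
        }
        return NULL;
}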
Comment 36 IBM Bug Proxy 2009-03-25 00:51:13 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-25 00:40 EDT-------
A trace from the hang with broadcast without holding the mutex:

[root@elm3b160 dvhart]# cat !$
cat trace.dvh10
# tracer: function
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
kjournald-4261  [002] 17797.561814: futex_requeue: start: nr_wake=1, nr_requeue=2147483647, requeue_pi=1
kjournald-4261  [002] 17797.567754: futex_requeue: futex_lock_pi_atomic returned: 1
kjournald-4261  [002] 17797.574898: futex_requeue: waking the top waiter: ffff88023acd9c58
syslogd-4251  [000] 17797.574905: futex_wait_requeue_pi: waiting for the first hb lock
kjournald-4261  [002] 17797.586410: futex_requeue: futex_requeue_pi_init returned: 1
kjournald-4261  [002] 17797.586412: lookup_pi_state: matched futex, pi_state already exists: (null)

^ that looks mighty suspicious to me, investigating there.

kjournald-4261  [002] 17797.586412: futex_requeue: got the lock, grab the pi_state: (null)
kjournald-4261  [002] 17797.586413: futex_requeue: futex_requeue returning -22
syslogd-4251  [000] 17797.586413: futex_wait_requeue_pi: We have been requeued, q.lock_ptr (ffffffff808900b0)
syslogd-4251  [000] 17797.586414: futex_wait_requeue_pi: q.pi_state=(null)
owner=(null)
syslogd-4251  [000] 17797.586415: futex_wait_requeue_pi: waker got the lock for us, pi_state: (null)
Comment 37 IBM Bug Proxy 2009-03-25 02:21:03 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-25 02:18 EDT-------
I am very happy to announce that the requeue_pi.c testcase passes all four unit tests against the requeue-pi-dev branch as of right now.  I have added FIXMEs to the patches marking some known issues and would welcome input on those.  And as I mentioned earlier, any help beefing up the testcase for signals, timeouts, faulting, and invalid user data would be much appreciated.
Comment 38 IBM Bug Proxy 2009-03-25 19:51:13 EDT
Created attachment 336726 [details]
requeue-pi testcase w/ 3rd party blockers and timeouts


------- Comment on attachment From dvhltc@us.ibm.com 2009-03-25 19:42 EDT-------


This version includes tests for timeouts as well as the previously unsupported -EMORON case where a 3rd party owns the mutex during requeue.  Currently the waiter returns -ETIMEDOUT whenever a timeout is used, regardless of the timeout value.  I'm debugging that now; the latest requeue-pi-dev branch contains updated ftrace instrumentation to help track this down.
Comment 39 IBM Bug Proxy 2009-03-26 07:41:10 EDT
------- Comment From sripathik@in.ibm.com 2009-03-26 07:36 EDT-------
I now have another version of glibc prototype ready. Changes:
1) Fixed the deadlock in the futex_wait exit path. This happened because I was trying to take the external mutex, but the lock would already have been acquired in the kernel.
2) Implemented the TIMEDWAIT parts.
3) Simplified the patch by removing the __builtin_expect conditions.... optimization is not all that important at this stage!

This version of glibc runs my simple test case (test_requeue_pi_existing.c) without hanging the program. I have not done much more testing with it yet.

I have put the compiled libpthread library at http://kernel.beaverton.ibm.com/~sripathi/requeue_pi/libpthread.so. It can be used by setting LD_LIBRARY_PATH. This directory also contains the patch (glibc_requeue_pi_prototype_v3.patch) and built rpms. Install the rpms at your own risk, because I have not installed them myself ;-)
Comment 40 IBM Bug Proxy 2009-03-26 19:41:24 EDT
Created attachment 336927 [details]
requeue_pi testcase w/ absolute timeouts


------- Comment on attachment From dvhltc@us.ibm.com 2009-03-26 19:35 EDT-------


This version of the testcase uses absolute timeouts and requires the patches from the requeue-pi-dev branch as of right now.
Comment 41 IBM Bug Proxy 2009-03-26 20:21:08 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-26 20:15 EDT-------
The latest requeue-pi-dev patches include suggestions from Dinakar as well as a completely reworked requeue loop in futex_requeue(), since the earlier one worked for requeue_pi but didn't actually requeue anything for the non-PI case... oops.

Also included is the incremental patch "signal-handling-fixes-incremental", which contains some comments in futex_wait_requeue_pi() about a wakeup race between signals, timeouts, and futex_unlock_pi(), and includes what information I've gathered thus far on serialization.  I think the fix is to take q.lock_ptr followed by q.pi_state->pi_mutex.wait_lock (in that order) to serialize with futex_unlock_pi() and wake_futex_pi().  rt_mutex_finish_proxy_lock() will need to be updated to require the wait_lock to be held on entry so we can take it a bit earlier in futex_wait_requeue_pi().  Thoughts on this approach are welcome!
Comment 42 IBM Bug Proxy 2009-03-27 03:41:18 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-27 03:38 EDT-------
I've hit a bug in the signal handling logic of futex_wait_requeue_pi() that results in a softlockup and am currently working to resolve it.  I'll try tomorrow with the rt-watchdog and see if I can kill the runaway threads and collect the ftrace.  Latest code is of course still available via the git repository.
Comment 43 IBM Bug Proxy 2009-03-27 18:51:46 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-27 18:40 EDT-------
Signal handling bug caught and fixed.  Patch headers reviewed, comments updated, rebase to the latest tip complete, and basic testing passed.  A careful review of the rebased futex.c is needed and will likely have to wait until Monday.  In the meantime, I'd appreciate a review of the requeue-pi-v6-rc1 branch:

http://git.kernel.org/?p=linux/kernel/git/dvhart/linux-2.6-tip-hacks.git;a=shortlog;h=requeue-pi-v6-rc1
Comment 44 IBM Bug Proxy 2009-04-10 13:42:13 EDT
------- Comment From dvhltc@us.ibm.com 2009-04-10 13:30 EDT-------
This bug is for the kernel side syscall implementation of requeue_pi.  The patches have been submitted and accepted into -rt and -tip, and will likely be included in 2.6.31.  MRG has also pulled them into their preliminary alpha builds for MRG 1.2.

------- Comment From dvhltc@us.ibm.com 2009-04-10 13:31 EDT-------
Leaving as FIXEDAWAITINGTEST for a while as I am still working on the C test cases.  The community is currently pounding away on it.  We are waiting for the JTC to have some cycles to "unleash the beasts" on it.
Comment 45 IBM Bug Proxy 2009-04-28 11:01:51 EDT
------- Comment From paul_thwaite@uk.ibm.com 2009-04-28 10:54 EDT-------
Hi Darren,

Mark mentioned you would like some Java stress testing for this bug.

Could you please let me know more details, and I will arrange for a machine to be installed and some tests to be run:

- Is there any particular hardware we should use?
- What base OS should we use?  Is RHEL5.2 OK?
- Where can we download the kernel from?
- Anything else I need to know before running the tests?

Thanks, Paul.
Comment 46 IBM Bug Proxy 2009-04-29 05:51:46 EDT
------- Comment From paul_thwaite@uk.ibm.com 2009-04-29 05:42 EDT-------
Hi Darren,

Please may we have an RPM, because it will be easier to install across machines.  Thanks.
Comment 47 IBM Bug Proxy 2009-05-06 15:11:37 EDT
------- Comment From dino@in.ibm.com 2009-05-06 15:06 EDT-------
The glibc RPMs with the REQUEUE_PI support can be found at the URL below. Please try out these RPMs after you are done testing Darren's kernel.

http://kernel.beaverton.ibm.com/~dino/rt/glibc/requeue-pi/

You need all the rpms from the above directory, and they can be installed on a RHEL5.3 system only. The command to install them is:
rpm -Uvh --replacefiles --replacepkgs *.rpm

I have run LTC's release testing (which includes a 100 calibrate run and a long-running recalibrate) with these rpms installed as above. Let me know if you have any issues.  Thanks.

Note: The built glibc version is the same as the existing one on a RHEL5.3 system. This is intentional. To figure out whether you have the right glibc, run this command:
[root@elm3b172 built-rpms]# rpm -q -i glibc | grep Build
Release     : 34                            Build Date: Mon 05 Jan 2009 07:54:28 PM EST
Install Date: Sun 11 Jan 2009 09:37:14 PM EST      Build Host: hs20-bc2-5.build.redhat.com
Release     : 34                            Build Date: Wed 06 May 2009 12:09:47 PM EDT
Install Date: Wed 06 May 2009 02:50:24 PM EDT      Build Host: elm3b172.beaverton.ibm.com
Comment 48 IBM Bug Proxy 2009-05-26 06:41:49 EDT
------- Comment From dvhltc@us.ibm.com 2009-03-26 19:23 EDT-------
(In reply to comment #65)
> Created an attachment (id=44277) [details]
> requeue-pi testcase w/ 3rd party blockers and timeouts
>
> This version includes tests for timeouts as well as the previously unsupported
> -EMORON case where a 3rd party owns the mutex during requeue.  Currently the
> waiter returns -ETIMEDOUT if a timeout is used, regardless of the timeout
> value.  Debugging that now, the latest requeue-pi-dev branch contains updated
> ftrace instrumentation to try and track this down.

At first I thought this was because the kernel wasn't making this timeout absolute.  As it turns out, the problem is REALLY that my testcase didn't make the timeouts absolute.  Since we have to implement the glibc side of things here anyway, let's make the userspace interface and the kernel interface the same: absolute timeouts.  I've fixed the kernel to do this and am in the process of fixing the testcase.
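
For reference, a minimal sketch of how a raw-syscall testcase can build such an absolute timeout.  The opcode value (11) is the one used by the in-development patches, and CLOCK_MONOTONIC is an assumption here; the exact clock and flag handling may still change:

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef FUTEX_WAIT_REQUEUE_PI
#define FUTEX_WAIT_REQUEUE_PI 11        /* assumption: value from the patches */
#endif

/* Wait on the condvar futex word (expecting 'val') with an absolute timeout
 * 'rel_ns' nanoseconds from now, to be requeued onto the PI mutex futex. */
static int wait_requeue_pi_abs(int *cond_futex, int val, int *mutex_futex,
                               int64_t rel_ns)
{
        struct timespec abs;

        clock_gettime(CLOCK_MONOTONIC, &abs);   /* assumed clock */
        abs.tv_sec  += rel_ns / 1000000000LL;
        abs.tv_nsec += rel_ns % 1000000000LL;
        if (abs.tv_nsec >= 1000000000L) {
                abs.tv_sec++;
                abs.tv_nsec -= 1000000000L;
        }

        return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI, val,
                       &abs, mutex_futex, 0);
}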
Comment 49 IBM Bug Proxy 2009-06-02 12:13:10 EDT
------- Comment From paul_thwaite@uk.ibm.com 2009-06-02 12:01 EDT-------
WITHOUT the glibc RPMs: stress tests were kicked off on an LS21 machine, and we experienced a machine hang after the fourth 1-hour test.  Unfortunately, no diagnostics or kernel messages were found.  The machine has been rebooted with crashdump enabled and testing has been restarted.  So far, no new failures have been seen.  It takes a while to run the stress tests (they're 1-hour and 3-hour tests), so I expect to have results by the end of the week.
Comment 50 IBM Bug Proxy 2009-06-10 21:01:48 EDT
------- Comment From johnstul@us.ibm.com 2009-06-10 20:51 EDT-------
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=52400ba946759af28442dee6265c5c0180ac7122;hp=f801073f87aa22ddf0e9146355fec3993163790f

Darren's patches have landed in mainline!
Comment 51 IBM Bug Proxy 2009-06-11 12:11:53 EDT
Created attachment 347434 [details]
kernel crash in machine_kexec


------- Comment on attachment From paul_thwaite@uk.ibm.com 2009-06-11 12:09 EDT-------


We've not managed to run many tests because the machine has died three times thus far. 

The last problem caused a kernel crash in machine_kexec.  I have attached a print screen of the crash. 

Please advise on the next plan of action?  Thanks.
Comment 52 IBM Bug Proxy 2009-06-15 01:31:48 EDT
------- Comment From sripathik@in.ibm.com 2009-06-15 01:23 EDT-------
(In reply to comment #85)
> Created an attachment (id=45926) [details]
> kernel crash in machine_kexec
>
> We've not managed to run many tests because the machine has died three times
> thus far.
>
> The last problem caused a kernel crash in machine_kexec.  I have attached a
> print screen of the crash.
>
> Please advise on the next plan of action?  Thanks.
>

Hi Paul,

From the backtrace, it looks like the crash originated in __run_posix_cpu_timers. I suspect that kdump was enabled, but it could not be triggered for some reason, hence the RIP is at machine_kexec. Also, I suspect panic_on_oops was enabled.

Do you know whether the same backtrace was seen in all three crashes? Have you got a kdump from any of them?

In any case, I think we should close this bug, because the patches have now landed in mainline as well. It would be nice if you could open a new bug for the machine crashes you are seeing.

Darren, do you agree?
Comment 53 IBM Bug Proxy 2009-06-15 11:32:02 EDT
------- Comment From dvhltc@us.ibm.com 2009-06-15 11:26 EDT-------
Sripathi, yes I agree.  The new crash Paul is mentioning appears to be something independent of this fix.
Comment 54 IBM Bug Proxy 2009-07-10 12:23:06 EDT
------- Comment From dvhltc@us.ibm.com 2009-07-10 12:18 EDT-------
Suggest closing it.  The alpha kernels have the patches.
Comment 55 IBM Bug Proxy 2009-10-21 16:41:51 EDT
------- Comment From fultonm@ca.ibm.com 2009-10-21 16:39 EDT-------
I am pretty sure another instance of this problem can be seen with the real-time microbenchmark (see Mike Fulton or John Stultz if you don't have it yet).

The fastest way to see it is to run with:

export RTMB_OPTS='-ta 262us --test GarbageGenerationTest'
export OPTS="-Xmx2024m -Xgc:scopedMemoryMaximumSize=500m -Xgc:noSynchronousGCOnOOM"
export CHILD_OPTS="-Xmx128m -Xrealtime -Xgc:scopedMemoryMaximumSize=10m -Xgc:noSynchronousGCOnOOM"
(export PATH=/home/fultonm/ibm-wrt2-ga/jre/bin:$PATH; java $OPTS -Xgcpolicy:metronome -jar lib/rtmb.jar  -v:simple --jvm-options "$CHILD_OPTS" $RTMB_OPTS)

The output from the garbage generation test will show a pretty large variation in times between the fastest and slowest GCs. Alan Stevens has traced this back to this problem, so running this test will be a good way to see whether the problem is fixed.

Current output is similar to:

Test result for Thread 0
==========================================

Measured median for this test run: 30250.500 us
Measured maximum time for this test run: 76772.000 us
Measured minimum time for this test run: 28353.000 us
Measured mean for this test run: 30906.995 us
Measured standard deviation for this test run: 2286.327 us
Expected number of iterations between 21175.350 us and 39325.650 us: 9900
Total number of conforming iterations: 9758
Total number of outlying iterations: 242

Test result for Thread 1
==========================================

Measured median for this test run: 26727.500 us
Measured maximum time for this test run: 80685.000 us
Measured minimum time for this test run: 24942.000 us
Measured mean for this test run: 27165.975 us
Measured standard deviation for this test run: 1939.553 us
Expected number of iterations between 18709.250 us and 34745.750 us: 9900
Total number of conforming iterations: 9738
Total number of outlying iterations: 262

Test result for Thread 2
==========================================

Measured median for this test run: 26617.000 us
Measured maximum time for this test run: 72786.000 us
Measured minimum time for this test run: 24724.000 us
Measured mean for this test run: 27289.974 us
Measured standard deviation for this test run: 2158.613 us
Expected number of iterations between 18631.900 us and 34602.100 us: 9900
Total number of conforming iterations: 9736
Total number of outlying iterations: 264

Test result for Thread 3
==========================================

Measured median for this test run: 26614.000 us
Measured maximum time for this test run: 80069.000 us
Measured minimum time for this test run: 24766.000 us
Measured mean for this test run: 27040.058 us
Measured standard deviation for this test run: 1905.609 us
Expected number of iterations between 18629.800 us and 34598.200 us: 9900
Total number of conforming iterations: 9734
Total number of outlying iterations: 266

So we are seeing a maximum time of roughly 81ms against a minimum time of roughly 25ms.

It should be much tighter, with a range of around 9ms (i.e. 25ms to 34ms).
Comment 56 IBM Bug Proxy 2009-10-21 16:51:41 EDT
------- Comment From dvhltc@us.ibm.com 2009-10-21 16:43 EDT-------
This bug tracks the kernel bits necessary for Requeue PI.  These are now in
place in both the upstream RT kernel and the early development MRG 1.2
kernels.  This bug can be closed; however, the related bugs (bug 45000, bug
52715, bug 45400, bug 52719) are waiting on the final glibc updates.  For
reference, I sent the following to Alan Stevens this morning regarding the
state of that work:

The latest 2.6.31.4-rt14 has most of the requeue_pi bits, with a couple of
corner-case and abuse patches pending for the next release.

In order to make use of these bits, you need a glibc patch to use the new futex
syscall op-codes. The problem you are interested in, however, probably requires
a pure glibc change which makes the condvar internal lock PI-aware.

This latter change is unfortunately problematic due to the POSIX API. We had
initially wanted to do this dynamically, checking the associated mutex and
making either a regular locking call or a PI locking call as appropriate. A
condvar is not initialized with a mutex, and the mutex is not passed to the
pthread_cond_broadcast() and _signal() calls. Therefore, these calls don't
know whether the lock should be PI or not until a call to pthread_cond_wait()
or _timedwait() is made, which associates a mutex with the condvar.
Unfortunately, this creates a race in which different futex op-codes can be
used on the same condvar internal lock, leading to deadlock.

To fix this problem, Dinakar is adding a new API, something along the lines of:

pthread_condattr_setprotocol(PTHREAD_PRIO_INHERIT)

to force a condvar into PI mode and avoid the race. This will unfortunately
necessitate code changes in applications wishing to use PI-aware condvars.
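
To make the application impact concrete, here is a purely hypothetical sketch of how the proposed call might be used.  Neither the name nor the signature is final (the attribute-object-plus-protocol form is my guess), so treat this as illustration only:

#include <pthread.h>

/* Hypothetical prototype for the proposed attribute call; not in any glibc yet. */
extern int pthread_condattr_setprotocol(pthread_condattr_t *attr, int protocol);

static pthread_mutex_t m;
static pthread_cond_t  c;

static void init_pi_condvar(void)
{
        pthread_mutexattr_t ma;
        pthread_condattr_t  ca;

        pthread_mutexattr_init(&ma);
        pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);  /* standard POSIX */
        pthread_mutex_init(&m, &ma);

        pthread_condattr_init(&ca);
        pthread_condattr_setprotocol(&ca, PTHREAD_PRIO_INHERIT);   /* proposed API */
        pthread_cond_init(&c, &ca);
}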

If you would like to try the existing bits (with the known race condition,
which we can hit with specjbb) to get a feel for how things will work, Dinakar
can point you to the right glibc.  I can get you a MRG 1.2 alpha kernel with
the kernel support as well.

------- Comment From dvhltc@us.ibm.com 2009-10-21 16:45 EDT-------
(In reply to comment #92)
> I am pretty sure another instance of this problem can be seen with the
> real-time microbenchmark (see mike fulton or john stultz if you don't have it
> yet)

I suspect this is more likely due to a priority inversion on the internal condvar lock, in which case this data should be added to Bug 52719, which tracks that as a separate issue.  Bug 45400 also depends on this facet of the broader "requeue pi" picture.

------- Comment From fultonm@ca.ibm.com 2009-10-21 16:49 EDT-------
(In reply to comment #92)
> I am pretty sure another instance of this problem can be seen with the
> real-time microbenchmark (see mike fulton or john stultz if you don't have it
> yet)

sourceforge link to download rtmb benchmark is:
https://sourceforge.net/projects/rtmicrobench/
Please note that the rtmb.zip binary is -not- the latest code. There are a
couple of changes that have occurred since then, so it is best to get the
source via svn.
Comment 57 IBM Bug Proxy 2010-07-15 13:51:32 EDT
------- Comment From dvhltc@us.ibm.com 2010-07-15 13:42 EDT-------
The kernel code has been accepted and glibc now uses the new futex op-codes. The glibc internal lock not being PI is a separate issue and is tracked via Bug 45400. Closing out this bug.
