Description of problem:

We are experiencing an application hang that appears to be caused by a problem purportedly fixed in kernel 2.6.21.7. The application is running a stress test with many threads. Eventually all threads become blocked waiting for a chain of related events. None of the threads can proceed, however, because one thread has been "deadlocked" by the kernel. The relevant part of the stack trace is as follows:

Thread 133 (Thread -1738126448 (LWP 1236)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x499166c6 in __pause_nocancel () from /lib/libpthread.so.0
#2  0x499115f5 in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0xb74e8975 in os::Linux::mutex_lock (mutex=0x82c0878) at /space/rw140007/hatteras-local-compile2/hatteras/src/os/linux/vm/os_linux.hpp:182

The __pause_nocancel routine will deadlock a thread if it believes the owner of the mutex concerned has died while holding that mutex. According to the 2.6.21.7 changelog, there are bugs in the underlying pi-futex code that can cause this:

http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.21.7

"commit b6c7b07330bf7271419ce7403d2551f330986af3
Author: Alexey Kuznetsov <kuznet.ac.ru>
Date: Fri Jun 8 10:29:30 2007 +0000

    pi-futex: Fix exit races and locking problems

    1. New entries can be added to tsk->pi_state_list after task completed
       exit_pi_state_list(). The result is memory leakage and deadlocks.

    2. handle_mm_fault() is called under spinlock. The result is obvious.

    3. results in self-inflicted deadlock inside glibc.
       Sometimes futex_lock_pi returns -ESRCH, when it is not expected
       and glibc enters to for(;;) sleep() to simulate deadlock. This problem
       is quite obvious and I think the patch is right. Though it looks like
       each "if" in futex_lock_pi() got some stupid special case "else if". :-)

    4. sometimes futex_lock_pi() returns -EDEADLK, ..."

We say this is "purported to have been fixed" because we have the same failure on a different distribution based on a 2.6.22 kernel. We are also uncertain whether RHEL-RT is already at the 2.6.21.7 level.

How reproducible:

The test application encounters this each time I run it.
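For context on how the hang is exercised, here is a minimal sketch (not the original test application; all names and parameters are ours) of the kind of stress pattern described above: many threads contending on a PTHREAD_PRIO_INHERIT mutex while signals keep interrupting the thread blocked in FUTEX_LOCK_PI.

    /*
     * Hypothetical reduced stress pattern (not the original test application):
     * many threads contend on a PTHREAD_PRIO_INHERIT mutex while signals are
     * delivered periodically, so the FUTEX_LOCK_PI syscall can be interrupted
     * inside the kernel. On the buggy kernels described above a thread can end
     * up parked forever inside pthread_mutex_lock().
     */
    #include <pthread.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS 64

    static pthread_mutex_t m;

    static void sigusr1(int sig) { (void)sig; /* just interrupt the syscall */ }

    static void *worker(void *arg)
    {
            (void)arg;
            for (;;) {
                    if (pthread_mutex_lock(&m) != 0)
                            abort();
                    /* short critical section */
                    if (pthread_mutex_unlock(&m) != 0)
                            abort();
            }
            return NULL;
    }

    int main(void)
    {
            pthread_mutexattr_t attr;
            pthread_t tids[NTHREADS];
            struct sigaction sa;
            int i;

            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = sigusr1;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGUSR1, &sa, NULL);     /* no SA_RESTART: let futex calls see -EINTR */

            pthread_mutexattr_init(&attr);
            /* PI mutex: contended lock/unlock go through FUTEX_LOCK_PI / FUTEX_UNLOCK_PI */
            pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
            pthread_mutex_init(&m, &attr);

            for (i = 0; i < NTHREADS; i++)
                    pthread_create(&tids[i], NULL, worker, NULL);

            /* keep poking threads with signals so futex_lock_pi() is interrupted */
            for (;;) {
                    pthread_kill(tids[rand() % NTHREADS], SIGUSR1);
                    usleep(1000);
            }
            return 0;
    }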
We are currently rebasing to 2.6.21.7 (the previous base was 2.6.21.5). David, we'll let you know when it's available. Could you test it then to see if it solves the issue for you? We'll run it through our test suite tonight. If it passes, we should have it available tomorrow. Of course, if we run into issues with it, it may take a bit more time.
We will install and test as soon as practical. Thanks.
We've done some kernel level tracing. We've found that the failures occur when three threads (t1, t2, t3) operate on the same futex f. Here is the chain of events leading to the problem:

- t1 is the owner of f.
- t2 tries to acquire f. It fails in userland, so it uses the futex syscall with the FUTEX_LOCK_PI command. t2 blocks on the kernel PI mutex associated with the futex, in the rt_mutex_timed_lock() call in the futex_lock_pi() function.
- t1 releases f. It uses the futex syscall with the FUTEX_UNLOCK_PI command. It finds t2 waiting on the futex and elects it as the next owner of the futex. It sets f's userland value to the tid of t2 and releases the kernel PI mutex.
- In the meantime, t2 receives a signal and returns from rt_mutex_timed_lock() with -EINTR. It does not own the kernel PI mutex.
- t3 tries to acquire f. f's userland value contains t2's tid, so f is not free. t3 enters the kernel with the FUTEX_LOCK_PI command and grabs the kernel PI mutex, which is free (t2 failed to acquire it and t1 released it).
- t2 now exits the futex_lock_pi() function and the kernel. It grabs the spinlock, but because rt_mutex_timed_lock() returned an error and because it cannot grab the kernel PI mutex, the userland value of the futex is not modified: it still contains t2's tid.
- t2 attempts the FUTEX_LOCK_PI command again because the previous attempt failed with EINTR. One of the first checks performed in futex_lock_pi() is against the userland value of the futex. It contains t2's tid, so the futex syscall returns EDEADLK (the user-space side of this protocol is sketched after the patch below). When the libc sees this error code, it hangs t2.

The futex will eventually get back to a consistent state: t3 will exit from futex_lock_pi(), and in the process, because it owns the kernel PI mutex while not being the recorded owner of the futex, the futex state will be fixed.

An attempted fix follows (patch against the Red Hat RT 2.6.21 kernel). When t2 detects on exit from futex_lock_pi() that it is recorded as the owner of the futex while not owning the kernel PI mutex, it changes the userland futex value to the tid of the kernel PI mutex owner.

--- kernel-2.6.21/linux-2.6.21.i686/kernel/futex.c	2008-01-04 14:54:55.000000000 +0000
+++ kernel-2.6.21-futexfix/linux-2.6.21.i686/kernel/futex.c	2008-01-04 14:02:01.000000000 +0000
@@ -532,22 +532,22 @@
 		 * the refcount and return its pi_state:
 		 */
 		pi_state = this->pi_state;

 		/*
 		 * Userspace might have messed up non PI and PI futexes
 		 */
 		if (unlikely(!pi_state))
 			return -EINVAL;

 		WARN_ON(!atomic_read(&pi_state->refcount));
-		WARN_ON(pid && pi_state->owner &&
-			pi_state->owner->pid != pid);
+/*		WARN_ON(pid && pi_state->owner && */
+/*			pi_state->owner->pid != pid); */

 		atomic_inc(&pi_state->refcount);
 		*ps = pi_state;

 		return 0;
 	}
 }

 /*
  * We are the first waiter - try to look up the real owner and attach
@@ -1905,20 +1905,41 @@
 			 * Paranoia check. If we did not take the lock
 			 * in the trylock above, then we should not be
 			 * the owner of the rtmutex, neither the real
 			 * nor the pending one:
 			 */
 			if (rt_mutex_owner(&q.pi_state->pi_mutex) == curr)
 				printk(KERN_ERR "futex_lock_pi: ret = %d "
 				       "pi-mutex: %p pi-state %p\n", ret,
 				       q.pi_state->pi_mutex.owner,
 				       q.pi_state->owner);
+
+			if (q.pi_state->owner == curr) {
+				int ret;
+				struct task_struct *owner = rt_mutex_owner(&q.pi_state->pi_mutex);
+				u32 newtid = owner->pid | FUTEX_WAITERS;
+				u32 uval, curval, newval;
+
+				ret = get_futex_value_locked(&uval, uaddr);
+				while (!ret) {
+					newval = (uval & FUTEX_OWNER_DIED) | newtid;
+					newval |= (uval & FUTEX_WAITER_REQUEUED);
+
+					curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+
+					if (curval == -EFAULT)
+						ret = -EFAULT;
+					if (curval == uval)
+						break;
+					uval = curval;
+				}
+			}
 		}
 	}

 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);

 	futex_unlock_mm(fshared);

 	return ret != -EINTR ? ret : -ERESTARTNOINTR;

  out_unlock_release_sem:
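To make the EDEADLK step above concrete, here is an illustrative sketch of the user-space side of the PI futex protocol, heavily simplified from what glibc actually does (pi_lock is our name, not a glibc function): the futex word carries the owner's tid, and FUTEX_LOCK_PI fails with EDEADLK when that word already names the caller, which is exactly the state t2 is left in after the race.

    /*
     * Illustrative sketch of the user-space side of the PI futex protocol
     * (simplified; not glibc's actual implementation). The futex word holds
     * the tid of the owner; the kernel is only entered on contention. A stale
     * tid left in the word by the race described above makes the kernel
     * report -EDEADLK to the very thread named in the word.
     */
    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int pi_lock(uint32_t *futex_word)
    {
            uint32_t tid = (uint32_t)syscall(SYS_gettid);

            /* Fast path: 0 -> our tid means we now own the lock, no syscall. */
            if (__sync_bool_compare_and_swap(futex_word, 0, tid))
                    return 0;

            /*
             * Slow path: ask the kernel. futex_lock_pi() inspects the word;
             * if it already contains our own tid it returns -EDEADLK, and
             * glibc then parks the thread as seen in the stack trace above.
             */
            return (int)syscall(SYS_futex, futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
    }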
Your analysis is correct. We have a transient state where the user space value is wrong. The fix is not completely correct, as it creates a new, although extremely tight, race window due to the unlocked access to the rtmutex owner. Not sure yet whether it matters or not; I'll have a closer look.

Thanks, tglx
Created attachment 291060: mainline fix

The attached patch is the fix for mainline. Roland confirmed that it fixes the bug in mainline. Clark has a backport for rhel-rt for the new release.
Can we confirm that the latest Red Hat RT kernel (2.6.24.1-24.el5rt) has the mainline fix?
I just confirmed that this patch is in the 2.6.24.4-30.el5rt kernel. Roland, if you concur, I think we can close this.

Clark
I confirm that the bug is fixed in 2.6.24.4-30.el5rt. It can be closed.
We have commenced more extensive testing on 2.6.24-30.el5rt and are finding a new failure: pthread_mutex_unlock is returning EPERM in cases where we definitely hold the mutex. I noticed this fix operated on the "owner" field a bit and was wondering whether it may have caused this new problem.
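For reference, this is the invariant our test effectively checks (a hypothetical reduced version, not our actual suite; the real failure only shows up under the multi-threaded stress load): the same thread locks and immediately unlocks a PTHREAD_PRIO_INHERIT mutex, so unlock should never return EPERM.

    /*
     * Hypothetical reduced check for the reported EPERM failure (not the
     * original test suite): the locking thread is also the unlocking thread,
     * so pthread_mutex_unlock() returning EPERM would match the failure
     * described above.
     */
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
            pthread_mutexattr_t attr;
            pthread_mutex_t m;
            int rc;

            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
            pthread_mutex_init(&m, &attr);

            for (;;) {
                    if (pthread_mutex_lock(&m) != 0)
                            return 1;
                    rc = pthread_mutex_unlock(&m);
                    if (rc != 0) {
                            fprintf(stderr, "pthread_mutex_unlock: %d (%s)\n",
                                    rc, rc == EPERM ? "EPERM" : "other");
                            return 1;
                    }
            }
            return 0;
    }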