Description of problem:

We are experiencing an application hang that appears to be caused by a problem purportedly fixed in kernel 2.6.21.7. The application is running a stress test with many threads. Eventually all threads become blocked waiting for a chain of related events. None of the threads can proceed, however, because one thread has been "deadlocked" by the kernel. The relevant part of the stack trace is as follows:

Thread 133 (Thread -1738126448 (LWP 1236)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x499166c6 in __pause_nocancel () from /lib/libpthread.so.0
#2  0x499115f5 in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0xb74e8975 in os::Linux::mutex_lock (mutex=0x82c0878) at /space/rw140007/hatteras-local-compile2/hatteras/src/os/linux/vm/os_linux.hpp:182

The __pause_nocancel routine will deadlock a thread if it believes the owner of the mutex concerned has died while holding that mutex. According to the 2.6.21.7 changelog, there are bugs in the underlying pi-futex code that can cause this:

http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.21.7

"commit b6c7b07330bf7271419ce7403d2551f330986af3
Author: Alexey Kuznetsov <kuznet.ac.ru>
Date: Fri Jun 8 10:29:30 2007 +0000

    pi-futex: Fix exit races and locking problems

    1. New entries can be added to tsk->pi_state_list after task completed
       exit_pi_state_list(). The result is memory leakage and deadlocks.

    2. handle_mm_fault() is called under spinlock. The result is obvious.

    3. results in self-inflicted deadlock inside glibc.
       Sometimes futex_lock_pi returns -ESRCH, when it is not expected
       and glibc enters to for(;;) sleep() to simulate deadlock. This problem
       is quite obvious and I think the patch is right. Though it looks like
       each "if" in futex_lock_pi() got some stupid special case "else if". :-)

    4. sometimes futex_lock_pi() returns -EDEADLK, ..."

We say this is "purported to have been fixed" because we have the same failure on a different distribution based on a 2.6.22 kernel. We are also uncertain whether RHEL-RT is already at the 2.6.21.7 level.

How reproducible:

The test application encounters this each time I run it.
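For context on how the hang is exercised, here is a minimal sketch (not the original test application; all names and parameters are ours) of the kind of stress pattern described above: many threads contending on a PTHREAD_PRIO_INHERIT mutex while signals keep interrupting the thread blocked in FUTEX_LOCK_PI.

    /*
     * Hypothetical reduced stress pattern (not the original test application):
     * many threads contend on a PTHREAD_PRIO_INHERIT mutex while signals are
     * delivered periodically, so the FUTEX_LOCK_PI syscall can be interrupted
     * inside the kernel. On the buggy kernels described above a thread can end
     * up parked forever inside pthread_mutex_lock().
     */
    #include <pthread.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS 64

    static pthread_mutex_t m;

    static void sigusr1(int sig) { (void)sig; /* just interrupt the syscall */ }

    static void *worker(void *arg)
    {
            (void)arg;
            for (;;) {
                    if (pthread_mutex_lock(&m) != 0)
                            abort();
                    /* short critical section */
                    if (pthread_mutex_unlock(&m) != 0)
                            abort();
            }
            return NULL;
    }

    int main(void)
    {
            pthread_mutexattr_t attr;
            pthread_t tids[NTHREADS];
            struct sigaction sa;
            int i;

            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = sigusr1;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGUSR1, &sa, NULL);     /* no SA_RESTART: let futex calls see -EINTR */

            pthread_mutexattr_init(&attr);
            /* PI mutex: contended lock/unlock go through FUTEX_LOCK_PI / FUTEX_UNLOCK_PI */
            pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
            pthread_mutex_init(&m, &attr);

            for (i = 0; i < NTHREADS; i++)
                    pthread_create(&tids[i], NULL, worker, NULL);

            /* keep poking threads with signals so futex_lock_pi() is interrupted */
            for (;;) {
                    pthread_kill(tids[rand() % NTHREADS], SIGUSR1);
                    usleep(1000);
            }
            return 0;
    }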
We are currently rebasing to 2.6.21.7 (the previous base was 2.6.21.5). David, we'll let you know when it's available. Could you test it then to see if it solves the issue for you? We'll run it through our test suite tonight. If it passes, we should have it available tomorrow. Of course, if we run into issues with it, it may take a bit more time.
We will install and test as soon as practical. Thanks.
We've done some kernel level tracing. We've found that the failures occur when three threads (t1, t2, t3) operate on the same futex f. Here is the chain of events leading to the problem:

- t1 is the owner of f.
- t2 tries to acquire f. It fails in userland, so it uses the futex syscall with the FUTEX_LOCK_PI command. t2 blocks on the kernel PI mutex associated with the futex, in the rt_mutex_timed_lock() call in the futex_lock_pi() function.
- t1 releases f. It uses the futex syscall with the FUTEX_UNLOCK_PI command. It finds t2 waiting on the futex and elects it as the next owner of the futex. It sets f's userland value to the tid of t2 and releases the kernel PI mutex.
- In the meantime, t2 receives a signal and returns from rt_mutex_timed_lock() with -EINTR. It does not own the kernel PI mutex.
- t3 tries to acquire f. f's userland value contains t2's tid, so f is not free. t3 enters the kernel with the FUTEX_LOCK_PI command and grabs the kernel PI mutex, which is free (t2 failed to acquire it and t1 released it).
- t2 now exits the futex_lock_pi() function and the kernel. It grabs the spinlock, but because rt_mutex_timed_lock() returned an error and because it cannot grab the kernel PI mutex, the userland value of the futex is not modified: it still contains t2's tid.
- t2 attempts the FUTEX_LOCK_PI command again because the previous attempt failed with EINTR. One of the first checks performed in futex_lock_pi() is against the userland value of the futex. It contains t2's tid, so the futex syscall returns EDEADLK (the user-space side of this protocol is sketched after the patch below). When the libc sees this error code, it hangs t2.

The futex will eventually get back to a consistent state: t3 will exit from futex_lock_pi(), and in the process, because it owns the kernel PI mutex while not being the recorded owner of the futex, the futex state will be fixed.

An attempted fix follows (patch against the Red Hat RT 2.6.21 kernel). When t2 detects on exit from futex_lock_pi() that it is recorded as the owner of the futex while not owning the kernel PI mutex, it changes the userland futex value to the tid of the kernel PI mutex owner.

--- kernel-2.6.21/linux-2.6.21.i686/kernel/futex.c	2008-01-04 14:54:55.000000000 +0000
+++ kernel-2.6.21-futexfix/linux-2.6.21.i686/kernel/futex.c	2008-01-04 14:02:01.000000000 +0000
@@ -532,22 +532,22 @@
 		 * the refcount and return its pi_state:
 		 */
 		pi_state = this->pi_state;

 		/*
 		 * Userspace might have messed up non PI and PI futexes
 		 */
 		if (unlikely(!pi_state))
 			return -EINVAL;

 		WARN_ON(!atomic_read(&pi_state->refcount));
-		WARN_ON(pid && pi_state->owner &&
-			pi_state->owner->pid != pid);
+/*		WARN_ON(pid && pi_state->owner && */
+/*			pi_state->owner->pid != pid); */

 		atomic_inc(&pi_state->refcount);
 		*ps = pi_state;

 		return 0;
 	}
 }

 /*
  * We are the first waiter - try to look up the real owner and attach
@@ -1905,20 +1905,41 @@
 			 * Paranoia check. If we did not take the lock
 			 * in the trylock above, then we should not be
 			 * the owner of the rtmutex, neither the real
 			 * nor the pending one:
 			 */
 			if (rt_mutex_owner(&q.pi_state->pi_mutex) == curr)
 				printk(KERN_ERR "futex_lock_pi: ret = %d "
 				       "pi-mutex: %p pi-state %p\n", ret,
 				       q.pi_state->pi_mutex.owner,
 				       q.pi_state->owner);
+
+			if (q.pi_state->owner == curr) {
+				int ret;
+				struct task_struct *owner = rt_mutex_owner(&q.pi_state->pi_mutex);
+				u32 newtid = owner->pid | FUTEX_WAITERS;
+				u32 uval, curval, newval;
+
+				ret = get_futex_value_locked(&uval, uaddr);
+				while (!ret) {
+					newval = (uval & FUTEX_OWNER_DIED) | newtid;
+					newval |= (uval & FUTEX_WAITER_REQUEUED);
+
+					curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+
+					if (curval == -EFAULT)
+						ret = -EFAULT;
+					if (curval == uval)
+						break;
+					uval = curval;
+				}
+			}
 		}
 	}

 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);

 	futex_unlock_mm(fshared);

 	return ret != -EINTR ? ret : -ERESTARTNOINTR;

  out_unlock_release_sem:
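To make the EDEADLK step above concrete, here is an illustrative sketch of the user-space side of the PI futex protocol, heavily simplified from what glibc actually does (pi_lock is our name, not a glibc function): the futex word carries the owner's tid, and FUTEX_LOCK_PI fails with EDEADLK when that word already names the caller, which is exactly the state t2 is left in after the race.

    /*
     * Illustrative sketch of the user-space side of the PI futex protocol
     * (simplified; not glibc's actual implementation). The futex word holds
     * the tid of the owner; the kernel is only entered on contention. A stale
     * tid left in the word by the race described above makes the kernel
     * report -EDEADLK to the very thread named in the word.
     */
    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int pi_lock(uint32_t *futex_word)
    {
            uint32_t tid = (uint32_t)syscall(SYS_gettid);

            /* Fast path: 0 -> our tid means we now own the lock, no syscall. */
            if (__sync_bool_compare_and_swap(futex_word, 0, tid))
                    return 0;

            /*
             * Slow path: ask the kernel. futex_lock_pi() inspects the word;
             * if it already contains our own tid it returns -EDEADLK, and
             * glibc then parks the thread as seen in the stack trace above.
             */
            return (int)syscall(SYS_futex, futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
    }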
Your analysis is correct. We have a transient state where the user space value is wrong. The fix is not completely correct, as it creates a new, although extremely tight, race window due to the unlocked access to the rtmutex owner. Not sure yet whether it matters or not; I'll have a closer look.

Thanks, tglx
Created attachment 291060: mainline fix

The attached patch is the fix for mainline. Roland confirmed that it fixes the bug in mainline. Clark has a backport for rhel-rt for the new release.
Can we confirm that the latest Red Hat RT kernel (2.6.24.1-24.el5rt) has the mainline fix?
I just confirmed that this patch is in the 2.6.24.4-30.el5rt kernel. Roland, if you concur, I think we can close this.

Clark
I confirm that the bug is fixed in 2.6.24.4-30.el5rt. It can be closed.
We have commenced more extensive testing on 2.6.24-30.el5rt and are finding a new failure: pthread_mutex_unlock is returning EPERM in cases where we definitely hold the mutex. I noticed this fix operated on the "owner" field a bit and was wondering whether it may have caused this new problem.
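For reference, this is the invariant our test effectively checks (a hypothetical reduced version, not our actual suite; the real failure only shows up under the multi-threaded stress load): the same thread locks and immediately unlocks a PTHREAD_PRIO_INHERIT mutex, so unlock should never return EPERM.

    /*
     * Hypothetical reduced check for the reported EPERM failure (not the
     * original test suite): the locking thread is also the unlocking thread,
     * so pthread_mutex_unlock() returning EPERM would match the failure
     * described above.
     */
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
            pthread_mutexattr_t attr;
            pthread_mutex_t m;
            int rc;

            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
            pthread_mutex_init(&m, &attr);

            for (;;) {
                    if (pthread_mutex_lock(&m) != 0)
                            return 1;
                    rc = pthread_mutex_unlock(&m);
                    if (rc != 0) {
                            fprintf(stderr, "pthread_mutex_unlock: %d (%s)\n",
                                    rc, rc == EPERM ? "EPERM" : "other");
                            return 1;
                    }
            }
            return 0;
    }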