Created attachment 342733
Tarball containing test code to exercise the described bug

Description of problem:
CONFIG_PREEMPT_BKL=y in kernel config can cause kernel threads woken from a wait queue to end up back in schedule() with a status of TASK_UNINTERRUPTIBLE, which means that anyone that tries to wake the next thread on the wait queue will simply signal the original thread to wake again. This means that wake_up() on 'exclusive' scheduled kernel threads is not a 1-1 relationship as it is when CONFIG_PREEMPT_BKL is not on.

Version-Release number of selected component (if applicable):
2.6.18-53.1.4.el5
2.6.18-92.el5
2.6.18-128.el5

How reproducible:
Since this is exercising the BKL, this bug only shows up on SMP systems. Very reproducible with attached test (simple kernel module, binary, and test script to reproduce bug). More difficult to reproduce in normal situations. There is a race between the activation/status change of the task by 'try_to_wake_up()' and the deactivation/status change of the task by 'reacquire_kernel_lock()'. There also needs to be another thread holding the BKL while the original woken thread is attempting to reacquire it, *and* another thread needs to be trying to wake the next item on the wait queue. This is probably only possible on 3+ core systems, though that is rather common these days.

Steps to Reproduce:
1. Run test script 'test_bklbug.sh' in attachment as root. This will compile the code, create the character device node, and run the test, cleaning up after itself when finished. The test should take less than 10 seconds to run.

If the bug is detected, the test script will output:
*** BKL BUG PRESENT ***
if not, it will output:
*** NO BKL BUG DETECTED ***

Actual results:
On kernels with CONFIG_PREEMPT_BKL=y, one thread does not get woken. The dmesg output will show that the same thread gets signaled twice.
This is because wake_up() is not guaranteed to be one-to-one on threads that hold the BKL and are scheduled exclusive: the thread can get back into schedule() after waking up if someone else currently holds the BKL.

Expected results:
On kernels with CONFIG_PREEMPT_BKL=n (or on kernels where it doesn't exist, RHEL4 kernels for example) both threads get woken. This is because wake_up() is always one-to-one on threads that are scheduled exclusive, regardless of whether they hold the BKL or not.

Additional info:
I am an employee of Quantum Corp, and this is an issue that a customer hit using StorNext on RHEL5. The expectation of the StorNext file system is that wake_up() will wake one, and only one, exclusive waitqueue task and, once that task is woken, will not signal it again. This has been valid on Linux in 2.4 kernels, and is currently valid on RHEL4 and SLES10 systems.

This bug showed up with the addition of the CONFIG_PREEMPT_BKL kernel config option in 2.6.11 (http://kerneltrap.org/node/3843). It has since been "fixed" in 2.6.26 when Linus reverted most of the BKL changes that had happened since 2.6.7.
(http://kerneltrap.org/Linux/Removing_the_Big_Kernel_Lock)

The order of events to cause the problem in the test module is as follows:

Sleeper 1 == T1
Sleeper 2 == T2
Waker == T3

T1: comes in via vfs ioctl, which grabs the BKL
T1: calls 'schedule()', BKL dropped
T2: comes in via vfs ioctl, which grabs the BKL
T2: calls 'schedule()', BKL dropped
T3: grabs BKL
T3: calls 'wake_up()'
T1: 'try_to_wake_up()' (called from T3's 'wake_up()') operates on T1, which activates it (puts it on the run queue) and sets its task state to TASK_RUNNING
T1: starts to come out of 'schedule()', which calls 'reacquire_kernel_lock()' to grab the BKL. It can't get the BKL because T3 holds it, so 'reacquire_kernel_lock()->__reacquire_kernel_lock()->down()' calls '__down()', which sets the task to TASK_UNINTERRUPTIBLE, puts it on its own wait queue (that of the BKL semaphore), and calls 'schedule()', which deactivates it (takes it off the run queue)
T3: 'wake_up()' has in the meantime returned; after a short spin, T3 calls 'wake_up()' again
T1: since T1 is once again in TASK_UNINTERRUPTIBLE and not on the run queue, T1 gets the wake_up again, but it's still waiting for the BKL, so nothing happens.
T3: drops the BKL
T1: acquires the BKL and goes on its way, dropping the BKL when it's done
T2: doesn't get signaled because 2 signals were sent by T3 already, so it doesn't know there's anyone waiting.
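The sequence above can be sketched as a small userspace model. This is not kernel code and not the attached test module; it is a minimal illustration under stated assumptions. The function names mirror the kernel functions mentioned in the report, but their bodies are simplified stand-ins, and 'simulate()' and 'reacquire_kernel_lock_contended()' are hypothetical helpers. The model assumes both sleepers wait in TASK_UNINTERRUPTIBLE and that a woken sleeper stays on the wait queue until it returns from 'schedule()' and removes itself.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified task states; names mirror the kernel's, values are arbitrary. */
enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

struct task {
    enum task_state state;
    int wakeups;            /* successful wakeups delivered to this task */
};

/* Stand-in for try_to_wake_up(): succeeds only if the task is sleeping. */
static int try_to_wake_up(struct task *t)
{
    if (t->state == TASK_RUNNING)
        return 0;
    t->state = TASK_RUNNING;
    t->wakeups++;
    return 1;
}

/* Stand-in for wake_up() on exclusive waiters: walk the queue in order
 * and stop after the first waiter that was actually woken. */
static void wake_up(struct task **queue, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (try_to_wake_up(queue[i]))
            break;
}

/* Hypothetical helper: T1 tries to retake the BKL while T3 holds it. */
static void reacquire_kernel_lock_contended(struct task *t, int preempt_bkl)
{
    if (preempt_bkl)
        t->state = TASK_UNINTERRUPTIBLE;  /* semaphore BKL: __down() sleeps */
    /* spinlock BKL: the task spins and stays TASK_RUNNING */
}

/* Returns how many distinct sleepers received a wakeup from T3. */
int simulate(int preempt_bkl)
{
    struct task t1 = { TASK_UNINTERRUPTIBLE, 0 };
    struct task t2 = { TASK_UNINTERRUPTIBLE, 0 };
    struct task *queue[] = { &t1, &t2 };

    wake_up(queue, 2);                                  /* T3: first wake_up() */
    reacquire_kernel_lock_contended(&t1, preempt_bkl);  /* T1: blocked on BKL */
    wake_up(queue, 2);                                  /* T3: second wake_up() */

    /* Both calls deliver exactly one wakeup; only the target differs. */
    assert(t1.wakeups + t2.wakeups == 2);
    return (t1.wakeups > 0) + (t2.wakeups > 0);
}
```

In this model, 'simulate(1)' (semaphore-based BKL) reports one distinct sleeper woken, because both wakeups land on T1, while 'simulate(0)' (spinning BKL) reports two, because the second 'wake_up()' skips the already-running T1 and reaches T2.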
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
What info do you need from me?
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
Clearing needinfo flag - this won't get resolved in RHEL5 and isn't an issue in RHEL6+