Red Hat Bugzilla – Bug 499507
'CONFIG_PREEMPT_BKL=y' in RHEL5 kernel config breaks the 1-1 relationship between 'exclusive' schedule() calls and wake_up()
Last modified: 2014-08-11 01:40:53 EDT
Created attachment 342733 [details]
Tarball containing test code to exercise the described bug
Description of problem:
CONFIG_PREEMPT_BKL=y in kernel config can cause kernel threads woken
from a wait queue to end up back in schedule() with a status of
TASK_UNINTERRUPTIBLE, which means that anyone that tries to wake the
next thread on the wait queue will simply signal the original thread
to wake again. This means that wake_up() on 'exclusive' scheduled
kernel threads is not a 1-1 relationship as it is when
CONFIG_PREEMPT_BKL is not on.
Version-Release number of selected component (if applicable):
Since this is exercising the BKL, this bug only shows up on SMP systems.
Very reproducible with attached test (simple kernel module, binary,
and test script to reproduce bug).
More difficult to reproduce in normal situations. There is a race
between the activation/status change of the task by 'try_to_wake_up()'
and the deactivation/status change of the task by
'reacquire_kernel_lock()'. There also needs to be another thread
holding the BKL while the original woken thread is attempting to
reacquire it, *and* another thread needs to be trying to wake the next
item on the wait queue. This is probably only possible on 3+ core
systems, though that is rather common these days.
Steps to Reproduce:
1. Run test script 'test_bklbug.sh' in attachment as root. This will
compile the code, create the character device node, and run the test,
cleaning up after itself when finished. The test should take less
than 10 seconds to run.
If the bug is detected, the test script will output:
*** BKL BUG PRESENT ***
if not, it will output:
*** NO BKL BUG DETECTED ***
On kernels with CONFIG_PREEMPT_BKL=y, one thread does not get woken.
The dmesg output will show that the same thread gets signaled twice.
This is because wake_up() is not guaranteed to be one-to-one on
threads that hold the BKL and are scheduled exclusive, due to the fact
that the thread can get back into schedule() after waking up if
someone currently holds the BKL.
On kernels with CONFIG_PREEMPT_BKL=n (or on kernels where it doesn't
exist - RHEL4 kernels for example) both threads get woken.
This is because wake_up() is always one-to-one on threads that are
scheduled exclusive, regardless of whether they hold the BKL or not.
I am an employee of Quantum Corp, and this is an issue that a customer
hit using StorNext on RHEL5. The expectation of the StorNext file
system is that wake_up() will wake one, and only one, exclusive
waitqueue task, and once woken, will not signal it again. This has
been valid on Linux in 2.4 kernels, and is currently valid on RHEL4
and SLES10 systems.
This bug showed up with the addition of the CONFIG_PREEMPT_BKL kernel
config option in 2.6.11 (http://kerneltrap.org/node/3843). It has
since been "fixed" in 2.6.26 when Linus reverted most of the BKL
changes that had happened since 2.6.7.
The order of events to cause the problem in the test module is as follows:
Sleeper 1 == T1
Sleeper 2 == T2
Waker == T3
T1: comes in via vfs ioctl, which grabs the BKL
T1: calls 'schedule()', BKL dropped
T2: comes in via vfs ioctl, which grabs the BKL
T2: calls 'schedule()', BKL dropped
T3: grabs BKL
T3: calls 'wake_up()'
T1: 'try_to_wake_up()' operates on T1, which activates it (puts it on
the run queue), sets its state to TASK_RUNNING
T1: starts to come out of 'schedule()', which calls
'reacquire_kernel_lock()' to grab the BKL, which it can't get
because T3 holds it, so it ends up in
'__down()', which sets the task to TASK_UNINTERRUPTIBLE, puts it on
its own wait queue, and calls 'schedule()', which deactivates it
(takes it off the run queue)
T3: 'wake_up()' has in the meantime returned; after a short spin, T3 calls 'wake_up()' again
T1: since T1 is once again in TASK_UNINTERRUPTIBLE and not on the run
queue, T1 gets the wake_up again, but it's still waiting for the
BKL, so nothing happens.
T3: drops the BKL
T1: acquires the BKL and goes on its way, dropping the BKL when it's done
T2: doesn't get signaled because 2 signals were sent by T3 already, so
it doesn't know there's anyone waiting.
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
What info do you need from me?
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
Clearing needinfo flag - this won't get resolved in RHEL5 and isn't an issue in RHEL6+