499507 – 'CONFIG_PREEMPT_BKL=y' in RHEL5 kernel config breaks the 1-1 relationship between 'exclusive' schedule() calls and wake_up()

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Peter Zijlstra
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-05-06 21:48 UTC by AJ Lewis
Modified:	2014-08-11 05:40 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-06-03 12:48:59 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Tarball containing test code to exercise the described bug (5.69 KB, application/x-bzip) 2009-05-06 21:48 UTC, AJ Lewis

Description AJ Lewis 2009-05-06 21:48:39 UTC

Created attachment 342733
Tarball containing test code to exercise the described bug

Description of problem:

With CONFIG_PREEMPT_BKL=y in the kernel config, a kernel thread woken
from a wait queue can end up back in schedule() in state
TASK_UNINTERRUPTIBLE while it is still queued on that wait queue.
Anyone who then tries to wake the next thread on the wait queue simply
wakes the original thread again.  As a result, wake_up() calls and
wakeups of 'exclusive' waiters are not 1-1, as they are when
CONFIG_PREEMPT_BKL is off.

Version-Release number of selected component (if applicable):

2.6.18-53.1.4.el5
2.6.18-92.el5
2.6.18-128.el5

How reproducible:

Since this is exercising the BKL, this bug only shows up on SMP systems.

Highly reproducible with the attached test (a simple kernel module, a
binary, and a test script to reproduce the bug).

More difficult to reproduce in normal situations.  There is a race
between the activation/state change of the task by 'try_to_wake_up()'
and the deactivation/state change of the task by
'reacquire_kernel_lock()'.  Another thread also needs to be holding
the BKL while the original woken thread is attempting to reacquire it,
*and* yet another thread needs to be trying to wake the next item on
the wait queue.  This is probably only possible on systems with 3 or
more cores, though those are rather common these days.

Steps to Reproduce:

1. Run test script 'test_bklbug.sh' in attachment as root. This will
compile the code, create the character device node, and run the test,
cleaning up after itself when finished.  The test should take less
than 10 seconds to run.

If the bug is detected, the test script will output:
*** BKL BUG PRESENT ***
if not, it will output:
*** NO BKL BUG DETECTED ***

Actual results:

On kernels with CONFIG_PREEMPT_BKL=y, one thread does not get woken.
The dmesg output will show that the same thread gets signaled twice.

This is because wake_up() is not guaranteed to be one-to-one on
threads that hold the BKL and are scheduled exclusive: after being
woken, a thread can go back to sleep in schedule() if someone else
currently holds the BKL.

Expected results:

On kernels with CONFIG_PREEMPT_BKL=n (or on kernels where it doesn't
exist - RHEL4 kernels for example) both threads get woken.

This is because wake_up() is always one-to-one on threads that are
scheduled exclusive, regardless of whether they hold the BKL or not.

Additional info:

I am an employee of Quantum Corp, and this is an issue that a customer
hit using StorNext on RHEL5.  The StorNext file system expects that
wake_up() will wake one, and only one, exclusive wait queue task, and
that a later wake_up() will not signal the already-woken task again.
This has been valid on Linux in 2.4 kernels, and is currently valid on
RHEL4 and SLES10 systems.

This bug showed up with the addition of the CONFIG_PREEMPT_BKL kernel
config option in 2.6.11 (http://kerneltrap.org/node/3843).  It has
since been "fixed" in 2.6.26 when Linus reverted most of the BKL
changes that had happened since 2.6.7.
(http://kerneltrap.org/Linux/Removing_the_Big_Kernel_Lock)

The order of events to cause the problem in the test module is as follows:

Sleeper 1 == T1
Sleeper 2 == T2
Waker     == T3

T1: comes in via vfs ioctl, which grabs the BKL
T1: calls 'schedule()', BKL dropped
T2: comes in via vfs ioctl, which grabs the BKL
T2: calls 'schedule()', BKL dropped
T3: grabs BKL
T3: calls 'wake_up()'

T1: 'try_to_wake_up()' operates on T1, which activates it (puts it on
    the run queue) and sets its task state to TASK_RUNNING

T1: starts to come out of 'schedule()', which calls
    'reacquire_kernel_lock()' to grab the BKL.  T1 can't get it
    because T3 holds it, so
    'reacquire_kernel_lock()->__reacquire_kernel_lock()->down()' calls
    '__down()', which sets the task to TASK_UNINTERRUPTIBLE, puts it on
    the BKL semaphore's own wait queue, and calls 'schedule()', which
    deactivates it (takes it off the run queue)

T3: 'wake_up()' has in the meantime returned; after a short spin, T3
    calls 'wake_up()' again

T1: since T1 is once again in TASK_UNINTERRUPTIBLE and not on the run
    queue, T1 gets the wake_up again, but it's still waiting for the
    BKL, so nothing happens.

T3: drops the BKL

T1: acquires the BKL and goes on its way, dropping the BKL when it's done

T2: never gets woken: T3 has already sent its 2 wakeups, both of which
    went to T1, so T3 doesn't know there's anyone still waiting.
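
The sequence above can be sketched as a small user-space model.  This is
not the attached test module and not kernel code: every model_* name is
hypothetical, and the model assumes that both sleepers stay on the
module's wait queue until they return from schedule(), which is what the
second wake_up() hitting T1 implies.  It only illustrates why the second
wakeup lands on T1 when the BKL is a sleeping lock and on T2 when it is
a spinlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; names mirror the kernel's but are hypothetical. */
enum model_state { MODEL_TASK_RUNNING, MODEL_TASK_UNINTERRUPTIBLE };

struct model_task {
    enum model_state state;
    int wakeups;              /* successful wakeups delivered to this task */
};

struct model_result {
    int t1_wakeups;
    int t2_wakeups;
};

/* Like try_to_wake_up(): succeeds only if the task is currently asleep. */
static bool model_try_to_wake_up(struct model_task *t)
{
    if (t->state == MODEL_TASK_RUNNING)
        return false;
    t->state = MODEL_TASK_RUNNING;
    t->wakeups++;
    return true;
}

/* Like wake_up() on exclusive waiters: stop after the first successful wakeup. */
static void model_wake_up(struct model_task **queue, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (model_try_to_wake_up(queue[i]))
            return;
}

/* T1 tries to retake the BKL while T3 still holds it. */
static void model_reacquire_bkl_contended(struct model_task *t, bool preempt_bkl)
{
    if (preempt_bkl)
        t->state = MODEL_TASK_UNINTERRUPTIBLE;  /* semaphore: down() sleeps */
    /* otherwise the BKL is a spinlock and the task spins while still running */
}

/* Replays T3's two wake_up() calls and reports who received them. */
struct model_result model_run_scenario(bool preempt_bkl)
{
    struct model_task t1 = { MODEL_TASK_UNINTERRUPTIBLE, 0 };
    struct model_task t2 = { MODEL_TASK_UNINTERRUPTIBLE, 0 };
    struct model_task *queue[] = { &t1, &t2 };   /* assumption: both stay queued */
    struct model_result r;

    model_wake_up(queue, 2);                     /* T3: first wake_up() */
    assert(t1.wakeups == 1 && t2.wakeups == 0);  /* head of the queue is woken */
    model_reacquire_bkl_contended(&t1, preempt_bkl);
    model_wake_up(queue, 2);                     /* T3: second wake_up() */

    r.t1_wakeups = t1.wakeups;
    r.t2_wakeups = t2.wakeups;
    return r;
}
```

In this model, model_run_scenario(true) reports two wakeups for T1 and
none for T2, matching the dmesg symptom of the same thread being
signaled twice, while model_run_scenario(false) reports one wakeup each.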

Comment 1 RHEL Program Management 2014-03-07 13:43:06 UTC

This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.

Comment 2 AJ Lewis 2014-03-31 14:10:24 UTC

What info do you need from me?

Comment 3 RHEL Program Management 2014-06-03 12:48:59 UTC

Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).

Comment 4 AJ Lewis 2014-06-23 11:32:56 UTC

Clearing needinfo flag - this won't get resolved in RHEL5 and isn't an issue in RHEL6+
