Bug 476659 - softlockups due to infinite loops in posix_locks_deadlock
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Josef Bacik
QA Contact: Red Hat Kernel QE team
Keywords: ZStream
Depends On:
Blocks: 483701 485920 496842 504279 546230
Reported: 2008-12-16 09:27 EST by Bryn M. Reeves
Modified: 2010-10-23 02:36 EDT (History)
8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 504279
Environment:
Last Closed: 2009-09-02 04:07:05 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Bryn M. Reeves 2008-12-16 09:27:31 EST
Description of problem:

From http://lkml.org/lkml/2007/10/17/314:

"We have observed hangs in posix_locks_deadlock() when multiple threads
use fcntl(2) F_SETLKW to synchronize file accesses.  The problem appears
to be due to an error in the implementation of posix_locks_deadlock() in
which "goto next_task" is used to break out of the list_for_each_entry()
file_lock search after which the posix_same_owner(caller_fl, block_fl)
test may evaluate to false and the list_for_each_entry() loop restarts
all over again.  This in turn leads to a hang where posix_locks_deadlock()
never returns.  The workaround is to change the posix_same_owner()
test within the list_for_each_entry() loop to directly compare caller_fl
against current fl entry."

A reproducer was posted here:

http://lkml.org/lkml/2007/10/17/472

Version-Release number of selected component (if applicable):
2.6.18-*

How reproducible:
Fairly straightforward - see reproducer at: http://lkml.org/lkml/2007/10/17/472

Steps to Reproduce:
1. Compile fcntltest.c
2. Run fcntltest

  
Actual results:
Softlockup warnings appear after some time:



Expected results:
No softlockup.

Additional info:
Fixed upstream by commit 97855b49b6bac0bd25f16b017883634d13591d00. A few related commits have landed since then, but this is the one that fixes the lockup above.

commit 97855b49b6bac0bd25f16b017883634d13591d00
Author: J. Bruce Fields <bfields@citi.umich.edu>
Date:   Tue Oct 30 11:20:02 2007 -0400

    locks: fix possible infinite loop in posix deadlock detection
    
    It's currently possible to send posix_locks_deadlock() into an infinite
    loop (under the BKL).
    
    For now, fix this just by bailing out after a few iterations.  We may
    want to fix this in a way that better clarifies the semantics of
    deadlock detection.  But that will take more time, and this minimal fix
    is probably adequate for any realistic scenario, and is simple enough to
    be appropriate for applying to stable kernels now.
    
    Thanks to George Davis for reporting the problem.
    
    Cc: "George G. Davis" <gdavis@mvista.com>
    Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
    Acked-by: Alan Cox <alan@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/fs/locks.c b/fs/locks.c
index 0127a28..8b8388e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -696,17 +696,28 @@ EXPORT_SYMBOL(posix_test_lock);
  * Note: the above assumption may not be true when handling lock requests
  * from a broken NFS client. But broken NFS clients have a lot more to
  * worry about than proper deadlock detection anyway... --okir
+ *
+ * However, the failure of this assumption (also possible in the case of
+ * multiple tasks sharing the same open file table) also means there's no
+ * guarantee that the loop below will terminate.  As a hack, we give up
+ * after a few iterations.
  */
+
+#define MAX_DEADLK_ITERATIONS 10
+
 static int posix_locks_deadlock(struct file_lock *caller_fl,
                                struct file_lock *block_fl)
 {
        struct file_lock *fl;
+       int i = 0;
 
 next_task:
        if (posix_same_owner(caller_fl, block_fl))
                return 1;
        list_for_each_entry(fl, &blocked_list, fl_link) {
                if (posix_same_owner(fl, block_fl)) {
+                       if (i++ > MAX_DEADLK_ITERATIONS)
+                               return 0;
                        fl = fl->fl_next;
                        block_fl = fl;
                        goto next_task;
Comment 2 Issue Tracker 2008-12-23 15:42:47 EST
I'm working on a one-off hotfix kernel build to provide the customer until
we can get the Z-stream release done.

Barring any great difficulties due to the # of patches against the RHEL5
kernel, it should be ready in a bit.  Will keep ticket updated with
status.

--vince


This event sent from IssueTracker by vincew 
 issue 240372
Comment 3 Issue Tracker 2008-12-23 17:10:09 EST
The hotfix kernel build is underway.  Here's the Brew task URL:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1632927

I will check in on this periodically to monitor its progress.

--vince


This event sent from IssueTracker by vincew 
 issue 240372
Comment 5 RHEL Product and Program Management 2009-02-16 10:12:55 EST
Updating PM score.
Comment 14 Don Zickus 2009-05-06 13:15:24 EDT
in kernel-2.6.18-144.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 19 errata-xmlrpc 2009-09-02 04:07:05 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
