+++ This bug was initially created as a clone of Bug #476659 +++

This is a clone of the original problem, which was reported for RHEL 5. We now have a reported case on RHEL 4.

Description of problem:

From http://lkml.org/lkml/2007/10/17/314:

"We have observed hangs in posix_locks_deadlock() when multiple threads use fcntl(2) F_SETLKW to synchronize file accesses. The problem appears to be due to an error in the implementation of posix_locks_deadlock() in which "goto next_task" is used to break out of the list_for_each_entry() file_lock search after which the posix_same_owner(caller_fl, block_fl) test may evaluate to false and the list_for_each_entry() loop restarts all over again. This in turn leads to a hang where posix_locks_deadlock() never returns.

The workaround is to change the posix_same_owner() test within the list_for_each_entry() loop to directly compare caller_fl against current fl entry."

A reproducer was posted here:

http://lkml.org/lkml/2007/10/17/472

Version-Release number of selected component (if applicable):
2.6.18-*

How reproducible:
Fairly straightforward - see reproducer at:
http://lkml.org/lkml/2007/10/17/472

Steps to Reproduce:
1. Compile fcntltest.c
2. Run fcntltest

Actual results:
Softlockup warnings appear after some time:

Expected results:
No softlockup.

Additional info:
Fixed by 97855b49b6bac0bd25f16b017883634d13591d00. There have been a few other related commits since then, but this is the one that fixes the above lockup.

commit 97855b49b6bac0bd25f16b017883634d13591d00
Author: J. Bruce Fields <bfields.edu>
Date:   Tue Oct 30 11:20:02 2007 -0400

    locks: fix possible infinite loop in posix deadlock detection

    It's currently possible to send posix_locks_deadlock() into an infinite
    loop (under the BKL).

    For now, fix this just by bailing out after a few iterations. We may want
    to fix this in a way that better clarifies the semantics of deadlock
    detection. But that will take more time, and this minimal fix is probably
    adequate for any realistic scenario, and is simple enough to be appropriate
    for applying to stable kernels now.

    Thanks to George Davis for reporting the problem.

    Cc: "George G. Davis" <gdavis>
    Signed-off-by: J. Bruce Fields <bfields.edu>
    Acked-by: Alan Cox <alan>
    Signed-off-by: Linus Torvalds <torvalds>

diff --git a/fs/locks.c b/fs/locks.c
index 0127a28..8b8388e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -696,17 +696,28 @@ EXPORT_SYMBOL(posix_test_lock);
  * Note: the above assumption may not be true when handling lock requests
  * from a broken NFS client. But broken NFS clients have a lot more to
  * worry about than proper deadlock detection anyway... --okir
+ *
+ * However, the failure of this assumption (also possible in the case of
+ * multiple tasks sharing the same open file table) also means there's no
+ * guarantee that the loop below will terminate. As a hack, we give up
+ * after a few iterations.
  */
+
+#define MAX_DEADLK_ITERATIONS 10
+
 static int posix_locks_deadlock(struct file_lock *caller_fl,
 				struct file_lock *block_fl)
 {
 	struct file_lock *fl;
+	int i = 0;
 
 next_task:
 	if (posix_same_owner(caller_fl, block_fl))
 		return 1;
 	list_for_each_entry(fl, &blocked_list, fl_link) {
 		if (posix_same_owner(fl, block_fl)) {
+			if (i++ > MAX_DEADLK_ITERATIONS)
+				return 0;
 			fl = fl->fl_next;
 			block_fl = fl;
 			goto next_task;
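
To see why the iteration cap in the patch above is enough to stop the hang, here is a small user-space model of the chain walk. It is only a sketch under simplifying assumptions, not the kernel code: `struct owner`, `waits_on`, `would_deadlock()` and `MAX_WALK` are hypothetical names, and each owner is assumed to wait on at most one other owner. The point it illustrates is that a wait-for cycle which does not pass through the caller would be followed forever without the counter.

```c
#include <stddef.h>

/* Hypothetical model, not kernel code. Mirrors MAX_DEADLK_ITERATIONS. */
#define MAX_WALK 10

struct owner {
	/* Owner holding the lock this owner is blocked on, or NULL. */
	struct owner *waits_on;
};

/*
 * Returns 1 if the chain starting at 'blocker' leads back to 'caller',
 * meaning that blocking 'caller' on 'blocker' would close a cycle.
 * Returns 0 if the chain ends, or if we give up after too many hops.
 */
static int would_deadlock(const struct owner *caller,
			  const struct owner *blocker)
{
	int i = 0;

	while (blocker != NULL) {
		if (blocker == caller)
			return 1;
		/* Same shape as the patch: bail out instead of looping forever. */
		if (i++ > MAX_WALK)
			return 0;
		blocker = blocker->waits_on;
	}
	return 0;
}
```

As in the patch, giving up reports "no deadlock", so a very long or caller-free cyclic chain is treated as safe rather than detected.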
Created attachment 346640
Proposed patch

Backport of git commit 97855b49b6bac0bd25f16b017883634d13591d00
Committed in 89.8.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Verified in -89.8.EL, reproduced in 89.ELsmp. When the reproducer from http://lkml.org/lkml/2007/10/17/472 is run on -89.ELsmp, the system hangs completely: the machine responds to ping, but no other activity on it is possible. On 89.8.EL the reproducer terminates with a segfault after a while. Both kernels print 'XX.txt: Resource deadlock avoided' several times. No softlockup warnings were found on either kernel.
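
For context on the 'Resource deadlock avoided' lines: that is the usual glibc text for EDEADLK, the error fcntl(2) F_SETLKW returns when the kernel's deadlock check refuses a blocking lock request. The following is a minimal sketch of such a call and is not taken from fcntltest.c; the helper names `lock_whole_file_wait()` and `unlock_whole_file()` are made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: take a blocking write lock on the whole file.
 * Returns 0 on success, -1 on failure after printing "<name>: <error>".
 */
static int lock_whole_file_wait(int fd, const char *name)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;	/* 0 means "through end of file" */

	if (fcntl(fd, F_SETLKW, &fl) == -1) {
		/* With errno == EDEADLK the kernel refused the request
		 * because it detected a potential deadlock. */
		fprintf(stderr, "%s: %s\n", name, strerror(errno));
		return -1;
	}
	return 0;
}

/* Hypothetical helper: release the lock taken above. */
static int unlock_whole_file(int fd)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_UNLCK;
	fl.l_whence = SEEK_SET;

	return fcntl(fd, F_SETLK, &fl) == -1 ? -1 : 0;
}
```

Seeing the message on both kernels is consistent with the deadlock check itself running on both; the difference in this report is whether the check can loop forever.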
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html