Bug 138384

Summary:	posix_locks_deadlock() getting stuck in endless loop
Product:	Red Hat Enterprise Linux 3	Reporter:	David Lehman <dlehman>
Component:	kernel	Assignee:	Frank Hirtz <fhirtz>
Status:	CLOSED ERRATA	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.0	CC:	petrides, riel, tao
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-12-20 20:56:56 UTC	Type:	---

Description David Lehman 2004-11-08 19:32:41 UTC

*** This bug has been split off bug 132540 ***

Description of problem:

posix_locks_deadlock() gets stuck in an endless loop when running a
Samba stress test.  Samba uses both flocks and POSIX locks.  When a
flock request blocks, it is added to blocked_list without first being
checked for possible deadlocks by posix_locks_deadlock(), whereas
every POSIX lock request is checked for possible deadlocks *before*
it is added to blocked_list.

When there is a circular dependency in blocked_list, 
posix_locks_deadlock gets stuck in that circle.
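
To illustrate the failure mode, here is a minimal user-space sketch of a
deadlock walk over a list of blocked waiters.  It is a simplified model,
not the kernel code: owners are plain integers, the list is an array, and
the names wait_edge, toy_locks_deadlock, WALK_STUCK and max_steps are all
made up for this example.  The step cap exists only so the demo
terminates; without it, a cycle that does not include the caller would be
followed forever.

```c
#include <stddef.h>

/* One entry of a toy blocked list: `waiter` is blocked by `blocker`. */
struct wait_edge {
	int waiter;
	int blocker;
};

/* Hypothetical result code: the walk exceeded the step cap. */
#define WALK_STUCK (-1)

/*
 * Toy deadlock walk.  Follows the chain of blockers starting at `blocker`.
 * Returns 1 if the chain leads back to `caller` (a real deadlock),
 * 0 if the chain ends, and WALK_STUCK if more than max_steps hops are
 * needed.  The step cap is an addition for this demo only.
 */
static int toy_locks_deadlock(const struct wait_edge *list, size_t n,
			      int caller, int blocker, int max_steps)
{
	int steps = 0;
	size_t i;

next_task:
	if (caller == blocker)
		return 1;
	for (i = 0; i < n; i++) {
		if (list[i].waiter == blocker) {
			if (++steps > max_steps)
				return WALK_STUCK;
			/* The current blocker is itself waiting: follow it. */
			blocker = list[i].blocker;
			goto next_task;
		}
	}
	return 0;
}
```

With a list holding the edges {2, 3} and {3, 2}, a request from owner 1
that is blocked by owner 2 never reaches owner 1 again and never runs out
of edges, so the walk only stops because of the cap.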

The fix is to not add flock requests to blocked_list when they are 
blocked.  blocked_list is only used to check for possible deadlocks, 
which should only be done for posix locks.  Here's a patch:


--- locks.c.orig	2004-09-14 11:12:26.000000000 -0500
+++ locks.c	2004-09-14 11:13:32.000000000 -0500
@@ -459,7 +459,8 @@ static void locks_insert_block(struct fi
 	}
 	list_add_tail(&waiter->fl_block, &blocker->fl_block);
 	waiter->fl_next = blocker;
-	list_add(&waiter->fl_link, &blocked_list);
+	if (IS_POSIX(blocker))
+		list_add(&waiter->fl_link, &blocked_list);
 }
 
 /* Wake up processes blocked waiting for blocker.
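
The effect of the IS_POSIX(blocker) test can be sketched in user space as
follows.  This is only a model under stated assumptions: lock_kind,
blocked_entry and record_block are hypothetical names, and a fixed-size
array stands in for the kernel's linked list.

```c
#include <stddef.h>

/* Hypothetical tag for the kind of lock that is doing the blocking. */
enum lock_kind { LOCK_POSIX, LOCK_FLOCK };

struct blocked_entry {
	int waiter;
	int blocker;
};

/*
 * Record that `waiter` is blocked by `blocker`, but only when the
 * blocking lock is a POSIX lock, mirroring the check added by the patch.
 * Returns the new number of entries in `list`.
 */
static size_t record_block(struct blocked_entry *list, size_t n, size_t cap,
			   int waiter, int blocker, enum lock_kind blocker_kind)
{
	if (blocker_kind != LOCK_POSIX)
		return n;	/* flock waiters stay out of the deadlock list */
	if (n == cap)
		return n;	/* toy array is full; not a limit of a real list */
	list[n].waiter = waiter;
	list[n].blocker = blocker;
	return n + 1;
}
```

In this model a waiter blocked by a flock leaves the list unchanged, so
only POSIX waiters, which have already passed the deadlock check, can
ever appear in it.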



Version-Release number of selected component (if applicable):
2.6.7-1.451.2.3

How reproducible:
reproducible every time with the right setup, but it's hard to get 
the right setup.

Steps to Reproduce:

1. Connect a SCSI drive and a USB hard drive to a server.
2. Share the SCSI drive, the USB drive, and a RAM drive with Samba.
3. Connect 30 clients to the server and have each run 3 threads of
   network stress, one to each of the three Samba shares.
4. The system fails in an hour or so.

Actual results:

The system hangs, but SysRq is generally still functional.  SysRq shows
that one task is stuck in posix_locks_deadlock(), while others are
waiting for the big kernel lock that is held by
posix_locks_deadlock().


Expected results:

system should continue to run without problems.

Additional info:

------- Additional comment by Jason Baron on 2004.10.21 14:48 -------

This patch is included in the latest beta of RHEL4.

Comment 2 Rik van Riel 2004-11-08 19:40:37 UTC

Why are you reporting a RHEL3 bug for kernel 2.6.7-1.451.2.3?

Comment 3 David Lehman 2004-11-08 19:47:58 UTC

My mistake. 

This bug was reported against kernel-2.4.21-20.EL.

Comment 6 Ernie Petrides 2004-11-08 19:56:06 UTC

This problem was already fixed in U4 (on 22-Sep-2004 in kernel
version 2.4.21-20.10.EL).  Please verify fix in current U4 beta.

Comment 7 John Flanagan 2004-12-20 20:56:56 UTC

An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html