Bug 178738

Summary: transient hangs caused by ALTMODE bug
Product: [Retired] Red Hat Cluster Suite
Reporter: David Teigland <teigland>
Component: dlm
Assignee: David Teigland <teigland>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: ccaulfie, cluster-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: RHBA-2006-0237
Doc Type: Bug Fix
Last Closed: 2006-03-09 19:55:18 UTC
Attachments:
rand_direct program mentioned in description (flags: none)

Description David Teigland 2006-01-23 21:38:09 UTC
Description of problem:

When GFS uses direct-io, PR and CW locks are mixed together
on a single resource.  To optimize the interaction between
these two lock modes, GFS uses LM_FLAG_ANY to request that
either of the modes be granted.  When the dlm carries out
this optimization and grants a PR lock instead of a CW, or
a CW instead of a PR, the granted mode is not switched on
the non-master node.  So, for example, a lock is requested
in PR mode with the ALTCW flag and is granted on the
master node in CW mode, but the non-master (requesting)
node records the granted mode as PR.
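
To make the mismatch concrete, here is a minimal sketch in C.  It is
not the dlm-kernel source: every name in it (sk_mode, SK_ALTCW,
sk_master_grant, sk_buggy_record) is hypothetical, and it assumes only
two modes, PR and CW, which conflict with each other.

```c
/* Illustrative model only; all names are hypothetical, not dlm-kernel code. */
enum sk_mode { SK_CW, SK_PR };

#define SK_ALTCW 0x1u   /* requester will accept CW instead of PR */
#define SK_ALTPR 0x2u   /* requester will accept PR instead of CW */

struct sk_grant {
    int granted;        /* 1 if the master granted some mode */
    int altmode;        /* 1 if the alternate mode was granted */
    enum sk_mode mode;  /* mode recorded on the master */
};

/*
 * Master side.  `held` is the mode other nodes already hold on the
 * resource.  With only PR and CW in play, a request that differs from
 * `held` conflicts, so it is granted only if the matching ALT flag is set.
 */
struct sk_grant sk_master_grant(enum sk_mode held, enum sk_mode rq,
                                unsigned flags)
{
    struct sk_grant g = { 0, 0, rq };

    if (rq == held) {
        g.granted = 1;
    } else if ((rq == SK_PR && (flags & SK_ALTCW)) ||
               (rq == SK_CW && (flags & SK_ALTPR))) {
        g.granted = 1;
        g.altmode = 1;
        g.mode = held;  /* the alternate mode is the one already held */
    }
    return g;
}

/*
 * Non-master side as described in this report: the reply is ignored
 * and the requested mode is recorded as the granted mode.
 */
enum sk_mode sk_buggy_record(enum sk_mode rq, struct sk_grant g)
{
    (void)g;
    return rq;
}
```

With the resource held in CW, sk_master_grant(SK_CW, SK_PR, SK_ALTCW)
reports a CW grant with altmode set, while sk_buggy_record() still
returns PR, so the two nodes disagree about the granted mode.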

In the test used to uncover this bug, the outward sign of
trouble was transient hangs in the test program for 2-4
minutes at a time (until the bad lock was released by
gfs's normal drop logic).  It's not clear whether this bug
could have more severe consequences.  Only applications
using direct-io would be affected.

Version-Release number of selected component (if applicable):

All current RHEL4 code.

How reproducible:

unknown

Steps to Reproduce:
1. run rand_direct -s on some nodes
2. run make_panic -r 10 -l 100 on other nodes
  
Actual results:

tests hang on all nodes for a few minutes at a time,
then run for less than a minute until next hang

Expected results:

all tests run indefinitely

Additional info:

Comment 1 David Teigland 2006-01-23 21:38:09 UTC
Created attachment 123593 [details]
rand_direct program mentioned in description

Comment 2 David Teigland 2006-01-24 14:35:35 UTC
Fix by changing the grmode on the non-master node when we
get ALTMODE back from the master.
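
The idea can be sketched roughly as follows.  This is an illustration
under assumptions, not the actual lockqueue.c change: the struct, the
FIX_REPLY_ALTMODE flag value and the function names are all
hypothetical, and only the PR/CW pair is modelled.

```c
/* Illustrative model only; all names are hypothetical, not lockqueue.c. */
enum fix_mode { FIX_CW, FIX_PR };

#define FIX_REPLY_ALTMODE 0x1u  /* master granted the alternate mode */

struct fix_lkb {
    enum fix_mode rqmode;   /* mode the non-master node asked for */
    enum fix_mode grmode;   /* mode the non-master node believes it holds */
};

/* PR and CW are each other's alternate in this two-mode model. */
static enum fix_mode fix_alternate(enum fix_mode m)
{
    return m == FIX_PR ? FIX_CW : FIX_PR;
}

/*
 * Non-master side: handle a successful grant reply from the master.
 * Without the ALTMODE check, grmode would always equal rqmode, which
 * is the behaviour described in the bug.
 */
void fix_process_grant_reply(struct fix_lkb *lkb, unsigned reply_flags)
{
    lkb->grmode = lkb->rqmode;
    if (reply_flags & FIX_REPLY_ALTMODE)
        lkb->grmode = fix_alternate(lkb->rqmode);
}
```

For a PR request, a reply carrying FIX_REPLY_ALTMODE leaves grmode at
CW, matching what the master recorded.  A reply without the flag
leaves grmode equal to rqmode.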

[cluster-STABLE/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.6.5; previous revision: 1.37.2.6.6.4
done

[cluster-RHEL4/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.9; previous revision: 1.37.2.8
done


Comment 3 David Teigland 2006-01-24 17:17:26 UTC
In the situation described above, the correct lock mode
is always returned to GFS, which means it's less likely
there are more severe problems than the hang already
described.


Comment 4 David Teigland 2006-01-24 17:43:01 UTC
[cluster-RHEL4U3/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.10.2; previous revision: 1.37.2.6.10.1
done


Comment 7 Red Hat Bugzilla 2006-03-09 19:55:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0237.html