Bug 178738 - transient hangs caused by ALTMODE bug
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
 
Reported: 2006-01-23 16:38 EST by David Teigland
Modified: 2009-04-16 16:30 EDT
Fixed In Version: RHBA-2006-0237
Doc Type: Bug Fix
Last Closed: 2006-03-09 14:55:18 EST

Attachments
rand_direct program mentioned in description (3.23 KB, text/plain)
2006-01-23 16:38 EST, David Teigland

Description David Teigland 2006-01-23 16:38:09 EST
Description of problem:

When GFS uses direct-io, PR and CW locks are mixed together
on a single resource.  To optimize the interaction between
these two lock modes, GFS uses LM_FLAG_ANY to request that
either of the modes be granted.  When the dlm carries out
this optimization and grants a PR lock instead of a CW, or
a CW instead of a PR, the mode is not switched on the
non-master node.  So, for example, a lock is requested
in PR mode with the ALTCW flag and granted on the
master node in CW mode, but the non-master (requesting)
node records the granted mode as PR.

In the test used to uncover this bug, the outward sign of
trouble was transient hangs in the test program for 2-4
minutes at a time (until the bad lock was released by
gfs's normal drop logic).  It's not clear whether this bug
could have more severe consequences.  Only applications
using direct-io would be affected.

Version-Release number of selected component (if applicable):

All current RHEL4 code.

How reproducible:

unknown

Steps to Reproduce:
1. run rand_direct -s on some nodes
2. run make_panic -r 10 -l 100 on other nodes
  
Actual results:

tests hang on all nodes for a few minutes at a time,
then run for less than a minute until next hang

Expected results:

all tests run indefinitely

Additional info:
Comment 1 David Teigland 2006-01-23 16:38:09 EST
Created attachment 123593
rand_direct program mentioned in description
Comment 2 David Teigland 2006-01-24 09:35:35 EST
Fix by changing the grmode on the non-master node when we
get ALTMODE back from the master.
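
For illustration only, here is a rough sketch of the idea behind
the fix.  This is not the lockqueue.c change itself; the struct,
the enum and every constant and function name below are
hypothetical stand-ins, not the dlm-kernel definitions.  It
assumes the master's reply carries a flag saying the alternate
mode was granted, and that the requesting node must then record
the alternate mode as grmode rather than the mode it asked for.

```c
#include <assert.h>

/* Hypothetical stand-ins; NOT the actual dlm-kernel definitions. */
enum lock_mode { MODE_NONE, MODE_CW, MODE_PR };

#define FLAG_ALTPR     0x1  /* request: PR is acceptable instead of CW */
#define FLAG_ALTCW     0x2  /* request: CW is acceptable instead of PR */
#define REPLY_ALTMODE  0x1  /* reply: master granted the alternate mode */

/* Simplified lock block as kept on the requesting (non-master) node. */
struct lkb {
    enum lock_mode rqmode;  /* mode that was requested */
    enum lock_mode grmode;  /* mode recorded as granted */
    unsigned int flags;     /* request flags */
};

/* Work out which alternate mode the request allowed. */
static enum lock_mode alt_mode(const struct lkb *lkb)
{
    /* Only meaningful if the request named an alternate mode. */
    assert(lkb->flags & (FLAG_ALTPR | FLAG_ALTCW));
    return (lkb->flags & FLAG_ALTCW) ? MODE_CW : MODE_PR;
}

/* Requesting-node handling of a grant reply from the master. */
static void process_grant_reply(struct lkb *lkb, unsigned int reply_flags)
{
    if (reply_flags & REPLY_ALTMODE)
        lkb->grmode = alt_mode(lkb);  /* the switch the bug left out */
    else
        lkb->grmode = lkb->rqmode;    /* granted as requested */
}
```

With a request in PR mode carrying the ALTCW-style flag, a reply
flagged as alternate leaves grmode at CW on the requesting node,
matching what the master granted.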

[cluster-STABLE/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.6.5; previous revision: 1.37.2.6.6.4
done

[cluster-RHEL4/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.9; previous revision: 1.37.2.8
done
Comment 3 David Teigland 2006-01-24 12:17:26 EST
In the situation described above, the correct lock mode
is always returned to GFS, which means it's less likely
there are more severe problems than the hang already
described.
Comment 4 David Teigland 2006-01-24 12:43:01 EST
[cluster-RHEL4U3/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.10.2; previous revision: 1.37.2.6.10.1
done
Comment 7 Red Hat Bugzilla 2006-03-09 14:55:18 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0237.html
