Bug 178738 - transient hangs caused by ALTMODE bug
Summary: transient hangs caused by ALTMODE bug
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: dlm
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-01-23 21:38 UTC by David Teigland
Modified: 2009-04-16 20:30 UTC (History)

Fixed In Version: RHBA-2006-0237
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-09 19:55:18 UTC
Embargoed:


Attachments
rand_direct program mentioned in description (3.23 KB, text/plain)
2006-01-23 21:38 UTC, David Teigland


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2006:0237 | 0 | normal | SHIPPED_LIVE | dlm-kernel bug fix update | 2006-03-09 05:00:00 UTC

Description David Teigland 2006-01-23 21:38:09 UTC
Description of problem:

When GFS uses direct-io, PR and CW locks are mixed together
on a single resource.  To optimize the interaction between
these two lock modes, GFS uses LM_FLAG_ANY to request that
either of the modes be granted.  When the dlm carries out
this optimization and grants a PR lock instead of a CW, or
a CW instead of a PR, the mode is not switched on the
non-master node.  So, for example, the lock will be requested
in PR mode with the ALTCW flag, it will be granted on the
master node in CW mode, but the non-master (requesting)
node will record the granted mode as PR.

In the test used to uncover this bug, the outward sign of
trouble was transient hangs in the test program for 2-4
minutes at a time (until the bad lock was released by
gfs's normal drop logic).  It's not clear whether this bug
could have more severe consequences.  Only applications
using direct-io would be affected.

Version-Release number of selected component (if applicable):

All current RHEL4 code.

How reproducible:

unknown

Steps to Reproduce:
1. run rand_direct -s on some nodes
2. run make_panic -r 10 -l 100 on other nodes
  
Actual results:

tests hang on all nodes for a few minutes at a time,
then run for less than a minute until next hang

Expected results:

all tests run indefinitely

Additional info:

Comment 1 David Teigland 2006-01-23 21:38:09 UTC
Created attachment 123593
rand_direct program mentioned in description

Comment 2 David Teigland 2006-01-24 14:35:35 UTC
Fix by changing the grmode on the non-master node when we
get ALTMODE back from the master.
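
A minimal sketch of that idea, for readers without the CVS tree at
hand.  It is not the committed change to lockqueue.c: the struct,
flag names and values below (lock_copy, REQ_ALTPR, REQ_ALTCW,
REPLY_ALTMODE, process_grant_reply) are hypothetical stand-ins chosen
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names and values; not copied from dlm-kernel. */
enum lock_mode { MODE_CW = 2, MODE_PR = 3 };

#define REQ_ALTPR     0x1u /* request: PR is an acceptable alternate */
#define REQ_ALTCW     0x2u /* request: CW is an acceptable alternate */
#define REPLY_ALTMODE 0x1u /* reply: master granted the alternate mode */

struct lock_copy {
    int rqmode;          /* mode that was requested */
    int grmode;          /* mode recorded as granted on this node */
    unsigned req_flags;  /* REQ_* flags sent with the request */
};

/* Return the alternate mode named by the request flags, or -1 if none. */
static int alternate_mode(const struct lock_copy *lkb)
{
    if (lkb->req_flags & REQ_ALTPR)
        return MODE_PR;
    if (lkb->req_flags & REQ_ALTCW)
        return MODE_CW;
    return -1;
}

/*
 * Non-master side: handle the master's grant reply.  If the master
 * reports that it granted the alternate mode, record that mode as
 * grmode instead of the requested one.
 */
static void process_grant_reply(struct lock_copy *lkb, unsigned reply_flags)
{
    int alt;

    assert(lkb != NULL);
    alt = alternate_mode(lkb);

    if ((reply_flags & REPLY_ALTMODE) && alt >= 0)
        lkb->grmode = alt;          /* the fix: switch to the alternate */
    else
        lkb->grmode = lkb->rqmode;  /* ordinary grant of the requested mode */
}
```

For a PR request carrying REQ_ALTCW, a reply with REPLY_ALTMODE set
leaves grmode at MODE_CW, matching what the master granted.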

[cluster-STABLE/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.6.5; previous revision: 1.37.2.6.6.4
done

[cluster-RHEL4/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.9; previous revision: 1.37.2.8
done


Comment 3 David Teigland 2006-01-24 17:17:26 UTC
In the situation described above, the correct lock mode
is always returned to GFS, which means it's less likely
there are more severe problems than the hang already
described.


Comment 4 David Teigland 2006-01-24 17:43:01 UTC
[cluster-RHEL4U3/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.10.2; previous revision: 1.37.2.6.10.1
done


Comment 7 Red Hat Bugzilla 2006-03-09 19:55:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0237.html


