Bug 178738

Summary:

transient hangs caused by ALTMODE bug

Product:

[Retired] Red Hat Cluster Suite

Reporter:

David Teigland <teigland>

Component:

dlm

Assignee:

David Teigland <teigland>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

medium

Priority:

medium

CC:

ccaulfie, cluster-maint

Hardware:

All

OS:

Linux

Fixed In Version:

RHBA-2006-0237

Doc Type:

Bug Fix

Last Closed:

2006-03-09 19:55:18 UTC

Attachments:

Description	Flags
rand_direct program mentioned in description	none

Description David Teigland 2006-01-23 21:38:09 UTC

Description of problem:

When GFS uses direct-io, PR and CW locks are mixed together
on a single resource.  To optimize the interaction between
these two lock modes, GFS uses LM_FLAG_ANY to request that
either of the modes be granted.  When the dlm carries out
this optimization and grants a PR lock instead of a CW, or
a CW instead of a PR, the mode is not switched on the
non-master node.  So, for example, a lock is requested
in PR mode with the ALTCW flag and is granted on the
master node in CW mode, but the non-master (requesting)
node records the granted mode as PR.
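
The mismatch described above can be modelled in a few lines of C.
This is only a sketch under stated assumptions, not the dlm-kernel
code: every name with an sk_ or SK_ prefix is hypothetical, only
the PR and CW modes are modelled, and the rule the master uses to
pick the alternate mode (grant it when the requested mode conflicts
with what is already held) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names and values for this sketch only. */
enum sk_mode { SK_BLOCKED = -1, SK_CW = 0, SK_PR = 1 };

#define SK_FLAG_ALTPR 0x1u  /* requester will accept PR instead of CW */
#define SK_FLAG_ALTCW 0x2u  /* requester will accept CW instead of PR */

struct sk_lock {
    enum sk_mode rqmode;    /* mode the node asked for */
    enum sk_mode grmode;    /* mode the node believes it was granted */
    unsigned flags;         /* SK_FLAG_ALT* bits sent with the request */
};

/* PR and CW are each compatible with themselves, not with each other. */
bool sk_compatible(enum sk_mode a, enum sk_mode b)
{
    return a == b;
}

/* Master side (assumed policy): grant the requested mode if it is
 * compatible with the mode already held on the resource, otherwise
 * fall back to the alternate mode if the requester allowed it. */
enum sk_mode sk_master_grant(enum sk_mode rqmode, unsigned flags,
                             enum sk_mode held)
{
    if (sk_compatible(rqmode, held))
        return rqmode;
    if (rqmode == SK_PR && (flags & SK_FLAG_ALTCW) &&
        sk_compatible(SK_CW, held))
        return SK_CW;
    if (rqmode == SK_CW && (flags & SK_FLAG_ALTPR) &&
        sk_compatible(SK_PR, held))
        return SK_PR;
    return SK_BLOCKED;
}

/* Requester side, modelling the bug: the granted mode is recorded as
 * the requested mode even when the master granted the alternate. */
void sk_buggy_record(struct sk_lock *lk)
{
    assert(lk != 0);
    lk->grmode = lk->rqmode;
}

/* True when the requester's record differs from the master's grant. */
bool sk_modes_disagree(const struct sk_lock *lk, enum sk_mode master_mode)
{
    return lk->grmode != master_mode;
}
```

With a request for SK_PR carrying SK_FLAG_ALTCW against a resource
held in SK_CW, sk_master_grant returns SK_CW while sk_buggy_record
leaves grmode at SK_PR, so sk_modes_disagree reports the stale mode.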

In the test used to uncover this bug, the outward sign of
trouble was transient hangs in the test program for 2-4
minutes at a time (until the bad lock was released by
gfs's normal drop logic).  It's not clear whether this bug
could have more severe consequences.  Only applications
using direct-io would be affected.

Version-Release number of selected component (if applicable):

All current RHEL4 code.

How reproducible:

unknown

Steps to Reproduce:
1. run rand_direct -s on some nodes
2. run make_panic -r 10 -l 100 on other nodes
  
Actual results:

tests hang on all nodes for a few minutes at a time,
then run for less than a minute until next hang

Expected results:

all tests run indefinitely

Additional info:

Comment 1 David Teigland 2006-01-23 21:38:09 UTC

Created attachment 123593
rand_direct program mentioned in description

Comment 2 David Teigland 2006-01-24 14:35:35 UTC

Fix by changing the grmode on the non-master node when we
get ALTMODE back from the master.
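
The idea of the fix can be sketched as follows.  This is not the
lockqueue.c change itself: all fx_ and FX_ names, the flag values
and the reply layout are hypothetical and chosen for illustration.
It only shows the requesting node switching its recorded granted
mode when the master's reply says the alternate mode was granted.

```c
#include <assert.h>

/* Hypothetical names and values for this sketch only. */
enum fx_mode { FX_NONE = -1, FX_CW = 0, FX_PR = 1 };

#define FX_REQ_ALTPR     0x1u  /* request: PR is acceptable instead of CW */
#define FX_REQ_ALTCW     0x2u  /* request: CW is acceptable instead of PR */
#define FX_REPLY_ALTMODE 0x4u  /* reply: the alternate mode was granted */

struct fx_lock {
    enum fx_mode rqmode;       /* mode the node asked for */
    enum fx_mode grmode;       /* mode the node records as granted */
    unsigned req_flags;        /* FX_REQ_ALT* bits sent with the request */
};

/* Requester side: record the granted mode after the master's reply.
 * Without the ALTMODE check the alternate grant would be recorded
 * as the requested mode, which is the bug described in this report. */
void fx_handle_grant_reply(struct fx_lock *lk, unsigned reply_flags)
{
    assert(lk != 0);

    lk->grmode = lk->rqmode;   /* default: granted as requested */

    if (reply_flags & FX_REPLY_ALTMODE) {
        if (lk->req_flags & FX_REQ_ALTPR)
            lk->grmode = FX_PR;
        else if (lk->req_flags & FX_REQ_ALTCW)
            lk->grmode = FX_CW;
    }
}
```

For a lock requested as FX_PR with FX_REQ_ALTCW, a reply carrying
FX_REPLY_ALTMODE leaves grmode at FX_CW, matching the master.  A
reply without that flag leaves grmode equal to rqmode.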

[cluster-STABLE/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.6.5; previous revision: 1.37.2.6.6.4
done

[cluster-RHEL4/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.9; previous revision: 1.37.2.8
done

Comment 3 David Teigland 2006-01-24 17:17:26 UTC

In the situation described above, the correct lock mode
is always returned to GFS, which means it's less likely
there are more severe problems than the hang already
described.

Comment 4 David Teigland 2006-01-24 17:43:01 UTC

[cluster-RHEL4U3/dlm-kernel/src]% cvs commit
cvs commit: Examining .
Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.6.10.2; previous revision: 1.37.2.6.10.1
done

Comment 7 Red Hat Bugzilla 2006-03-09 19:55:18 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0237.html