Bug 474163 - gfs_controld: receive_own from N messages with plock_ownership enabled
Summary: gfs_controld: receive_own from N messages with plock_ownership enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 512799
Depends On:
Blocks:
 
Reported: 2008-12-02 15:49 UTC by Nate Straz
Modified: 2009-09-02 11:10 UTC
CC: 7 users

Fixed In Version: cman-2.0.100-1.el5
Doc Type: Bug Fix
Doc Text:
Cause: plock_ownership is enabled for gfs_controld.
Consequence: nodes mounting GFS have inconsistent views of POSIX locks on the file system.
Fix: reverse the inverted mode of synced locks, and sync resources that have no locks so that the resource owner is synced.
Result: POSIX locks remain consistent among nodes when plock_ownership is enabled.
Clone Of:
Environment:
Last Closed: 2009-09-02 11:10:50 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2009:1341 - SHIPPED_LIVE - Low: cman security, bug fix, and enhancement update - 2009-09-01 10:43:16 UTC

Description Nate Straz 2008-12-02 15:49:41 UTC
Description of problem:

After a failure in revolver with plock_ownership enabled, I found the following messages in syslog.

receive_own from 7 5f6a02a info nodeid 0 r owner -1

They occur on all nodes, but they only started after the first iteration of revolver completed and the file system was remounted.

Scenario iteration 0.1 started at Mon Dec  1 12:38:55 CST 2008
Those picked to face the revolver... tank-01 

Scenario iteration 0.2 started at Mon Dec  1 12:46:38 CST 2008

Dec  1 12:46:16 tank-01 gfs_controld[2877]: receive_own from 7 5f6a02a info nodeid 0 r owner -1
Dec  1 12:46:46 tank-03 gfs_controld[20850]: receive_own from 1 4005a info nodeid 0 r owner 0
Dec  1 12:46:46 tank-04 gfs_controld[20769]: receive_own from 1 4005a info nodeid 0 r owner 0
Dec  1 12:46:46 morph-01 gfs_controld[22805]: receive_own from 1 4005a info nodeid 0 r owner 0
Dec  1 12:46:46 morph-04 gfs_controld[20808]: receive_own from 1 4005a info nodeid 0 r owner 0

Version-Release number of selected component (if applicable):
cman-2.0.97-1.el5

How reproducible:
I hit this each of the three times I restarted revolver yesterday.

Steps to Reproduce:
1.  Set up a cluster with plock_ownership enabled (see the cluster.conf sketch below)
2.  Mount file systems and run a load with plocks
3.  Reboot one node and remount file systems
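
As a minimal illustration of step 1 (not taken from this report; values and
node names are hypothetical, and a real cluster.conf also needs fencing and
the rest of the node definitions), plock ownership is enabled through the
<gfs_controld> element of /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster name="tank" config_version="1">
        <!-- sketch: turn on plock ownership for gfs_controld;
             plock_rate_limit="0" disables the default plock rate
             limiting so a load generator can issue plocks freely -->
        <gfs_controld plock_ownership="1" plock_rate_limit="0"/>
        <clusternodes>
                <clusternode name="tank-01" nodeid="1"/>
                <!-- ... remaining cluster nodes ... -->
        </clusternodes>
</cluster>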
  
Actual results:
Lots of the above messages in /var/log/messages.  The test load was not affected.

Expected results:
No scary messages in /var/log/messages.

Additional info:

Comment 1 David Teigland 2008-12-03 23:16:59 UTC
Here's a fix to at least one problem that can lead to all sorts of chaos;
there may be others.

commit 4cae2bce5468278206bbac2b46bb7e18a2693c43
Author: David Teigland <teigland>
Date:   Wed Dec 3 17:07:34 2008 -0600

    gfs_controld: fix lock syncing in ownership mode
    
    bz 474163
    
    Locks that are synced due to a resource being "un-owned"
    were having their read/write mode reversed on the nodes
    being synced to.  This causes the plock state on the nodes
    to become out of sync, and operate wrongly.
    
    Signed-off-by: David Teigland <teigland>

diff --git a/group/gfs_controld/plock.c b/group/gfs_controld/plock.c
index 1b9bbe2..aa59ea3 100644
--- a/group/gfs_controld/plock.c
+++ b/group/gfs_controld/plock.c
@@ -1401,7 +1401,7 @@ static void _receive_sync(struct mountgroup *mg, char *bu
        }
 
        if (hd->type == MSG_PLOCK_SYNC_LOCK)
-               add_lock(r, info.nodeid, info.owner, info.pid, !info.ex, 
+               add_lock(r, info.nodeid, info.owner, info.pid, info.ex, 
                         info.start, info.end);
        else if (hd->type == MSG_PLOCK_SYNC_WAITER)
                add_waiter(mg, r, &info);
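
To make the effect of the one-character fix concrete, here is a standalone
sketch (hypothetical types and names, not the actual gfs_controld
structures; only the !info->ex versus info->ex distinction mirrors the real
patch) of what the inverted flag did: every lock synced from the resource
owner was recorded with the opposite read/write mode on the receiving node.

/* sketch.c - illustrate the inverted-mode bug fixed above */
#include <stdio.h>

struct plock_info {
	int nodeid;		/* node holding the lock */
	int ex;			/* 1 = exclusive (write), 0 = shared (read) */
	long start, end;	/* byte range covered by the posix lock */
};

static void receive_sync_lock(const struct plock_info *info, int buggy)
{
	/* Before the patch, add_lock() received !info->ex, so the mode
	 * recorded here was the opposite of the mode on the owner. */
	int ex = buggy ? !info->ex : info->ex;

	printf("recorded %s lock %ld-%ld for node %d\n",
	       ex ? "write" : "read", info->start, info->end, info->nodeid);
}

int main(void)
{
	struct plock_info lk = { .nodeid = 7, .ex = 0, .start = 0, .end = 99 };

	receive_sync_lock(&lk, 1);	/* buggy: read lock recorded as write */
	receive_sync_lock(&lk, 0);	/* fixed: mode passed through unchanged */
	return 0;
}

Once one node records the wrong mode, its conflict checks for new plock
requests disagree with every other node's, which is the "out of sync"
behavior the commit message describes.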

Comment 2 David Teigland 2008-12-04 16:23:53 UTC
Using the fix in comment 1, I've not been able to reproduce any of the
errors that I or Nate saw before.

Fix pushed to git branches:
STABLE2 gfs_controld
RHEL5 gfs_controld
master gfs_controld and dlm_controld

Comment 3 David Teigland 2008-12-11 23:51:15 UTC
Found another major bug in the ownership code: new nodes mounting and
syncing plock state from existing nodes were ignoring resources without
locks, so the new nodes had no record of resources owned by other
nodes.  I'll fix this as part of this bz as well.
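
A hedged sketch of this second bug (again with hypothetical names; the real
code reads plock state from checkpoints): if the reader skips resources
whose lock list is empty, an owned-but-idle resource is silently dropped,
so the newly mounted node has no record of the resource's owner.

/* ckpt_sketch.c - illustrate why lockless resources must be synced */
#include <stdio.h>

struct resource {
	unsigned long number;	/* resource (inode) number */
	int owner;		/* owning nodeid; 0 means unowned */
	int num_locks;		/* locks currently held on the resource */
};

static void read_checkpoint(const struct resource *ckpt, int count, int buggy)
{
	for (int i = 0; i < count; i++) {
		if (buggy && ckpt[i].num_locks == 0)
			continue;	/* bug: ownership record lost */
		printf("synced resource %lx owner %d (%d locks)\n",
		       ckpt[i].number, ckpt[i].owner, ckpt[i].num_locks);
	}
}

int main(void)
{
	/* resource numbers borrowed from the log messages above */
	struct resource ckpt[] = {
		{ 0x4005a,   1, 0 },	/* owned by node 1, no locks held */
		{ 0x5f6a02a, 7, 2 },	/* owned by node 7, two locks held */
	};

	read_checkpoint(ckpt, 2, 1);	/* buggy: node 1's resource dropped */
	read_checkpoint(ckpt, 2, 0);	/* fixed: owner synced regardless */
	return 0;
}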

Comment 4 David Teigland 2008-12-12 17:42:55 UTC
Fix for comment 3 pushed to git branches.

gfs_controld: read lockless resources from ckpts

STABLE2 gfs_controld d09d455d3bf5fa93eed53506856c47b30d27f775
RHEL5   gfs_controld 3d06a50da2541c94027392e39863fa61ac5f0214
master  gfs_controld and dlm_controld d8b4d87c8fd9502aea03f1782a3178b2828ef0d2

Comment 5 Fedora Update System 2009-01-24 02:36:09 UTC
gfs2-utils-2.03.11-1.fc9, cman-2.03.11-1.fc9, and rgmanager-2.03.11-1.fc9 have been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 8 David Teigland 2009-05-19 16:02:52 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: plock_ownership is enabled for gfs_controld.

Consequence: nodes mounting GFS have inconsistent views of POSIX locks on the file system.

Fix: reverse the inverted mode of synced locks, and sync resources that have no locks so that the resource owner is synced.

Result: POSIX locks remain consistent among nodes when plock_ownership is enabled.

Comment 9 Lon Hohberger 2009-07-22 20:27:54 UTC
*** Bug 512799 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2009-09-02 11:10:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html

