Bug 623810

Summary: dlm_controld: ignore plocks until checkpoint time
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Target Milestone: rc
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, djansa, fdinitto, lhh, rpeterso, ssaha, teigland
Doc Type: Bug Fix
Fixed In Version: cluster-3.0.12-25.el6
Last Closed: 2011-05-19 12:53:30 UTC

Attachments: patch

Description David Teigland 2010-08-12 20:37:08 UTC
Description of problem:

When dlm_controld joins a cpg and begins receiving plock messages, it needs to save all of those plock messages for processing after it initializes plock state from a checkpoint.  Instead of starting at 1, saved_plocks started at 0 and was only set to 1 shortly after the join.  This left a short window in which a plock message could arrive and be processed immediately instead of being saved, leaving that node's plock state out of sync with the other nodes, which could lead to any number of different plock problems.
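
For illustration, a minimal C sketch of the save-vs-process logic this description refers to.  The names (saved_plocks, save_message, receive_plock, struct lockspace) follow the wording above, but the code is a simplified stand-in, not the actual dlm_controld source:

#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins; the real structures live in dlm_controld's plock code. */
struct plock_msg {
	char data[64];                  /* message payload, details omitted */
};

struct msg_entry {
	struct plock_msg msg;
	struct msg_entry *next;
};

struct lockspace {
	int saved_plocks;               /* 1 = queue incoming plock messages */
	struct msg_entry *saved;        /* queued messages, replayed after the
	                                   checkpoint is read */
};

static void save_message(struct lockspace *ls, struct plock_msg *msg)
{
	struct msg_entry *e = malloc(sizeof(*e));
	if (!e)
		return;
	memcpy(&e->msg, msg, sizeof(*msg));
	e->next = ls->saved;
	ls->saved = e;
}

static void process_plock(struct lockspace *ls, struct plock_msg *msg)
{
	/* apply the plock operation to local state (omitted) */
	(void)ls;
	(void)msg;
}

/*
 * Called for every plock message delivered by the cpg.  If saved_plocks
 * starts at 0 and is only set to 1 shortly after the join, a message
 * arriving in that window is processed immediately instead of being
 * queued, leaving this node's plock state out of sync with the others.
 */
static void receive_plock(struct lockspace *ls, struct plock_msg *msg)
{
	if (ls->saved_plocks)
		save_message(ls, msg);
	else
		process_plock(ls, msg);
}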

The test used for this is:

node1: mount /gfs; cd /gfs; lock_load -n 1000
node2: mount /gfs; cd /gfs; lock_load -n 1000
node3: mount /gfs; cd /gfs; lock_load -n 1000
node4: mount /gfs; cd /gfs; lock_load -n 1000
node1: kill lock_load; cd; umount /gfs
node2: kill lock_load; cd; umount /gfs
node3: kill lock_load; cd; umount /gfs
node1: mount /gfs; cd /gfs; lock_load -n 1000


Comment 1 David Teigland 2010-08-12 21:08:19 UTC
Created attachment 438528: patch

fix

Comment 3 David Teigland 2010-08-13 20:14:45 UTC
The original description is an incorrect diagnosis of the problem, and the patch is an incorrect solution.

Correct diagnosis:

    After our join confchg, we need to ignore plock messages
    until the point in time where the ckpt_node saves plock
    state (final start message received).  At that time we need
    to shift from ignoring plock messages to saving plock
    messages that will be applied on top of the plock state that
    is read from the checkpoint.  The code was not ignoring
    plock messages during the first stage of the process, which
    led to incorrect plock state.
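
As an illustration of this corrected handling, a hedged C sketch of the three stages described above (ignore until the final start, then save, then process).  The state names and helper functions are invented for this sketch and are not taken from the actual patch:

/* Illustrative receive-side states for a lockspace that has just joined
   the cpg; names are made up for this sketch, not taken from the patch. */
enum plock_recv_state {
	PLOCK_IGNORE,   /* after our join confchg, before the final start message:
	                   these plocks will already be reflected in the checkpoint
	                   the ckpt_node writes, so drop them */
	PLOCK_SAVE,     /* final start received, ckpt_node has saved plock state:
	                   queue messages to apply on top of the checkpoint */
	PLOCK_PROCESS,  /* checkpoint read and saved messages replayed:
	                   handle plock messages normally */
};

struct lockspace {
	enum plock_recv_state plock_state;
	/* queued messages, checkpoint handle, etc. omitted */
};

void on_join_confchg(struct lockspace *ls)
{
	ls->plock_state = PLOCK_IGNORE;      /* first stage: ignore */
}

void on_final_start(struct lockspace *ls)
{
	ls->plock_state = PLOCK_SAVE;        /* second stage: save for replay */
}

void on_checkpoint_applied(struct lockspace *ls)
{
	ls->plock_state = PLOCK_PROCESS;     /* normal operation */
}

void receive_plock(struct lockspace *ls, void *msg)
{
	switch (ls->plock_state) {
	case PLOCK_IGNORE:
		break;                           /* covered by the checkpoint */
	case PLOCK_SAVE:
		/* queue msg for replay after reading the checkpoint (omitted) */
		break;
	case PLOCK_PROCESS:
		/* apply msg to local plock state (omitted) */
		break;
	}
	(void)msg;
}

The bug was the missing first stage: without it, a message arriving between our join confchg and the final start was applied directly, even though its effect is also captured in the checkpoint we later read, leading to incorrect plock state.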

Comment 7 errata-xmlrpc 2011-05-19 12:53:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html