Bug 623810 - dlm_controld: ignore plocks until checkpoint time
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2010-08-12 16:37 EDT by David Teigland
Modified: 2011-05-19 08:53 EDT (History)
CC: 8 users

Fixed In Version: cluster-3.0.12-25.el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 08:53:30 EDT

Attachments (Terms of Use)
patch (2.37 KB, text/plain)
2010-08-12 17:08 EDT, David Teigland


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0537 normal SHIPPED_LIVE cluster and gfs2-utils bug fix update 2011-05-18 13:57:40 EDT

Description David Teigland 2010-08-12 16:37:08 EDT
Description of problem:

When dlm_controld joins a cpg and begins receiving plock messages, it needs to save those messages for processing after it initializes its plock state from a checkpoint.  However, saved_plocks started as 0 and was only set to 1 shortly after the join, instead of being initialized to 1.  This left a short window in which a plock message could arrive and be processed immediately rather than saved.  A message handled in that window would leave the node's plock state out of sync with the other nodes, which could lead to any number of different plock problems.

The test used for this is:

node1: mount /gfs; cd /gfs; lock_load -n 1000
node2: mount /gfs; cd /gfs; lock_load -n 1000
node3: mount /gfs; cd /gfs; lock_load -n 1000
node4: mount /gfs; cd /gfs; lock_load -n 1000
node1: kill lock_load; cd; umount /gfs
node2: kill lock_load; cd; umount /gfs
node3: kill lock_load; cd; umount /gfs
node1: mount /gfs; cd /gfs; lock_load -n 1000


Comment 1 David Teigland 2010-08-12 17:08:19 EDT
Created attachment 438528 [details]
patch

fix
Comment 3 David Teigland 2010-08-13 16:14:45 EDT
The original description is an incorrect diagnosis of the problem, and the patch is an incorrect solution.

Correct diagnosis:

    After our join confchg, we need to ignore plock messages
    until the point in time where the ckpt_node saves plock
    state (final start message received).  At that time we need
    to shift from ignoring plock messages to saving plock
    messages that will be applied on top of the plock state that
    is read from the checkpoint.  The code was not ignoring
    plock messages during the first stage of the process, which
    led to incorrect plock state.
Comment 7 errata-xmlrpc 2011-05-19 08:53:30 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html
