Bug 623810 - dlm_controld: ignore plocks until checkpoint time
Summary: dlm_controld: ignore plocks until checkpoint time
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All Linux
Target Milestone: rc
Assignee: David Teigland
QA Contact: Cluster QE
Depends On:
Reported: 2010-08-12 20:37 UTC by David Teigland
Modified: 2011-05-19 12:53 UTC (History)
8 users

Fixed In Version: cluster-3.0.12-25.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2011-05-19 12:53:30 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
patch (2.37 KB, text/plain)
2010-08-12 21:08 UTC, David Teigland
no flags Details

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0537 normal SHIPPED_LIVE cluster and gfs2-utils bug fix update 2011-05-18 17:57:40 UTC

Description David Teigland 2010-08-12 20:37:08 UTC
Description of problem:

When dlm_controld joins a cpg and begins receiving plock messages, it needs to save all of those messages and process them only after it has initialized its plock state from a checkpoint.  Instead of being initialized to 1 at the join, saved_plocks started as 0 and was set to 1 only shortly after the join.  This left a short window in which a plock message could arrive and be processed immediately rather than saved.  A node that processes such a message ends up with plock state out of sync with the other nodes, which can lead to any number of different plock problems.
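A minimal sketch of the window described above, using hypothetical names (the `ls` struct, `receive_plock`, the counters) rather than the actual dlm_controld code: because the flag becomes 1 only shortly after the join, a message arriving in the gap is processed against stale state instead of being saved.

```c
#include <assert.h>

/* Hypothetical, simplified model; field and function names are
 * illustrative, not taken from the dlm_controld source. */

struct ls {
	int saved_plocks;    /* 1 = queue incoming plock messages */
	int saved_count;     /* messages queued for later replay */
	int processed_count; /* messages processed immediately */
};

static void receive_plock(struct ls *ls)
{
	if (ls->saved_plocks)
		ls->saved_count++;      /* replayed after checkpoint read */
	else
		ls->processed_count++;  /* applied to possibly-stale state */
}

/* buggy ordering: the flag starts 0 and is raised only after the join,
 * so a message arriving in the window is processed, not saved */
static void join_buggy(struct ls *ls, int msg_in_window)
{
	ls->saved_plocks = 0;
	ls->saved_count = 0;
	ls->processed_count = 0;
	if (msg_in_window)
		receive_plock(ls);      /* window: flag not yet set */
	ls->saved_plocks = 1;           /* set "shortly after the join" */
}
```

With the fix implied by this diagnosis, `saved_plocks` would be 1 from the moment of the join, so the window closes (Comment 3 later revises the diagnosis itself).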

The test used for this is:

node1: mount /gfs; cd /gfs; lock_load -n 1000
node2: mount /gfs; cd /gfs; lock_load -n 1000
node3: mount /gfs; cd /gfs; lock_load -n 1000
node4: mount /gfs; cd /gfs; lock_load -n 1000
node1: kill lock_load; cd; umount /gfs
node2: kill lock_load; cd; umount /gfs
node3: kill lock_load; cd; umount /gfs
node1: mount /gfs; cd /gfs; lock_load -n 1000

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Actual results:

Expected results:

Additional info:

Comment 1 David Teigland 2010-08-12 21:08:19 UTC
Created attachment 438528 [details]


Comment 3 David Teigland 2010-08-13 20:14:45 UTC
The original description is an incorrect diagnosis of the problem, and the patch is an incorrect solution.

Correct diagnosis:

    After our join confchg, we need to ignore plock messages
    until the point in time where the ckpt_node saves plock
    state (final start message received).  At that time we need
    to shift from ignoring plock messages to saving plock
    messages that will be applied on top of the plock state that
    is read from the checkpoint.  The code was not ignoring
    plock messages during the first stage of the process, which
    led to incorrect plock state.
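The corrected diagnosis above amounts to a three-phase state machine for incoming plock messages: ignore until the ckpt_node saves plock state, save from then until the checkpoint is read, then process normally. A minimal sketch under that reading, with hypothetical names (the `plock_phase` enum, the handler functions) standing in for the actual dlm_controld code:

```c
#include <assert.h>

/* Hypothetical, simplified model of the three phases described above;
 * names are illustrative, not taken from the dlm_controld source. */

enum plock_phase {
	PLOCK_IGNORE,  /* after our join confchg, before ckpt_node saves state */
	PLOCK_SAVE,    /* checkpoint written; queue messages for later replay */
	PLOCK_ONLINE,  /* checkpoint read and saved messages applied */
};

struct ls {
	enum plock_phase phase;
	int saved_count;    /* messages queued during PLOCK_SAVE */
	int applied_count;  /* messages applied to plock state */
};

/* our join confchg: ignore plock messages entirely */
static void on_join_confchg(struct ls *ls)
{
	ls->phase = PLOCK_IGNORE;
	ls->saved_count = 0;
	ls->applied_count = 0;
}

/* final start message received: ckpt_node has saved plock state,
 * so from now on messages must be saved rather than ignored */
static void on_final_start(struct ls *ls)
{
	ls->phase = PLOCK_SAVE;
}

/* checkpoint read: replay saved messages on top of checkpoint state */
static void on_checkpoint_read(struct ls *ls)
{
	ls->applied_count += ls->saved_count;
	ls->saved_count = 0;
	ls->phase = PLOCK_ONLINE;
}

static void receive_plock(struct ls *ls)
{
	switch (ls->phase) {
	case PLOCK_IGNORE:
		break;                   /* already covered by checkpoint; drop */
	case PLOCK_SAVE:
		ls->saved_count++;       /* replay after checkpoint is read */
		break;
	case PLOCK_ONLINE:
		ls->applied_count++;     /* normal processing */
		break;
	}
}
```

The bug was that messages arriving in the first phase were handled as if they were in the second, so state already captured in the checkpoint was applied a second time on top of it.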

Comment 7 errata-xmlrpc 2011-05-19 12:53:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

