Bug 623810

Summary: dlm_controld: ignore plocks until checkpoint time
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Target Milestone: rc
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, djansa, fdinitto, lhh, rpeterso, ssaha, teigland
Doc Type: Bug Fix
Fixed In Version: cluster-3.0.12-25.el6
Last Closed: 2011-05-19 12:53:30 UTC

Attachments: patch

Description David Teigland 2010-08-12 20:37:08 UTC
Description of problem:

When dlm_controld joins a cpg and begins receiving plock messages, it needs to save all of those plock messages for processing after it initializes plock state from a checkpoint.  Instead of starting at 1, saved_plocks started at 0 and was only set to 1 shortly after the join.  This left a short window in which a plock message could arrive and be processed immediately instead of being saved, leaving that node's plock state out of sync with the other nodes, which could lead to any number of different plock problems.
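
For illustration, a minimal C sketch of the save-vs-process logic this description refers to.  The names (saved_plocks, save_message, receive_plock, struct lockspace) follow the wording above, but the code is a simplified stand-in, not the actual dlm_controld source:

#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins; the real structures live in dlm_controld's plock code. */
struct plock_msg {
	char data[64];                  /* message payload, details omitted */
};

struct msg_entry {
	struct plock_msg msg;
	struct msg_entry *next;
};

struct lockspace {
	int saved_plocks;               /* 1 = queue incoming plock messages */
	struct msg_entry *saved;        /* queued messages, replayed after the
	                                   checkpoint is read */
};

static void save_message(struct lockspace *ls, struct plock_msg *msg)
{
	struct msg_entry *e = malloc(sizeof(*e));
	if (!e)
		return;
	memcpy(&e->msg, msg, sizeof(*msg));
	e->next = ls->saved;
	ls->saved = e;
}

static void process_plock(struct lockspace *ls, struct plock_msg *msg)
{
	/* apply the plock operation to local state (omitted) */
	(void)ls;
	(void)msg;
}

/*
 * Called for every plock message delivered by the cpg.  If saved_plocks
 * starts at 0 and is only set to 1 shortly after the join, a message
 * arriving in that window is processed immediately instead of being
 * queued, leaving this node's plock state out of sync with the others.
 */
static void receive_plock(struct lockspace *ls, struct plock_msg *msg)
{
	if (ls->saved_plocks)
		save_message(ls, msg);
	else
		process_plock(ls, msg);
}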

The test used for this is:

node1: mount /gfs; cd /gfs; lock_load -n 1000
node2: mount /gfs; cd /gfs; lock_load -n 1000
node3: mount /gfs; cd /gfs; lock_load -n 1000
node4: mount /gfs; cd /gfs; lock_load -n 1000
node1: kill lock_load; cd; umount /gfs
node2: kill lock_load; cd; umount /gfs
node3: kill lock_load; cd; umount /gfs
node1: mount /gfs; cd /gfs; lock_load -n 1000


Comment 1 David Teigland 2010-08-12 21:08:19 UTC
Created attachment 438528: patch

fix

Comment 3 David Teigland 2010-08-13 20:14:45 UTC
The original description is an incorrect diagnosis of the problem, and the patch is an incorrect solution.

Correct diagnosis:

    After our join confchg, we need to ignore plock messages
    until the point in time where the ckpt_node saves plock
    state (final start message received).  At that time we need
    to shift from ignoring plock messages to saving plock
    messages that will be applied on top of the plock state that
    is read from the checkpoint.  The code was not ignoring
    plock messages during the first stage of the process, which
    led to incorrect plock state.
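
As an illustration of this corrected handling, a hedged C sketch of the three stages described above (ignore until the final start, then save, then process).  The state names and helper functions are invented for this sketch and are not taken from the actual patch:

/* Illustrative receive-side states for a lockspace that has just joined
   the cpg; names are made up for this sketch, not taken from the patch. */
enum plock_recv_state {
	PLOCK_IGNORE,   /* after our join confchg, before the final start message:
	                   these plocks will already be reflected in the checkpoint
	                   the ckpt_node writes, so drop them */
	PLOCK_SAVE,     /* final start received, ckpt_node has saved plock state:
	                   queue messages to apply on top of the checkpoint */
	PLOCK_PROCESS,  /* checkpoint read and saved messages replayed:
	                   handle plock messages normally */
};

struct lockspace {
	enum plock_recv_state plock_state;
	/* queued messages, checkpoint handle, etc. omitted */
};

void on_join_confchg(struct lockspace *ls)
{
	ls->plock_state = PLOCK_IGNORE;      /* first stage: ignore */
}

void on_final_start(struct lockspace *ls)
{
	ls->plock_state = PLOCK_SAVE;        /* second stage: save for replay */
}

void on_checkpoint_applied(struct lockspace *ls)
{
	ls->plock_state = PLOCK_PROCESS;     /* normal operation */
}

void receive_plock(struct lockspace *ls, void *msg)
{
	switch (ls->plock_state) {
	case PLOCK_IGNORE:
		break;                           /* covered by the checkpoint */
	case PLOCK_SAVE:
		/* queue msg for replay after reading the checkpoint (omitted) */
		break;
	case PLOCK_PROCESS:
		/* apply msg to local plock state (omitted) */
		break;
	}
	(void)msg;
}

The bug was the missing first stage: without it, a message arriving between our join confchg and the final start was applied directly, even though its effect is also captured in the checkpoint we later read, leading to incorrect plock state.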

Comment 7 errata-xmlrpc 2011-05-19 12:53:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html