Bug 623810 - dlm_controld: ignore plocks until checkpoint time
Summary: dlm_controld: ignore plocks until checkpoint time
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All Linux
Target Milestone: rc
Assignee: David Teigland
QA Contact: Cluster QE
Depends On:
Reported: 2010-08-12 20:37 UTC by David Teigland
Modified: 2011-05-19 12:53 UTC (History)
8 users

Fixed In Version: cluster-3.0.12-25.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2011-05-19 12:53:30 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
patch (2.37 KB, text/plain)
2010-08-12 21:08 UTC, David Teigland
no flags Details

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0537 normal SHIPPED_LIVE cluster and gfs2-utils bug fix update 2011-05-18 17:57:40 UTC

Description David Teigland 2010-08-12 20:37:08 UTC
Description of problem:

When dlm_controld joins a cpg and begins receiving plock messages, it needs to save all of those messages and process them only after it has initialized its plock state from a checkpoint.  Instead of being initialized to 1 at the join, saved_plocks started as 0 and was set to 1 only shortly after the join.  This left a short window in which a plock message could arrive and be processed immediately rather than saved.  A node that processes such a message ends up with plock state out of sync with the other nodes, which can lead to any number of different plock problems.
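A minimal sketch of the window described above, using hypothetical names (the `ls` struct, `receive_plock`, the counters) rather than the actual dlm_controld code: because the flag becomes 1 only shortly after the join, a message arriving in the gap is processed against stale state instead of being saved.

```c
#include <assert.h>

/* Hypothetical, simplified model; field and function names are
 * illustrative, not taken from the dlm_controld source. */

struct ls {
	int saved_plocks;    /* 1 = queue incoming plock messages */
	int saved_count;     /* messages queued for later replay */
	int processed_count; /* messages processed immediately */
};

static void receive_plock(struct ls *ls)
{
	if (ls->saved_plocks)
		ls->saved_count++;      /* replayed after checkpoint read */
	else
		ls->processed_count++;  /* applied to possibly-stale state */
}

/* buggy ordering: the flag starts 0 and is raised only after the join,
 * so a message arriving in the window is processed, not saved */
static void join_buggy(struct ls *ls, int msg_in_window)
{
	ls->saved_plocks = 0;
	ls->saved_count = 0;
	ls->processed_count = 0;
	if (msg_in_window)
		receive_plock(ls);      /* window: flag not yet set */
	ls->saved_plocks = 1;           /* set "shortly after the join" */
}
```

With the fix implied by this diagnosis, `saved_plocks` would be 1 from the moment of the join, so the window closes (Comment 3 later revises the diagnosis itself).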

The test used for this is:

node1: mount /gfs; cd /gfs; lock_load -n 1000
node2: mount /gfs; cd /gfs; lock_load -n 1000
node3: mount /gfs; cd /gfs; lock_load -n 1000
node4: mount /gfs; cd /gfs; lock_load -n 1000
node1: kill lock_load; cd; umount /gfs
node2: kill lock_load; cd; umount /gfs
node3: kill lock_load; cd; umount /gfs
node1: mount /gfs; cd /gfs; lock_load -n 1000

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Actual results:

Expected results:

Additional info:

Comment 1 David Teigland 2010-08-12 21:08:19 UTC
Created attachment 438528 [details]


Comment 3 David Teigland 2010-08-13 20:14:45 UTC
The original description is an incorrect diagnosis of the problem, and the patch is an incorrect solution.

Correct diagnosis:

    After our join confchg, we need to ignore plock messages
    until the point in time where the ckpt_node saves plock
    state (final start message received).  At that time we need
    to shift from ignoring plock messages to saving plock
    messages that will be applied on top of the plock state that
    is read from the checkpoint.  The code was not ignoring
    plock messages during the first stage of the process, which
    led to incorrect plock state.
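The corrected diagnosis above amounts to a three-phase state machine for incoming plock messages: ignore until the ckpt_node saves plock state, save from then until the checkpoint is read, then process normally. A minimal sketch under that reading, with hypothetical names (the `plock_phase` enum, the handler functions) standing in for the actual dlm_controld code:

```c
#include <assert.h>

/* Hypothetical, simplified model of the three phases described above;
 * names are illustrative, not taken from the dlm_controld source. */

enum plock_phase {
	PLOCK_IGNORE,  /* after our join confchg, before ckpt_node saves state */
	PLOCK_SAVE,    /* checkpoint written; queue messages for later replay */
	PLOCK_ONLINE,  /* checkpoint read and saved messages applied */
};

struct ls {
	enum plock_phase phase;
	int saved_count;    /* messages queued during PLOCK_SAVE */
	int applied_count;  /* messages applied to plock state */
};

/* our join confchg: ignore plock messages entirely */
static void on_join_confchg(struct ls *ls)
{
	ls->phase = PLOCK_IGNORE;
	ls->saved_count = 0;
	ls->applied_count = 0;
}

/* final start message received: ckpt_node has saved plock state,
 * so from now on messages must be saved rather than ignored */
static void on_final_start(struct ls *ls)
{
	ls->phase = PLOCK_SAVE;
}

/* checkpoint read: replay saved messages on top of checkpoint state */
static void on_checkpoint_read(struct ls *ls)
{
	ls->applied_count += ls->saved_count;
	ls->saved_count = 0;
	ls->phase = PLOCK_ONLINE;
}

static void receive_plock(struct ls *ls)
{
	switch (ls->phase) {
	case PLOCK_IGNORE:
		break;                   /* already covered by checkpoint; drop */
	case PLOCK_SAVE:
		ls->saved_count++;       /* replay after checkpoint is read */
		break;
	case PLOCK_ONLINE:
		ls->applied_count++;     /* normal processing */
		break;
	}
}
```

The bug was that messages arriving in the first phase were handled as if they were in the second, so state already captured in the checkpoint was applied a second time on top of it.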

Comment 7 errata-xmlrpc 2011-05-19 12:53:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

