Bug 450132

Summary: dlm: fixes for recovery of user lockspace
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: low
Target Milestone: rc
Reporter: David Teigland <teigland>
Assignee: Don Zickus <dzickus>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, edamato, lwang
Doc Type: Bug Fix
Last Closed: 2009-01-20 20:22:03 UTC

Description David Teigland 2008-06-05 14:46:48 UTC
Description of problem:

Fix bugs when userland apps using the dlm join/leave the lockspace,
causing recovery.

Comment 1 David Teigland 2008-06-05 15:13:59 UTC
fixed in nine upstream commits:

From 8a358ca8e738b6226b004efea462ac28c0a2bbb1 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Mon, 7 Jan 2008 15:55:18 -0600
Subject: [PATCH] dlm: clear ast_type when removing from astqueue

The lkb_ast_type field indicates whether the lkb is on the astqueue list.
When clearing locks for a process, lkb's were being removed from the astqueue
list without clearing the field.  If release_lockspace then happened
immediately afterward, it could try to remove the lkb from the list a second
time.

Appears when a process calls libdlm dlm_release_lockspace(), which first
closes the ls dev, triggering clear_proc_locks, and then removes the ls
(a write to the control dev), causing release_lockspace().

Signed-off-by: David Teigland <teigland>
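
The idea behind the commit above can be sketched in userspace C. This is only an illustration, not the kernel code: struct node, struct lkb, queue_ast() and remove_ast() are hypothetical stand-ins for the kernel's list_head, struct dlm_lkb and the astqueue handling. The point it shows is that the "on the queue" flag is cleared in the same step as the list removal, so a later caller sees the flag at zero and does not remove the entry a second time.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct node { struct node *prev, *next; };

/* Hypothetical, reduced lkb: only the fields the sketch needs. */
struct lkb {
    struct node astqueue;   /* link on the AST queue */
    int ast_type;           /* nonzero means "currently on the AST queue" */
};

static void list_init(struct node *h) { h->prev = h->next = h; }

static void list_add_tail(struct node *n, struct node *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;   /* a second list_del() would now crash */
}

/* Queue the lkb once; further requests only add bits to the flag. */
static void queue_ast(struct lkb *lkb, struct node *q, int type)
{
    if (!lkb->ast_type)
        list_add_tail(&lkb->astqueue, q);
    lkb->ast_type |= type;
}

/* Remove the lkb only if the flag says it is queued, and clear the
 * flag together with the removal.  Returns 1 if it was removed. */
static int remove_ast(struct lkb *lkb)
{
    if (!lkb->ast_type)
        return 0;
    list_del(&lkb->astqueue);
    lkb->ast_type = 0;
    return 1;
}
```

With the flag cleared, calling remove_ast() once from a clear-locks path and again from a release path is harmless: the second call returns 0 without touching the list.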


From 601342ce022b964f756b67f2eb99b605c1afa3ed Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Mon, 7 Jan 2008 16:15:05 -0600
Subject: [PATCH] dlm: recover locks waiting for overlap replies

When recovery looks at locks waiting for replies, it fails to consider
locks that have already received a reply for their first remote operation,
but have not received a reply for a secondary, overlapping unlock/cancel.  The
appropriate stub reply needs to be called for these waiters.

Appears when we start doing recovery in the presence of many overlapping
unlock/cancel ops.

Signed-off-by: David Teigland <teigland>


From aec64e1be2225c6fc64499594d23257c6adf6168 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Tue, 8 Jan 2008 15:37:47 -0600
Subject: [PATCH] dlm: another call to confirm_master in receive_request_reply

When a failed request (EBADR or ENOTBLK) is unlocked/canceled instead of
retried, there may be other lkb's waiting on the rsb_lookup list for it
to complete.  A call to confirm_master() is needed to move on to the next
waiting lkb since the current one won't be retried.

Signed-off-by: David Teigland <teigland>


From 46b43eed7018bab3a4e8c259eed27697b9170cb8 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Tue, 8 Jan 2008 16:24:00 -0600
Subject: [PATCH] dlm: reject messages from non-members

Messages from nodes that are no longer members of the lockspace should be
ignored.  When nodes are removed from the lockspace, recovery can
sometimes complete quickly enough that messages arrive from a removed node
after recovery has completed.  When processed, these messages would often
cause an error message, and could in some cases change some state, causing
problems.

Signed-off-by: David Teigland <teigland>
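
A minimal sketch of the check described in the commit above, assuming a lockspace that keeps a plain array of member nodeids. The types and function names (struct lockspace, struct message, is_member(), receive_message()) are hypothetical and simplified; the real dlm structures differ. The sketch only shows the ordering: membership is tested before any message processing, and a message from a non-member is dropped rather than handled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified lockspace: the current member nodeids. */
struct lockspace {
    const int *members;
    size_t num_members;
};

/* Hypothetical, simplified message header. */
struct message {
    int from_nodeid;
    int type;
};

static bool is_member(const struct lockspace *ls, int nodeid)
{
    for (size_t i = 0; i < ls->num_members; i++)
        if (ls->members[i] == nodeid)
            return true;
    return false;
}

/* Returns 0 if the message was accepted for processing,
 * -1 if it was dropped because the sender is not a member. */
static int receive_message(const struct lockspace *ls,
                           const struct message *ms)
{
    if (!is_member(ls, ms->from_nodeid))
        return -1;      /* late message from a removed node: ignore */

    /* ... normal message processing would go here ... */
    return 0;
}
```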


From c54e04b00fe027da30ada5af76b6749772dd644a Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 9 Jan 2008 09:59:41 -0600
Subject: [PATCH] dlm: validate messages before processing

There was some hit and miss validation of messages that has now been
cleaned up and unified.  Before processing a message, the new
validate_message() function checks that the lkb is the appropriate type,
process-copy or master-copy, and that the message is from the correct
nodeid for the given lkb.  Other checks and assertions on the
lkb type and nodeid have been removed.  The assertions were particularly
bad since they would panic the machine instead of just ignoring the bad
message.

Although other recent patches have made processing old messages unlikely,
it still may be possible for an old message to be processed and caught
by these checks.

Signed-off-by: David Teigland <teigland>
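
The two checks named in the commit above (lkb type and sender nodeid) might look roughly like the following. This is a sketch under assumptions, not the actual validate_message(): the message types, the is_master_copy and peer_nodeid fields, and the choice of -EINVAL are all invented for illustration. What it does reflect from the commit text is that a bad message yields an error return that the caller can ignore, instead of an assertion that panics the machine.

```c
#include <errno.h>

/* Hypothetical message types.  The first group is sent to the node
 * holding the master copy, the second to the process copy. */
enum msg_type {
    MSG_REQUEST, MSG_CONVERT, MSG_UNLOCK, MSG_CANCEL,
    MSG_REQUEST_REPLY, MSG_GRANT, MSG_BAST
};

/* Hypothetical, reduced lkb. */
struct lkb {
    int is_master_copy;   /* 1: master copy, 0: process copy */
    int peer_nodeid;      /* the node expected at the other end */
};

static int sent_to_master(enum msg_type type)
{
    switch (type) {
    case MSG_REQUEST:
    case MSG_CONVERT:
    case MSG_UNLOCK:
    case MSG_CANCEL:
        return 1;
    default:
        return 0;
    }
}

/* Returns 0 if the message may be processed, -EINVAL otherwise. */
static int validate_message(const struct lkb *lkb, enum msg_type type,
                            int from_nodeid)
{
    /* Check 1: the lkb must be the copy this message type targets. */
    if (lkb->is_master_copy != sent_to_master(type))
        return -EINVAL;

    /* Check 2: the message must come from the expected node. */
    if (lkb->peer_nodeid != from_nodeid)
        return -EINVAL;

    return 0;
}
```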


From 42dc1601a9a31e8da767a4a9c37bad844b3698ab Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 9 Jan 2008 10:30:45 -0600
Subject: [PATCH] dlm: reject normal unlock when lock is waiting for lookup

Non-forced unlocks should be rejected if the lock is waiting on the
rsb_lookup list for another lock to establish the master node.

Signed-off-by: David Teigland <teigland>
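
As a rough illustration of the rule in the commit above: the flag name UNLOCK_FORCE, the waiting_for_lookup field and the -EBUSY return value below are assumptions made for the sketch, and the handling of a forced unlock is reduced to "allowed to proceed". Only the rule itself, that a non-forced unlock is rejected while the lock waits on the lookup list, comes from the commit text.

```c
#include <errno.h>

#define UNLOCK_FORCE 0x1u   /* hypothetical "forced unlock" flag */

/* Hypothetical, reduced lkb. */
struct lkb {
    int waiting_for_lookup;   /* 1 while queued on the rsb lookup list */
};

/* Returns 0 if the unlock may proceed, -EBUSY if it is rejected. */
static int validate_unlock(const struct lkb *lkb, unsigned int flags)
{
    if (lkb->waiting_for_lookup && !(flags & UNLOCK_FORCE))
        return -EBUSY;
    return 0;
}
```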


From 755b5eb8bac90b35dc901465a06081aaad94e9ae Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 9 Jan 2008 10:37:39 -0600
Subject: [PATCH] dlm: limit dir lookup loop

In a rare case we may need to repeat a local resource directory lookup
due to a race with removing the rsb and removing the resdir record.
We'll never need to do more than a single additional lookup, though,
so the infinite loop around the lookup can be removed.  In addition
to being unnecessary, the infinite loop is dangerous since some other
unknown condition may appear causing the loop to never break.

Signed-off-by: David Teigland <teigland>
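
The change described above, replacing an unbounded retry loop with at most one additional lookup, can be sketched as follows. The callback type, the use of -EAGAIN to signal the race, and racy_lookup() are hypothetical and exist only to make the example self-contained. Returning the error to the caller when the second attempt also fails is an assumption of this sketch.

```c
#include <errno.h>

/* Hypothetical lookup callback: returns 0 and fills *nodeid_out on
 * success, or -EAGAIN if it raced with a removal. */
typedef int (*lookup_fn)(void *ctx, int *nodeid_out);

/* Try the lookup, and repeat it at most once if it reports a race.
 * An unexpected condition yields an error instead of a hang. */
static int lookup_with_one_retry(lookup_fn lookup, void *ctx,
                                 int *nodeid_out)
{
    int rv = -EAGAIN;

    for (int tries = 0; tries < 2 && rv == -EAGAIN; tries++)
        rv = lookup(ctx, nodeid_out);
    return rv;
}

/* Hypothetical lookup for demonstration: reports a race until the
 * counter behind ctx reaches zero, then returns nodeid 3. */
static int racy_lookup(void *ctx, int *nodeid_out)
{
    int *races_left = ctx;

    if (*races_left > 0) {
        (*races_left)--;
        return -EAGAIN;
    }
    *nodeid_out = 3;
    return 0;
}
```

With one simulated race the second attempt succeeds; with more, the function gives up after two attempts and hands -EAGAIN back to the caller.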


From ce5246b972f7514af899a63c0faf831d05ed5ee1 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Mon, 14 Jan 2008 15:48:58 -0600
Subject: [PATCH] dlm: fix possible use-after-free

The dlm_put_lkb() can free the lkb and its associated ua structure,
so we can't depend on using the ua struct after the put.

Signed-off-by: David Teigland <teigland>
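
The pattern behind the fix above is to read whatever is needed from the ua structure before dropping the reference. The sketch below uses hypothetical userspace stand-ins (struct user_args, struct lkb, new_lkb(), put_lkb(), finish_and_put()) rather than the kernel's types, and a plain integer reference count.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the ua structure and the lkb. */
struct user_args { int want_ast; };

struct lkb {
    int refcount;
    struct user_args *ua;
};

static struct lkb *new_lkb(int want_ast)
{
    struct lkb *lkb = malloc(sizeof(*lkb));
    struct user_args *ua = malloc(sizeof(*ua));

    if (!lkb || !ua) {
        free(lkb);
        free(ua);
        return NULL;
    }
    ua->want_ast = want_ast;
    lkb->refcount = 1;
    lkb->ua = ua;
    return lkb;
}

/* Drop one reference.  On the last put, the lkb and its ua are
 * freed.  Returns 1 if the lkb was freed. */
static int put_lkb(struct lkb *lkb)
{
    if (--lkb->refcount > 0)
        return 0;
    free(lkb->ua);
    free(lkb);
    return 1;
}

/* Copy the needed field to a local first; after put_lkb() neither
 * lkb nor lkb->ua may be dereferenced. */
static int finish_and_put(struct lkb *lkb)
{
    int want_ast = lkb->ua->want_ast;

    put_lkb(lkb);
    return want_ast;
}
```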


From 594199ebaae5d77f025974dfcfa6651cc81325a8 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 16 Jan 2008 11:03:41 -0600
Subject: [PATCH] dlm: change error message to debug

The invalid lockspace messages are normal and can appear relatively
often.  They should be suppressed unless debugging is enabled.

Signed-off-by: David Teigland <teigland>


Comment 2 RHEL Program Management 2008-06-05 16:34:33 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Don Zickus 2008-07-16 15:48:18 UTC
in kernel-2.6.18-97.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 9 errata-xmlrpc 2009-01-20 20:22:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html