Bug 450132
Summary: | dlm: fixes for recovery of user lockspace | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | David Teigland <teigland> |
Component: | kernel | Assignee: | Don Zickus <dzickus> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 5.2 | CC: | ccaulfie, edamato, lwang |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2009-01-20 20:22:03 UTC | Type: | --- |
Description
David Teigland
2008-06-05 14:46:48 UTC
Fixed in nine upstream commits:

From 8a358ca8e738b6226b004efea462ac28c0a2bbb1 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Mon, 7 Jan 2008 15:55:18 -0600
Subject: [PATCH] dlm: clear ast_type when removing from astqueue

The lkb_ast_type field indicates whether the lkb is on the astqueue list. When clearing locks for a process, lkb's were being removed from the astqueue list without clearing the field. If release_lockspace then happened immediately afterward, it could try to remove the lkb from the list a second time.

Appears when a process calls libdlm dlm_release_lockspace(), which first closes the ls dev, triggering clear_proc_locks, and then removes the ls (a write to control dev), causing release_lockspace().

Signed-off-by: David Teigland <teigland>

From 601342ce022b964f756b67f2eb99b605c1afa3ed Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Mon, 7 Jan 2008 16:15:05 -0600
Subject: [PATCH] dlm: recover locks waiting for overlap replies

When recovery looks at locks waiting for replies, it fails to consider locks that have already received a reply for their first remote operation, but not received a reply for a secondary, overlapping unlock/cancel. The appropriate stub reply needs to be called for these waiters.

Appears when we start doing recovery in the presence of many overlapping unlock/cancel ops.

Signed-off-by: David Teigland <teigland>

From aec64e1be2225c6fc64499594d23257c6adf6168 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Tue, 8 Jan 2008 15:37:47 -0600
Subject: [PATCH] dlm: another call to confirm_master in receive_request_reply

When a failed request (EBADR or ENOTBLK) is unlocked/canceled instead of retried, there may be other lkb's waiting on the rsb_lookup list for it to complete. A call to confirm_master() is needed to move on to the next waiting lkb since the current one won't be retried.

Signed-off-by: David Teigland <teigland>

From 46b43eed7018bab3a4e8c259eed27697b9170cb8 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Tue, 8 Jan 2008 16:24:00 -0600
Subject: [PATCH] dlm: reject messages from non-members

Messages from nodes that are no longer members of the lockspace should be ignored. When nodes are removed from the lockspace, recovery can sometimes complete quickly enough that messages arrive from a removed node after recovery has completed. When processed, these messages would often cause an error message, and could in some cases change some state, causing problems.

Signed-off-by: David Teigland <teigland>

From c54e04b00fe027da30ada5af76b6749772dd644a Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 9 Jan 2008 09:59:41 -0600
Subject: [PATCH] dlm: validate messages before processing

There was some hit and miss validation of messages that has now been cleaned up and unified. Before processing a message, the new validate_message() function checks that the lkb is the appropriate type, process-copy or master-copy, and that the message is from the correct nodeid for the given lkb. Other checks and assertions on the lkb type and nodeid have been removed. The assertions were particularly bad since they would panic the machine instead of just ignoring the bad message.

Although other recent patches have made processing old messages unlikely, it still may be possible for an old message to be processed and caught by these checks.

Signed-off-by: David Teigland <teigland>

From 42dc1601a9a31e8da767a4a9c37bad844b3698ab Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 9 Jan 2008 10:30:45 -0600
Subject: [PATCH] dlm: reject normal unlock when lock is waiting for lookup

Non-forced unlocks should be rejected if the lock is waiting on the rsb_lookup list for another lock to establish the master node.

Signed-off-by: David Teigland <teigland>

From 755b5eb8bac90b35dc901465a06081aaad94e9ae Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 9 Jan 2008 10:37:39 -0600
Subject: [PATCH] dlm: limit dir lookup loop

In a rare case we may need to repeat a local resource directory lookup due to a race with removing the rsb and removing the resdir record. We'll never need to do more than a single additional lookup, though, so the infinite loop around the lookup can be removed. In addition to being unnecessary, the infinite loop is dangerous since some other unknown condition may appear, causing the loop to never break.

Signed-off-by: David Teigland <teigland>

From ce5246b972f7514af899a63c0faf831d05ed5ee1 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Mon, 14 Jan 2008 15:48:58 -0600
Subject: [PATCH] dlm: fix possible use-after-free

The dlm_put_lkb() can free the lkb and its associated ua structure, so we can't depend on using the ua struct after the put.

Signed-off-by: David Teigland <teigland>

From 594199ebaae5d77f025974dfcfa6651cc81325a8 Mon Sep 17 00:00:00 2001
From: David Teigland <teigland>
Date: Wed, 16 Jan 2008 11:03:41 -0600
Subject: [PATCH] dlm: change error message to debug

The invalid lockspace messages are normal and can appear relatively often. They should be suppressed without debugging enabled.

Signed-off-by: David Teigland <teigland>

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
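For readers unfamiliar with the check described in the "dlm: validate messages before processing" commit above, the following is a minimal user-space sketch of the idea, not the kernel code. The struct layouts, enum values, and field names (`struct lkb_sketch`, `struct msg_sketch`, `MSG_*`) are hypothetical stand-ins chosen for illustration; only the function name validate_message() and the two conditions it checks (lkb type and sender nodeid) come from the commit text. The split between request and reply message types is an assumption about how such a check would naturally be organized.

```c
#include <errno.h>

/* Hypothetical message types; the real dlm message set is larger. */
enum msg_type_sketch {
    MSG_CONVERT,        /* request sent to the master node */
    MSG_UNLOCK,         /* request sent to the master node */
    MSG_CANCEL,         /* request sent to the master node */
    MSG_CONVERT_REPLY,  /* reply sent back to the lock holder */
    MSG_UNLOCK_REPLY,   /* reply sent back to the lock holder */
    MSG_GRANT           /* notification sent to the lock holder */
};

/* Hypothetical, simplified lock block. */
struct lkb_sketch {
    int is_master_copy;   /* lkb held on the master for a remote lock */
    int is_process_copy;  /* lkb held locally for a remotely mastered lock */
    int nodeid;           /* the remote node this lkb is paired with */
};

/* Hypothetical, simplified message header. */
struct msg_sketch {
    enum msg_type_sketch type;
    int from_nodeid;      /* node that sent the message */
};

/*
 * Return 0 if the message is plausible for this lkb, -EINVAL otherwise.
 * A bad message is reported to the caller so it can be ignored; nothing
 * here asserts or aborts.
 */
static int validate_message(const struct lkb_sketch *lkb,
                            const struct msg_sketch *ms)
{
    switch (ms->type) {
    case MSG_CONVERT:
    case MSG_UNLOCK:
    case MSG_CANCEL:
        /* Requests must target a master copy owned by the sender. */
        if (!lkb->is_master_copy || lkb->nodeid != ms->from_nodeid)
            return -EINVAL;
        break;
    case MSG_CONVERT_REPLY:
    case MSG_UNLOCK_REPLY:
    case MSG_GRANT:
        /* Replies must target a process copy mastered on the sender. */
        if (!lkb->is_process_copy || lkb->nodeid != ms->from_nodeid)
            return -EINVAL;
        break;
    default:
        return -EINVAL;
    }
    return 0;
}
```

A caller in this sketch would run validate_message() first and simply drop the message on a nonzero return, which mirrors the commit's point that a bad message should be ignored rather than trigger a panic.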
Patches posted to rhkernel-list:
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00079.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00080.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00081.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00082.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00083.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00084.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00085.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00086.html
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-June/msg00087.html

Included in kernel-2.6.18-97.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html