Bug 1448574 - lvm2 deadlock with pvcreate & vgremove running concurrently
Summary: lvm2 deadlock with pvcreate & vgremove running concurrently
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: 2.02.168
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-05 19:27 UTC by Tony Asleson
Modified: 2019-08-08 16:48 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-08 16:48:49 UTC
Embargoed:
rule-engine: lvm-technical-solution?
rule-engine: lvm-test-coverage?



Description Tony Asleson 2017-05-05 19:27:15 UTC
Description of problem:

When running multiple copies of the lvmdbustest.py unit test concurrently against the lvmdbusd
service, everything hangs quite quickly. Looking at the process list to see what was going on,
I discovered:

root     27603  1681  4 13:23 pts/0    00:00:01 python3 ./lvmdbusd --debug
root     27613  1801  0 13:23 pts/1    00:00:00 python3 ./lvmdbustest.py -v
root     27806 20917  0 13:23 pts/2    00:00:00 python3 ./lvmdbustest.py -v
root     27826  2395  0 13:23 pts/3    00:00:00 python3 ./lvmdbustest.py -v
root     29428 27613  0 13:23 pts/1    00:00:00 /usr/sbin/lvm pvcreate /dev/sdc
root     29430 27603  1 13:23 pts/0    00:00:00 /usr/sbin/lvm vgremove -f qtapifkq_vg --config global/notify_dbus=0

Then I sent both processes SIGINT:
# kill -2 29430 29428

The pvcreate, which was executed directly from the unit test case, exited with:
pvcreate failed with stdout= , stderr=   
Interrupted...
  Giving up waiting for lock.
  /run/lock/lvm/V_qtapifkq_vg: flock failed: Interrupted system call
  Can't get lock for qtapifkq_vg
  Cannot process volume group qtapifkq_vg
  Interrupted...
  Interrupted...
  Device /dev/sdc not found (or ignored by filtering).


And the vgremove, which was executed in the context of the lvm dbus service, exited with:

27603:27609 - EC= 5 for ['/usr/sbin/lvm', 'vgremove', '-f', 'qtapifkq_vg', '--config', 'global/notify_dbus=0']
STARTED: 1494008624.489977, ENDED: 1494008665.708364
STDOUT=  Logical volume "alheatvy_lv" successfully removed

STDERR=  Interrupted...
  Giving up waiting for lock.
  /run/lock/lvm/P_orphans: flock failed: Interrupted system call
  Can't get lock for orphan PVs

It appears that the pvcreate is waiting for lock:

/run/lock/lvm/V_qtapifkq_vg

And vgremove is waiting for lock:

/run/lock/lvm/P_orphans

My best guess is that pvcreate is holding the /run/lock/lvm/P_orphans lock while waiting for the
VG lock, and vgremove is holding the /run/lock/lvm/V_qtapifkq_vg lock while waiting for the
orphan lock, so the two processes are deadlocked on one another.

Processes which need to acquire multiple locks must acquire them in the same order.
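
To illustrate the suspected lock-order inversion, here is a minimal Python sketch. This is not lvm2 code: the lock files are scratch stand-ins for /run/lock/lvm/P_orphans and /run/lock/lvm/V_<vgname>, and the second lock is tried non-blocking only so the demo can report the conflict instead of hanging the way the real tools do.

import errno, fcntl, os, tempfile, threading

# Two scratch files standing in for /run/lock/lvm/P_orphans and /run/lock/lvm/V_<vgname>
lock_dir = tempfile.mkdtemp()
ORPHANS = os.path.join(lock_dir, "P_orphans")
VG = os.path.join(lock_dir, "V_testvg")
barrier = threading.Barrier(2)

def worker(name, first_path, second_path):
    # Each worker opens its own descriptors, so the flock() calls conflict
    first = open(first_path, "a")
    second = open(second_path, "a")
    fcntl.flock(first, fcntl.LOCK_EX)      # first lock: uncontended, succeeds
    barrier.wait()                         # make sure both workers now hold their first lock
    try:
        # lvm blocks here; LOCK_NB just makes the conflict visible and lets the demo exit
        fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print(name, "got both locks")
    except OSError as e:
        if e.errno not in (errno.EACCES, errno.EAGAIN):
            raise
        print(name, "would block on its second lock -> deadlocked if it waited")
    finally:
        fcntl.flock(first, fcntl.LOCK_UN)
        first.close(); second.close()

# pvcreate-like order: orphan lock then VG lock; vgremove-like order: the reverse
a = threading.Thread(target=worker, args=("pvcreate-like", ORPHANS, VG))
b = threading.Thread(target=worker, args=("vgremove-like", VG, ORPHANS))
a.start(); b.start(); a.join(); b.join()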

Version-Release number of selected component (if applicable):
# lvm version
  LVM version:     2.02.168(2) (2016-11-30)
  Library version: 1.02.137 (2016-11-30)
  Driver version:  4.35.0

How reproducible:
100%


Steps to Reproduce:
1. Start the lvmdbusd daemon, # ./lvmdbusd --debug
2. Fire up a few lvmdbustest.py instances:
   # LVM_DBUSD_PV_DEVICE_LIST=/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde ./lvmdbustest.py -v > /tmp/unit1 2>&1 &
   # LVM_DBUSD_PV_DEVICE_LIST=/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi ./lvmdbustest.py -v > /tmp/unit2 2>&1 &
   # LVM_DBUSD_PV_DEVICE_LIST=/dev/sdj:/dev/sdk:/dev/sdl:/dev/sdm ./lvmdbustest.py -v > /tmp/unit3 2>&1 &

Actual results:
Everything comes to a grinding halt


Expected results:
Unit tests run to completion


Additional info:

It likely only takes two concurrent unit test instances to reproduce this, but I haven't tested that scenario.

Comment 1 David Teigland 2017-05-05 19:46:23 UTC
This commit is partially responsible.  I don't know about the "policy of non-blocking second locks" it refers to.  We should probably move the orphan lock back to the top level regardless, since that is the right place.  The liblvm concept which motivated this commit didn't work out, and we are fixing problems that it introduced to the rest of lvm.

https://sourceware.org/git/?p=lvm2.git;a=commit;h=03660b30454ec07c09c164f7fd23af278677324a

commit 03660b30454ec07c09c164f7fd23af278677324a
Author: Dave Wysochanski <dwysocha>
Date:   Fri Jul 10 20:08:37 2009 +0000

    Move orphan lock inside vg_remove_single.
    
    Move the vg orphan lock inside vg_remove_single, now a complete liblvm
    function.  Note that this changes the order of the locks - originally
    VG_ORPHAN was obtained first, then the vgname lock.  With the current
    policy of non-blocking second locks, this could mean we get a failure
    obtaining the orphan lock.  In the case of a vg with lvs being removed,
    this could result in the lvs being removed but not the vg.  Such a
    scenario could have happened prior though with a different failure.
    Other tools were examined for side-effects, and no major problems
    were noted.

Comment 2 Alasdair Kergon 2017-05-09 11:42:34 UTC
"policy of non-blocking second locks" refers to the fact that if one lock is already held and a second lock is requested, the code never blocks waiting for that lock - if it can't obtain it immediately it fails and the caller drops the first lock before either aborting the operation or retrying.

Secondly, the locks were supposed to be ordered, with the orphan lock always acquired consistently either before or after the VG locks (I think the order got swapped at one point in the code's history), and with the VG locks always obtained in alphabetical order of VG name.
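
For illustration only (this is not the lvm2 implementation, and lock_pair is a hypothetical helper), that back-off behaviour looks roughly like this with flock(): block for the first lock, try the second non-blocking, and on failure release the first lock before sleeping and retrying.

import errno, fcntl, time

def lock_pair(first_path, second_path, retries=10, delay=0.1):
    # Block for the first lock only; the second lock is tried non-blocking,
    # and on failure the first lock is dropped before retrying, so we never
    # sleep while holding a lock that another process might need.
    for _ in range(retries):
        first = open(first_path, "a")
        fcntl.flock(first, fcntl.LOCK_EX)                  # may block; nothing else held
        second = open(second_path, "a")
        try:
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return first, second                           # caller unlocks and closes both
        except OSError as e:
            if e.errno not in (errno.EACCES, errno.EAGAIN):
                raise
            fcntl.flock(first, fcntl.LOCK_UN)              # back off: release the first lock
            first.close()
            second.close()
            time.sleep(delay)
    raise RuntimeError("giving up waiting for lock")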

We need to audit the code again to determine what the current situation has become and decide how to reimpose consistency.

Comment 3 Alasdair Kergon 2017-05-09 11:54:14 UTC
Some quick tests suggest the code is now breaking both of those rules.  I'm seeing examples where the ordering is going wrong and examples where blocking locks are requested that should have been non-blocking, so it's not surprising that a deadlock was discovered.

The current rules to which the code is supposed to adhere are documented in locking.h:

 *   If more than one lock needs to be held simultaneously, they must be
 *   acquired in alphabetical order of 'vol' (to avoid deadlocks), with
 *   VG_ORPHANS last.
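
Purely as an illustration of that rule (acquire_locks is a hypothetical helper, not lvm2 code), following it means sorting the VG names and taking the orphan lock last; lock_dir stands in for /run/lock/lvm, and the V_<vgname>/P_orphans file names match the lock files shown in the description above.

import fcntl, os

def acquire_locks(lock_dir, vg_names, want_orphans=False):
    # Take VG locks in alphabetical order of VG name, then VG_ORPHANS last
    names = ["V_" + n for n in sorted(vg_names)]
    if want_orphans:
        names.append("P_orphans")          # orphan lock is always acquired last
    held = []
    for name in names:
        f = open(os.path.join(lock_dir, name), "a")
        fcntl.flock(f, fcntl.LOCK_EX)
        held.append(f)
    return held                            # caller unlocks in reverse order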

Comment 4 Alasdair Kergon 2017-05-09 12:01:22 UTC
So we should begin by finding all the places that breach the current rules, then understand why the rules weren't followed, and reassess whether they can all be brought back into line or whether the rules need to be modified again.

Comment 5 Alasdair Kergon 2017-05-09 12:29:01 UTC
The last time the code was audited, the ordering was adhered to everywhere, so the non-blocking second-lock code was removed: the orphan lock was by then being taken last (it used to be taken first) and had no retry code, which led to avoidable failures.

commit d7cbaae1fd5bdf1e013d769a0a828ef9d04fb8ef
Author: Milan Broz <mbroz>
Date:   Wed Mar 31 17:23:56 2010 +0000

    Always use blocking lock for VGs and orphan locks.
    
    Because we have now strong rule for lock ordering:
     - VG locks must be taken in alphabetical order
     - ORPHAN locks must be the last
    vgs_locked() is now not needed.

So:
  1) find all the places that break the current lock ordering rules in locking.h
  2) work out if it's reasonable to fix them to adhere to those rules
  3) if not, then temporarily reinstate the "2nd lock is non-blocking" code, albeit reintroducing the problem its removal fixed, until a new long-term solution is found

Comment 6 David Teigland 2019-08-08 16:41:08 UTC
This is a design problem in lvm locking: it uses two different "global" locks (GLOBAL and ORPHAN) and uses them inconsistently.  It is fixed in

https://sourceware.org/git/?p=lvm2.git;a=commit;h=8c87dda195ffadcce1e428d3481e8d01080e2b22

That change is not suitable for backporting to 2.02 releases.

