Bug 1379793

Summary: [RFE] better error when unable to grab VG lock on one node
Product: Red Hat Enterprise Linux 7
Component: lvm2
Sub component: LVM lock daemon / lvmlockd
Version: 7.3
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: unspecified
Keywords: FutureFeature
Target Milestone: rc
Reporter: Corey Marthaler <cmarthal>
Assignee: LVM and device-mapper development team <lvm-team>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, prajnoha, teigland, zkabelac
Type: Bug
Last Closed: 2020-09-09 15:09:32 UTC

Description Corey Marthaler 2016-09-27 16:44:05 UTC
Description of problem:
This can happen when attempting to act on many volumes within the VG from both nodes: one node is able to do the tasks, while the other fails with 'error -11'. An easier way to cause this currently is to leave a command waiting at a prompt while it holds the VG lock.


[root@harding-03 ~]#  lvremove vg/`hostname`-2
Do you really want to remove active logical volume vg/harding-03.lab.msp.redhat.com-2? [y/n]: 

# While the prompt above is still waiting and holding the VG lock...

[root@harding-02 ~]#  lvremove vg/`hostname`-2
  Retrying ex lock on VG vg
  Retrying ex lock on VG vg
  Retrying ex lock on VG vg
  VG vg lock failed: error -11
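
For reference, the first reproduction path mentioned above (acting on many volumes in the shared VG from both nodes at once) can be approximated with something like the following hypothetical burst; the VG name "vg" comes from the transcript, the LV names and size are made up:

# Run on both nodes at roughly the same time; each lvcreate takes the
# VG ex lock, so enough parallel commands will make some of them retry.
[root@harding-02 ~]# for i in $(seq 1 20); do lvcreate -an -L 100m -n stress-`hostname -s`-$i vg & done; wait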


Version-Release number of selected component (if applicable):
3.10.0-510.el7.x86_64

lvm2-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
lvm2-libs-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
lvm2-cluster-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-libs-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-event-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-event-libs-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016

Comment 2 Corey Marthaler 2017-09-29 14:10:22 UTC
This is much easier to reproduce now that we've added more metadata-change-intensive testing for 7.5. That test keeps a continuous stream of metadata operation commands running on different nodes in the cluster simultaneously, so it's not a likely "real world" scenario.
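
Roughly, the load looks like the following hypothetical loop run on every node at once (the VG name lock_stress and the lvconvert calls match the log below; the LV naming and size are illustrative):

  while true; do
      lv=`hostname -s`.$$
      lvcreate -L 500m -n $lv lock_stress
      lvconvert --type mirror -m 1 lock_stress/$lv
      lvconvert -m 0 lock_stress/$lv
      lvremove -f lock_stress/$lv
  done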


Sep 28 14:22:29 host-040 qarshd[21990]: Running cmdline: lvconvert --type mirror -m 1 lock_stress/host-040.16366

Up converting linear to a 1 redundant legged mirror on host-040
  Retrying ex lock on VG lock_stress
  Retrying ex lock on VG lock_stress
  Retrying ex lock on VG lock_stress
  VG lock_stress lock failed: error -11
couldn't up convert cmirror on host-040


3.10.0-693.2.1.el7.x86_64
lvm2-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-libs-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-cluster-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017
cmirror-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
sanlock-3.5.0-1.el7    BUILT: Wed Apr 26 09:37:30 CDT 2017
sanlock-lib-3.5.0-1.el7    BUILT: Wed Apr 26 09:37:30 CDT 2017
lvm2-lockd-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017

Comment 3 Corey Marthaler 2017-10-05 16:37:30 UTC
Some additional lvmpolld messages during the failure...

Creating a 1 redundant legged cmirror named host-034.15783
  Logical volume "host-034.15783" created.

Down converting cmirror from 1 legs to linear on host-034
  Retrying ex lock on VG lock_stress
  Logical volume lock_stress/host-034.15783 converted.

Up converting linear to a 3 redundant legged cmirror on host-034
  Retrying ex lock on VG lock_stress
  Logical volume lock_stress/host-034.15783 being converted.
  lock_stress/host-034.15783: Converted: 64.00%
  lock_stress/host-034.15783: Converted: 100.00%
  lock_stress/host-034.15783: Converted: 100.00%
  lock_stress/host-034.15783: Converted: 100.00%
  For more information see lvmpolld messages in syslog or lvmpolld log file.
  LVM command executed by lvmpolld failed.
couldn't up convert cmirror on host-034


Oct  5 00:32:49 host-034 qarshd[16606]: Running cmdline: lvs lock_stress/host-034.15763 --noheadings -o copy_percent | awk {'print $1'}
Oct  5 00:32:50 host-034 qarshd[16607]: Running cmdline: lvs lock_stress/host-034.15765 --noheadings -o copy_percent | awk {'print $1'}
Oct  5 00:32:55 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:01 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:07 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:13 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:19 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:24 host-034 systemd: Started qarsh Per-Connection Server (10.15.80.218:48300).
Oct  5 00:33:24 host-034 systemd: Starting qarsh Per-Connection Server (10.15.80.218:48300)...
Oct  5 00:33:24 host-034 qarshd[16639]: Talking to peer ::ffff:10.15.80.218:48300 (IPv6)
Oct  5 00:33:24 host-034 qarshd[16639]: Running cmdline: lvconvert --type mirror -m 3 --corelog lock_stress/host-034.15758
Oct  5 00:33:24 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:31 host-034 lvmpolld[16309]: LVMPOLLD: lvm2 cmd (PID 16516) failed (retcode: 5)
Oct  5 00:33:31 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  VG lock_stress lock failed: error -11'
Oct  5 00:33:31 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR:   ABORTING: Can't lock VG for lock_stress/host-034.15783.
Oct  5 00:33:34 host-034 qarshd[16229]: Sending child 16231 signal 2
Oct  5 00:33:34 host-034 qarshd[16639]: Sending child 16640 signal 2

Comment 5 David Teigland 2020-09-09 15:09:32 UTC
At some point error -11 was changed to "VG %s lock failed: held by other host."
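
With that change the failing command at least says why it could not get the lock. If it is not obvious which host is holding it (for example, a command left sitting at a confirmation prompt as in the original report), standard lvmlockd/sanlock inspection commands along these lines can be run on each node; they are not part of the fix, just a way to find the holder:

[root@harding-03 ~]# lvmlockctl --info        # lock state known to the local lvmlockd
[root@harding-03 ~]# sanlock client status    # sanlock leases currently held by this host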