Description of problem:
This can happen when attempting to act on many volumes within the VG from both nodes: one node is able to complete its task, the other fails with 'error -11'. An easier way to cause this currently is to leave a prompt waiting while it holds the lock.

[root@harding-03 ~]# lvremove vg/`hostname`-2
Do you really want to remove active logical volume vg/harding-03.lab.msp.redhat.com-2? [y/n]:

# While the prompt is waiting with the lock held...

[root@harding-02 ~]# lvremove vg/`hostname`-2
  Retrying ex lock on VG vg
  Retrying ex lock on VG vg
  Retrying ex lock on VG vg
  VG vg lock failed: error -11

Version-Release number of selected component (if applicable):
3.10.0-510.el7.x86_64

lvm2-2.02.165-4.el7                        BUILT: Thu Sep 22 01:47:19 CDT 2016
lvm2-libs-2.02.165-4.el7                   BUILT: Thu Sep 22 01:47:19 CDT 2016
lvm2-cluster-2.02.165-4.el7                BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-1.02.134-4.el7               BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-libs-1.02.134-4.el7          BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-event-1.02.134-4.el7         BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-event-libs-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7  BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-4.el7                     BUILT: Thu Sep 22 01:47:19 CDT 2016
sanlock-3.4.0-1.el7                        BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7                    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-4.el7                  BUILT: Thu Sep 22 01:47:19 CDT 2016
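For illustration, the two-terminal race above can be scripted roughly as follows. This is only a sketch: the node names, VG name, and LV names (lv_a, lv_b) are placeholders, and it assumes passwordless ssh plus two active LVs in the shared VG that are safe to remove.

#!/bin/bash
# Sketch of the reproducer described above (placeholder names throughout).
VG=vg
NODE_A=harding-03
NODE_B=harding-02

# On node A, start an lvremove that blocks at the y/n confirmation prompt.
# Piping from 'sleep' keeps stdin open without supplying an answer, so the
# command sits at the prompt while it holds the exclusive VG lock.
ssh "$NODE_A" "sleep 120 | lvremove $VG/lv_a" &

sleep 5    # give node A time to take the lock and reach the prompt

# On node B, any metadata operation on the same VG now has to wait for the
# lock; after a few retries it gives up with 'error -11'.
ssh "$NODE_B" "lvremove -y $VG/lv_b"

# Node A's lvremove aborts at the prompt once the sleep ends and stdin closes.
wait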
This is much easier to reproduce now that we've added more metadata-change-intensive testing for 7.5. The test keeps a continuous stream of metadata operation commands running on different nodes in the cluster simultaneously, so it's not likely a "real world" type of scenario.

Sep 28 14:22:29 host-040 qarshd[21990]: Running cmdline: lvconvert --type mirror -m 1 lock_stress/host-040.16366

Up converting linear to a 1 redundant legged mirror on host-040
  Retrying ex lock on VG lock_stress
  Retrying ex lock on VG lock_stress
  Retrying ex lock on VG lock_stress
  VG lock_stress lock failed: error -11
couldn't up convert cmirror on host-040

3.10.0-693.2.1.el7.x86_64

lvm2-2.02.171-8.el7                              BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-libs-2.02.171-8.el7                         BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-cluster-2.02.171-8.el7                      BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-1.02.140-8.el7                     BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-libs-1.02.140-8.el7                BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-1.02.140-8.el7               BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-libs-1.02.140-8.el7          BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7  BUILT: Mon Mar 27 10:15:46 CDT 2017
cmirror-2.02.171-8.el7                           BUILT: Wed Jun 28 13:28:58 CDT 2017
sanlock-3.5.0-1.el7                              BUILT: Wed Apr 26 09:37:30 CDT 2017
sanlock-lib-3.5.0-1.el7                          BUILT: Wed Apr 26 09:37:30 CDT 2017
lvm2-lockd-2.02.171-8.el7                        BUILT: Wed Jun 28 13:28:58 CDT 2017
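The test harness itself isn't attached here, but the load is roughly equivalent to running a loop like the one below on every node at the same time. This is only a sketch of that kind of workload, not the actual test: the VG name, LV size, and naming scheme are placeholders.

#!/bin/bash
# Continuous per-node metadata churn against a shared VG (sketch only).
VG=lock_stress
LV="$(hostname).$$"

while true; do
    # create a 1 redundant legged mirror
    lvcreate -y --type mirror -m 1 -n "$LV" -L 100M "$VG" || echo "couldn't create mirror on $(hostname)"
    # down convert to linear
    lvconvert -y -m 0 "$VG/$LV" || echo "couldn't down convert on $(hostname)"
    # up convert back to a 3 redundant legged mirror
    lvconvert -y --type mirror -m 3 "$VG/$LV" || echo "couldn't up convert on $(hostname)"
    # remove and start over
    lvremove -y "$VG/$LV" || echo "couldn't remove on $(hostname)"
done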
Some additional lvmpolld messages during the failure...

Creating a 1 redundant legged cmirror named host-034.15783
  Logical volume "host-034.15783" created.
Down converting cmirror from 1 legs to linear on host-034
  Retrying ex lock on VG lock_stress
  Logical volume lock_stress/host-034.15783 converted.
Up converting linear to a 3 redundant legged cmirror on host-034
  Retrying ex lock on VG lock_stress
  Logical volume lock_stress/host-034.15783 being converted.
  lock_stress/host-034.15783: Converted: 64.00%
  lock_stress/host-034.15783: Converted: 100.00%
  lock_stress/host-034.15783: Converted: 100.00%
  lock_stress/host-034.15783: Converted: 100.00%
  For more information see lvmpolld messages in syslog or lvmpolld log file.
  LVM command executed by lvmpolld failed.
couldn't up convert cmirror on host-034

Oct 5 00:32:49 host-034 qarshd[16606]: Running cmdline: lvs lock_stress/host-034.15763 --noheadings -o copy_percent | awk {'print $1'}
Oct 5 00:32:50 host-034 qarshd[16607]: Running cmdline: lvs lock_stress/host-034.15765 --noheadings -o copy_percent | awk {'print $1'}
Oct 5 00:32:55 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: ' Retrying ex lock on VG lock_stress'
Oct 5 00:33:01 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: ' Retrying ex lock on VG lock_stress'
Oct 5 00:33:07 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: ' Retrying ex lock on VG lock_stress'
Oct 5 00:33:13 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: ' Retrying ex lock on VG lock_stress'
Oct 5 00:33:19 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: ' Retrying ex lock on VG lock_stress'
Oct 5 00:33:24 host-034 systemd: Started qarsh Per-Connection Server (10.15.80.218:48300).
Oct 5 00:33:24 host-034 systemd: Starting qarsh Per-Connection Server (10.15.80.218:48300)...
Oct 5 00:33:24 host-034 qarshd[16639]: Talking to peer ::ffff:10.15.80.218:48300 (IPv6)
Oct 5 00:33:24 host-034 qarshd[16639]: Running cmdline: lvconvert --type mirror -m 3 --corelog lock_stress/host-034.15758
Oct 5 00:33:24 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: ' Retrying ex lock on VG lock_stress'
Oct 5 00:33:31 host-034 lvmpolld[16309]: LVMPOLLD: lvm2 cmd (PID 16516) failed (retcode: 5)
Oct 5 00:33:31 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: ' VG lock_stress lock failed: error -11'
Oct 5 00:33:31 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: ABORTING: Can't lock VG for lock_stress/host-034.15783.
Oct 5 00:33:34 host-034 qarshd[16229]: Sending child 16231 signal 2
Oct 5 00:33:34 host-034 qarshd[16639]: Sending child 16640 signal 2
At some point the "error -11" text was changed, so the message now reads "VG %s lock failed: held by other host."