Bug 1379793 - [RFE] better error when unable to grab VG lock on one node
Summary: [RFE] better error when unable to grab VG lock on one node
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-27 16:44 UTC by Corey Marthaler
Modified: 2021-09-03 12:40 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-09 15:09:32 UTC
Target Upstream Version:



Description Corey Marthaler 2016-09-27 16:44:05 UTC
Description of problem:
This can happen when attempting to act on many volumes within the VG from both nodes: one node is able to complete the operations, while the other fails with 'error -11'. An easier way to trigger it currently is to leave a confirmation prompt waiting on one node.


[root@harding-03 ~]#  lvremove vg/`hostname`-2
Do you really want to remove active logical volume vg/harding-03.lab.msp.redhat.com-2? [y/n]: 

# While prompt is waiting with held lock...

[root@harding-02 ~]#  lvremove vg/`hostname`-2
  Retrying ex lock on VG vg
  Retrying ex lock on VG vg
  Retrying ex lock on VG vg
  VG vg lock failed: error -11
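
For reference, the "-11" here is almost certainly the raw Linux errno 11 (EAGAIN, "Resource temporarily unavailable") bubbling up while the lock is still held by the other node, which is what this RFE asks to have reported in plainer terms. A quick, hypothetical way to translate the number on any RHEL 7 box (using the stock python 2 interpreter):

# Translate the raw return code to its errno name and description
python -c 'import errno, os; print errno.errorcode[11], os.strerror(11)'
# EAGAIN Resource temporarily unavailable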


Version-Release number of selected component (if applicable):
3.10.0-510.el7.x86_64

lvm2-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
lvm2-libs-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
lvm2-cluster-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-libs-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-event-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-event-libs-1.02.134-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016
sanlock-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7    BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-4.el7    BUILT: Thu Sep 22 01:47:19 CDT 2016

Comment 2 Corey Marthaler 2017-09-29 14:10:22 UTC
This is much easier to reproduce now that we've added more metadata-change-intensive testing for 7.5. This test keeps a continuous stream of metadata operation commands running on different nodes in the cluster simultaneously, so it's not likely a "real world" type scenario.


Sep 28 14:22:29 host-040 qarshd[21990]: Running cmdline: lvconvert --type mirror -m 1 lock_stress/host-040.16366

Up converting linear to a 1 redundant legged mirror on host-040
  Retrying ex lock on VG lock_stress
  Retrying ex lock on VG lock_stress
  Retrying ex lock on VG lock_stress
  VG lock_stress lock failed: error -11
couldn't up convert cmirror on host-040


3.10.0-693.2.1.el7.x86_64
lvm2-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-libs-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-cluster-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017
cmirror-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
sanlock-3.5.0-1.el7    BUILT: Wed Apr 26 09:37:30 CDT 2017
sanlock-lib-3.5.0-1.el7    BUILT: Wed Apr 26 09:37:30 CDT 2017
lvm2-lockd-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017

Comment 3 Corey Marthaler 2017-10-05 16:37:30 UTC
Some additional lvmpolld messages during the failure...

Creating a 1 redundant legged cmirror named host-034.15783
  Logical volume "host-034.15783" created.

Down converting cmirror from 1 legs to linear on host-034
  Retrying ex lock on VG lock_stress
  Logical volume lock_stress/host-034.15783 converted.

Up converting linear to a 3 redundant legged cmirror on host-034
  Retrying ex lock on VG lock_stress
  Logical volume lock_stress/host-034.15783 being converted.
  lock_stress/host-034.15783: Converted: 64.00%
  lock_stress/host-034.15783: Converted: 100.00%
  lock_stress/host-034.15783: Converted: 100.00%
  lock_stress/host-034.15783: Converted: 100.00%
  For more information see lvmpolld messages in syslog or lvmpolld log file.
  LVM command executed by lvmpolld failed.
couldn't up convert cmirror on host-034


Oct  5 00:32:49 host-034 qarshd[16606]: Running cmdline: lvs lock_stress/host-034.15763 --noheadings -o copy_percent | awk {'print $1'}
Oct  5 00:32:50 host-034 qarshd[16607]: Running cmdline: lvs lock_stress/host-034.15765 --noheadings -o copy_percent | awk {'print $1'}
Oct  5 00:32:55 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:01 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:07 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:13 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:19 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:24 host-034 systemd: Started qarsh Per-Connection Server (10.15.80.218:48300).
Oct  5 00:33:24 host-034 systemd: Starting qarsh Per-Connection Server (10.15.80.218:48300)...
Oct  5 00:33:24 host-034 qarshd[16639]: Talking to peer ::ffff:10.15.80.218:48300 (IPv6)
Oct  5 00:33:24 host-034 qarshd[16639]: Running cmdline: lvconvert --type mirror -m 3 --corelog lock_stress/host-034.15758
Oct  5 00:33:24 host-034 lvmpolld: W: #011LVPOLL: PID 16592: STDERR: '  Retrying ex lock on VG lock_stress'
Oct  5 00:33:31 host-034 lvmpolld[16309]: LVMPOLLD: lvm2 cmd (PID 16516) failed (retcode: 5)
Oct  5 00:33:31 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR: '  VG lock_stress lock failed: error -11'
Oct  5 00:33:31 host-034 lvmpolld: W: #011LVPOLL: PID 16516: STDERR:   ABORTING: Can't lock VG for lock_stress/host-034.15783.
Oct  5 00:33:34 host-034 qarshd[16229]: Sending child 16231 signal 2
Oct  5 00:33:34 host-034 qarshd[16639]: Sending child 16640 signal 2
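
For triage, the lvmpolld messages referenced by "see lvmpolld messages in syslog" can be pulled from the affected node with something like the following (standard RHEL 7 syslog location; the journal unit name is an assumption and may differ by build):

# Pull the lvmpolld lines out of syslog on the affected node
grep lvmpolld /var/log/messages
# Or via the journal, assuming the unit is named lvm2-lvmpolld.service
journalctl -u lvm2-lvmpolld.service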

Comment 5 David Teigland 2020-09-09 15:09:32 UTC
At some point error -11 was changed to "VG %s lock failed: held by other host."
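
A rough way to check whether a given build already carries the friendlier message (assuming the string quoted above landed verbatim in the binary, and that lvm is installed at its usual path):

# Does this build contain the newer message text?
strings /usr/sbin/lvm | grep 'lock failed: held by other host'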

