Bug 444608

Summary: vgsplit failure due to locking errors
Product: Red Hat Enterprise Linux 4
Component: lvm2
Version: 4.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Corey Marthaler <cmarthal>
Assignee: Milan Broz <mbroz>
QA Contact: Corey Marthaler <cmarthal>
CC: agk, ccaulfie, dwysocha, edamato, heinzm, jbrassow, mbroz, prockai, pvrabec
Target Milestone: rc
Keywords: Regression
Hardware: All
OS: Linux
Fixed In Version: RHBA-2008-0776
Doc Type: Bug Fix
Last Closed: 2008-07-24 20:08:10 UTC
Bug Depends On: 450474

Description Corey Marthaler 2008-04-29 14:58:45 UTC
Description of problem:
There appears to be a timing-related regression in vgsplit: the same command
sometimes succeeds and sometimes fails.

[root@grant-03 lvm]# vgchange -an linear_8_1953
  0 logical volume(s) in volume group "linear_8_1953" now active

[root@grant-03 lvm]# lvs -a -o +devices
  LV             VG            Attr   LSize  Origin Snap%  Move Log Copy% 
Convert Devices
  LogVol00       VolGroup00    -wi-ao 72.34G                                   
   /dev/sda2(0)
  LogVol01       VolGroup00    -wi-ao  1.94G                                   
   /dev/sda2(2315)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdd1(0)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdd2(0)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdd3(0)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdd4(0)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdb4(0)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdb1(0)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdb2(0)
  linear_8_19530 linear_8_1953 -wim--  1.44T                                   
   /dev/sdb3(0)

[root@grant-03 lvm]# vgsplit linear_8_1953 split_777 /dev/sdb1 /dev/sdb2
/dev/sdb3 /dev/sdb4 /dev/sdd1 /dev/sdd2 /dev/sdd3 /dev/sdd4
  Error locking on node grant-03: Volume group for uuid not found:
BWegvBBwJjJTw8hoTqsqPHMaZafd6Ua3Mt0yjPbKHYrOGmlAgTAZzx4ycepvowTd
  Logical volume "linear_8_19530" must be inactive

[ *after about 5 minutes* ]

[root@grant-03 lvm]# vgsplit linear_8_1953 split_777 /dev/sdb1 /dev/sdb2
/dev/sdb3 /dev/sdb4 /dev/sdd1 /dev/sdd2 /dev/sdd3 /dev/sdd4
  New volume group "split_777" successfully split from "linear_8_1953"

# it fails right away when attempting it again.

[root@grant-03 lvm]# vgsplit split_777 linear_8_1953 /dev/sdb1 /dev/sdb2
/dev/sdb3 /dev/sdb4 /dev/sdd1 /dev/sdd2 /dev/sdd3 /dev/sdd4
  Error locking on node grant-03: Volume group for uuid not found:
2l0KXkdTQMeosoEBYSc65rW1qpctLRTAMt0yjPbKHYrOGmlAgTAZzx4ycepvowTd
  Logical volume "linear_8_19530" must be inactive

# after a vgscan, it works:

[root@grant-03 lvm]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "split_777" using metadata type lvm2
  Found volume group "VolGroup00" using metadata type lvm2
  Device '/dev/sda2' has been left open.
  Device '/dev/sda2' has been left open.
[root@grant-03 lvm]# vgsplit split_777 linear_8_1953 /dev/sdb1 /dev/sdb2
/dev/sdb3 /dev/sdb4 /dev/sdd1 /dev/sdd2 /dev/sdd3 /dev/sdd4
  New volume group "linear_8_1953" successfully split from "split_777"


Version-Release number of selected component (if applicable):
2.6.9-68.26.ELsmp
lvm2-2.02.35-1.el4
lvm2-cluster-2.02.35-1.el4

Comment 1 Corey Marthaler 2008-04-30 22:19:28 UTC
This bug exists in the new 4.7 rpms as well.

[lvm_cluster_config] VOLUME SPLIT split_708 back into linear_6_3960 on grant-01
[lvm_cluster_config]   Error locking on node grant-01: Volume group for uuid not
found: jWXAJg36Jvf4EbQuhgWMzxbzRFxjiUmUH05ERTuAC0fG3I9CsTWhAeGRlXctcDUm
[lvm_cluster_config]   Logical volume "linear_6_39600" must be inactive
[lvm_cluster_config] vgsplit failed:
[lvm_cluster_config] qarsh root@grant-01 vgsplit split_708 linear_6_3960
/dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb5 /dev/sdb6 /dev/sdb7

lvm2-2.02.36-1.el4
lvm2-cluster-2.02.36-1.el4



Comment 2 Dave Wysochanski 2008-05-01 13:43:13 UTC
[root@grant-03 ~]# rpm -q cman cman-kernel dlm dlm-kernel lvm2 lvm2-cluster
device-mapper ccs rgmanager
cman-1.0.23-1
cman-kernel-2.6.9-55.4
dlm-1.0.7-1
dlm-kernel-2.6.9-53.3
lvm2-2.02.36-1.el4
lvm2-cluster-2.02.36-1.el4
device-mapper-1.02.25-1.el4
ccs-1.0.12-1
rgmanager-1.9.76-1

Comment 3 Dave Wysochanski 2008-05-01 16:51:23 UTC
I'm working on reproducing this but so far no luck.  I have a 3-node xen cluster
with the above RPMs installed.  I have a volume group comprised of 5 multipath
devices (iscsi).

The failure message "Volume group for uuid not found" comes from the following
snippet in lv_from_lvid():
        if (!(vg = _vg_read_by_vgid(cmd, (char *)lvid->id[0].uuid, precommitted))) {
                log_error("Volume group for uuid not found: %s", lvid_s);
                return NULL;
        }



Comment 4 Dave Wysochanski 2008-05-01 17:00:35 UTC
Interestingly, I just saw the failure message, but with a different sequence.
1) start the cluster services on all nodes.  At this point, the volume group was
already created, so I did not do a "vgcreate"
2) lvcreate of a linear LV
3) vgsplit
4) lvremove

#4 is where the message showed up.  Running the lvremove repeatedly fails.  Note
that no node has the lv active.

[root@rhel4u5-node1 ~]# lvremove vg1/lv0linear
  Error locking on node rhel4u5-node1: Volume group for uuid not found:
gt6JWrWklmJMDNSovETTSjABgCaiXGGXbQyTXsetoNQow0gH3dLkwYQDQCpc54uc
Logical volume "lv0linear" is active on other cluster nodes.  Really remove? [y/n]: 

Just as in Corey's case, a vgscan solves the problem.


Comment 5 Dave Wysochanski 2008-05-01 19:03:15 UTC
I can reproduce Corey's failure now.  "clvmd -R" fixes the problem as well,
which suggests a caching problem.

Comment 6 Dave Wysochanski 2008-05-01 22:18:20 UTC
This looks to be a more generic caching problem.  We are debugging it now. 
Another sequence that will trigger the locking error is a vgcreate followed by
an lvcreate.  The lvcreate fails with a similar locking error.
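To illustrate the suspected failure mode, here is a minimal sketch (not the actual LVM source; the `VgCache` class and uuid values are hypothetical) of how a metadata cache keyed by VG uuid can return "Volume group for uuid not found" after a vgsplit changes on-disk metadata, until a rescan refreshes it:

```python
# Hypothetical sketch: a stale uuid-keyed metadata cache, as a stand-in
# for lvmcache.  Not LVM source; names and uuids are made up.

class VgCache:
    """Toy cache mapping VG uuid -> VG name."""
    def __init__(self):
        self._by_uuid = {}

    def scan(self, on_disk):
        # Analogous to vgscan / clvmd -R: rebuild the cache from disk state.
        self._by_uuid = dict(on_disk)

    def lookup(self, uuid):
        # None models the "Volume group for uuid not found" error path.
        return self._by_uuid.get(uuid)


disk = {"uuid-A": "linear_8_1953"}
cache = VgCache()
cache.scan(disk)

# vgsplit rewrites on-disk metadata: extents move under a new VG uuid.
disk = {"uuid-B": "split_777"}

# A node still holding the old cache cannot resolve the new uuid...
assert cache.lookup("uuid-B") is None

# ...until a rescan (vgscan / clvmd -R) refreshes the cache.
cache.scan(disk)
assert cache.lookup("uuid-B") == "split_777"
```

This matches the observed behavior: the command fails with a uuid lookup error until something (vgscan, clvmd -R, or enough elapsed time) repopulates the cache.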


Comment 15 Milan Broz 2008-06-09 20:08:51 UTC
Fixed in lvm2-2.02.37-1.el4

Comment 17 Corey Marthaler 2008-06-10 20:39:51 UTC
There still appear to be issues when running vgsplit on cluster mirrors; I'll
mark this bz verified and open another one for that issue.

Comment 19 errata-xmlrpc 2008-07-24 20:08:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0776.html