From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050719 Red Hat/1.0.6-1.4.1 Firefox/1.0.6

Description of problem:
CLVM was set up on one of the nodes, and when another node came up it was getting a connect() failed error. The only way to resolve it was to set the locking type to 1. See the output from a run of lvcreate on twedldum:

Processing: lvcreate -vvv -s --permission r -n var_snapshot -L 200M /dev/vg0/var
O_DIRECT will be used
Setting global/locking_type to 2
Setting global/locking_library to /lib/liblvm2clusterlock.so
Opening shared locking library /lib/liblvm2clusterlock.so
Loaded external locking library /lib/liblvm2clusterlock.so
External locking enabled.
Setting chunksize to 16 sectors.
Getting target version for snapshot
dm version O
dm versions O
Getting target version for snapshot-origin
dm versions O
Locking V_vg0 at 0x4
Finding volume group "vg0"
Opened /dev/md0 RW
/dev/md0: block size is 1024 bytes
/dev/md0: No label detected
Closed /dev/md0
Opened /dev/etherd/e0.0 RW
/dev/etherd/e0.0: block size is 4096 bytes
/dev/etherd/e0.0: lvm2 label detected
Closed /dev/etherd/e0.0
lvmcache: /dev/etherd/e0.0 now orphaned
Opened /dev/etherd/e0.0 RW
/dev/etherd/e0.0: block size is 4096 bytes
Closed /dev/etherd/e0.0
lvmcache: /dev/etherd/e0.0 now in VG alice
Opened /dev/md1 RW
/dev/md1: block size is 4096 bytes
/dev/md1: lvm2 label detected
Closed /dev/md1
lvmcache: /dev/md1 now orphaned
Opened /dev/md1 RW
/dev/md1: block size is 4096 bytes
Closed /dev/md1
lvmcache: /dev/md1 now in VG vg0
Opened /dev/md1 RW
/dev/md1: block size is 4096 bytes
/dev/md1: lvm2 label detected
Closed /dev/md1
Opened /dev/md1 RW
/dev/md1: block size is 4096 bytes
Closed /dev/md1
Opened /dev/md1 RW
/dev/md1: block size is 4096 bytes
/dev/md1: lvm2 label detected
Read vg0 metadata (41) from /dev/md1 at 108544 size 2712
Closed /dev/md1
Rounding up size to full physical extent 224.00 MB
Creating logical volume var_snapshot
Allowing allocation on /dev/md1 start PE 1200 length 21
Archiving volume group "vg0" metadata.
Opened /dev/md1 RW
/dev/md1: block size is 4096 bytes
Writing vg0 metadata to /dev/md1 at 111616 len 2951
Creating volume group backup "/etc/lvm/backup/vg0"
Writing vg0 metadata to /etc/lvm/backup/.lvm_twedldum.yewess.us_24881_499525191
Committing vg0 metadata (42)
Renaming /etc/lvm/backup/vg0.tmp to /etc/lvm/backup/vg0
Committing vg0 metadata (42) to /dev/md1 header at 2048
Closed /dev/md1
Locking zxDALJyxHmoZQ6qxuho4QMfvZuqU9GbuqAh32luObinFHZg1Cm2LbPjx0rtGep2X at 0x19
Error locking on node twedldee.yewess.us: Internal lvm error, check syslog
Aborting. Failed to activate snapshot exception store. Remove new LV and retry.
Locking V_vg0 at 0x6

However, on twedldum, vgdisplay -vvv shows:

vg0 UUID:              zxDALJ-yxHm-oZQ6-qxuh-o4QM-fvZu-qU9Gbu
vg0/var_snapshot UUID: qAh32l-uObi-nFHZ-g1Cm-2LbP-jx0r-tGep2X

LVM2 is trying to lock /vg0/var_snapshot on twedldee (the other node)! This is impossible because it is a local disk. Twedldee gets the UUID, says "WTF is that", and returns a locking failure. We don't see the error because it would be generated on the other node.

What's happening on clvmd startup is that locking is failing for local filesystems, but clvmd only "sees" the "lock failure" state. I think this is generating the connect failure messages somehow. It is probably assuming that the other node holds a lock on the non-clustered VG, freaks out, and reports it as a connect failure.
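As a sanity check, the resource name in the failing "Locking ... at 0x19" request above is just the vg0 UUID and the var_snapshot UUID run together with the hyphens stripped; a quick shell one-liner using the two UUID strings from the vgdisplay output confirms it:

# Concatenate the VG and LV UUIDs and drop hyphens/spaces; the result
# matches the lock resource clvmd asked twedldee to take.
echo -n "zxDALJ-yxHm-oZQ6-qxuh-o4QM-fvZu-qU9Gbu" "qAh32l-uObi-nFHZ-g1Cm-2LbP-jx0r-tGep2X" | tr -d '- '
# -> zxDALJyxHmoZQ6qxuho4QMfvZuqU9GbuqAh32luObinFHZg1Cm2LbPjx0rtGep2X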
vgchange -ay then goes into a deadlock, waiting for the other node to release a lock that the first node thinks it is holding. This will never happen because clvmd on the other node has no knowledge of the first node's local filesystem UUIDs. Therefore, clvmd startup ends up deadlocked on a local VG that doesn't need locking in the first place!

LVM2 logic should be added/corrected so that it ignores cluster locking when addressing non-clustered volumes. Clustered volume groups carry a special "clustered" status flag whereas local volume groups do not. Perhaps this state can be used in some way?

Version-Release number of selected component (if applicable):
lvm2-cluster-2.0.1.09-3.0

How reproducible:
Always

Steps to Reproduce:
1. Set up CLVM on one node
2. Bring up another node

Actual Results:
The other node can't see the CLVM stuff and cannot start CLVM.

Expected Results:
The other node shouldn't get these errors.

Additional info:
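For reference, the "set the locking type to 1" workaround from the description amounts to switching the affected node back to local, file-based locking in /etc/lvm/lvm.conf. This is only a sketch of the relevant global-section setting (and it disables cluster locking on that node entirely, so it is a workaround, not a fix):

# /etc/lvm/lvm.conf, global section (workaround, not a fix)
global {
    # 2 = external locking library (clvmd via /lib/liblvm2clusterlock.so,
    #     as seen in the trace above); 1 = local file-based locking
    locking_type = 1
}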
Created attachment 117684 [details]
straces of clvmd and vgchange in the init script while the box is starting up
Did you mark the VG non-clustered?

vgchange -cn vg0
Not AFAIK; the only clustered VG is alice on e0.0, vg0 is the local VG. The output of vgdisplay doesn't indicate vg0 is clustered, only alice. Though I just noticed that alice does not show as "Shared" - could that be relevant to this issue?

[root@twedldum ~]# vgdisplay
  --- Volume group ---
  VG Name               vg0
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  55
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                8
  Open LV               5
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               38.16 GB
  PE Size               32.00 MB
  Total PE              1221
  Alloc PE / Size       1200 / 37.50 GB
  Free  PE / Size       21 / 672.00 MB
  VG UUID               zxDALJ-yxHm-oZQ6-qxuh-o4QM-fvZu-qU9Gbu

  --- Volume group ---
  VG Name               alice
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  46
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                5
  Open LV               4
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               33.73 GB
  PE Size               8.00 MB
  Total PE              4318
  Alloc PE / Size       921 / 7.20 GB
  Free  PE / Size       3397 / 26.54 GB
  VG UUID               YmkiNE-ATUE-PHRV-vj5l-h4Ci-bnok-X1Ln7d
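The clustered flag can also be cross-checked with vgs (assuming the vgs reporting commands from this LVM2 release expose a vg_attr column); a 'c' in the last position of the attribute string marks a clustered VG, so alice should show it and vg0 should not:

# List each VG with its attribute string; the final character is the
# clustered bit ('c' = clustered, '-' = local). Expected output, roughly:
vgs -o vg_name,vg_attr
#   VG    Attr
#   alice wz--nc
#   vg0   wz--n-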
[root@twedldum ~]# vgchange -cn /dev/vg0
  /dev/cdrom: open failed: Read-only file system
  Volume group "vg0" is already not clustered
[root@twedldum ~]# vgchange -cy /dev/alice
  /dev/cdrom: open failed: Read-only file system
  Volume group "alice" is already clustered
A fix for this is in CVS, but I'm not sure when it will arrive in a package.