Created attachment 1202480 [details]
blk-availability debug

Description of problem:
In a 2-node cluster with clvmd, after adding a new LUN to the server, re-scanning SCSI, creating a PV and extending the existing VG onto the newly added PV, reboot fails - the node fails to stop clvmd, which leaves the cluster stack running while other daemons are stopping (including the network service). This results in corosync trying to send multicast messages despite the fact that the network is already down, and ends with the node being fenced instead of gracefully rebooting. Why does clvmd fail to stop after extending the VG? All subsequent reboots (after the 1st failed reboot) work as expected.

Observations from testing:
* If the order of the init scripts is changed so that K75blk-availability runs before K76clvmd, then reboot works fine even after extending the VG.
* When the cluster is stopped manually (service <rgmanager, gfs2, clvmd, cman> stop) before rebooting, reboot works fine even after extending the VG.

So it looks like extending the VG creates some condition which then affects the behaviour of K75blk-availability.

Version-Release number of selected component (if applicable):
RHEL 6.6
lvm2-2.02.111-2.el6.x86_64
lvm2-cluster-2.02.111-2.el6.x86_64

How reproducible:
Always - in the customer's PROD cluster. In the customer's test cluster, which runs the same package versions as the PROD cluster, this behaviour is not reproducible.

Steps to Reproduce:
1) extend vg
2) service blk-availability stop
3) vgs
4) sleep 180
5) vgs
6) service clvmd stop  [FAILED]

Actual results:
Server is fenced during a graceful reboot because the cluster stack fails to stop.

Expected results:
Cluster stops without issues and the server reboots gracefully.

Additional info:
Attaching debug output of a manual run of the blk-availability script.
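For reference, a rough sketch of the "extend vg" preparation step described above; the device name /dev/mapper/mpath_newlun, the SCSI host number and the VG name clustervg are placeholders, not taken from the customer's setup:

  # rescan the SCSI host(s) so the newly presented LUN shows up
  echo "- - -" > /sys/class/scsi_host/host0/scan
  # create a PV on the new multipath device and extend the clustered VG onto it
  pvcreate /dev/mapper/mpath_newlun
  vgextend clustervg /dev/mapper/mpath_newlun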
Created attachment 1202482 [details]
blk-availability debug node 05
I've managed to reproduce. The problem here is that LVM doesn't have an up-to-date view of the VGs which are available after PVs are removed from the system.

[root@rhel6-b ~]# rpm -q lvm2
lvm2-2.02.111-2.el6.x86_64

[root@rhel6-b ~]# lvm dumpconfig --type diff
global {
    locking_type=3
}
devices {
    preferred_names=["^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d"]
    filter=["a|/dev/mapper|", "r|.*|"]
}

[root@rhel6-b ~]# lsblk -s
NAME                  MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
mpath_dev1 (dm-2)     253:2    0   4G  0 mpath
|-sda                   8:0    0   4G  0 disk
`-sdd                   8:48   0   4G  0 disk
vg-lvol0 (dm-4)       253:4    0   4M  0 lvm
`-mpath_dev2 (dm-3)   253:3    0   4G  0 mpath
  |-sdb                 8:16   0   4G  0 disk
  `-sdc                 8:32   0   4G  0 disk

[root@rhel6-b ~]# blkdeactivate -u -l wholevg
Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done

[root@rhel6-b ~]# vgs
  Couldn't find device with uuid 4ybKz5-93UJ-kNFk-a3iS-NH3j-gLf4-aJxNpf.
  VG   #PV #LV #SN Attr   VSize VFree
  vg     2   1   0 wz-pnc 7.99g 7.99g

[root@rhel6-b ~]# vgs
  No volume groups found

That also means:

[root@rhel6-b ~]# service blk-availability stop
Stopping block device availability: Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done
                                                           [  OK  ]
[root@rhel6-b ~]# service clvmd stop
Deactivating clustered VG(s):   Volume group "vg" not found
  Skipping volume group vg
                                                           [FAILED]

The "vgs" (or vgdisplay) is used within the clvmd init script to collect all clustered VGs to deactivate, and it then passes this list of VGs to the "vgchange -an" command. However, the VGs are already deactivated and the underlying PVs are gone too.

As visible from the example above, this is some form of caching issue, because the first "vgs" still displays the VG (while it shouldn't), whereas the second "vgs" doesn't see it anymore (because the cache has been updated by then).

========

We've fixed several caching issues in z-stream versions - the last one is lvm2-2.02.111-2.el6_6.6 - and with this build the problem is fixed already:

[root@rhel6-b ~]# rpm -q lvm2
lvm2-2.02.111-2.el6_6.6.x86_64

[root@rhel6-b ~]# blkdeactivate -u -l wholevg
Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done

[root@rhel6-b ~]# vgs
  No volume groups found

That also means:

[root@rhel6-b ~]# service blk-availability stop
Stopping block device availability: Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done
                                                           [  OK  ]
[root@rhel6-b ~]# service clvmd stop
Signaling clvmd to exit                                    [  OK  ]
clvmd terminated                                           [  OK  ]

Please update to the latest 6.6.z lvm2 release (lvm2-2.02.111-2.el6_6.6) and let me know if this resolves the issue. If yes, we'll close this bug as CURRENTRELEASE.
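For illustration, a rough sketch of the deactivation logic described above; this is not the verbatim RHEL 6 clvmd init script, just an approximation of the mechanism, and the clustered_vgs function name is made up here:

  #!/bin/sh
  # Collect the names of clustered VGs as reported by vgdisplay (illustrative only;
  # the real init script uses vgs/vgdisplay in a similar way).
  clustered_vgs() {
      vgdisplay 2>/dev/null | awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1}'
  }
  # Try to deactivate them. If the list above comes from a stale metadata cache,
  # the VG may already be gone, vgchange reports "Volume group not found", and
  # the init script ends up in the [FAILED] state shown above.
  vgs_to_stop=$(clustered_vgs)
  if [ -n "$vgs_to_stop" ]; then
      vgchange -an $vgs_to_stop
  fi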
(In reply to Peter Rajnoha from comment #4)
> Please, update to latest 6.6.z lvm2 release (lvm2-2.02.111-2.el6_6.6) and
> let me know if this resolves the issue.

(In the RHEL 6.7 release, this is also a package released via z-stream, as lvm2-2.02.118-3.el6_7.3 and higher; in RHEL 6.8 it's already resolved within the main release.)
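For completeness, one way to pull in and verify the fixed build on a 6.6.z system; the yum invocation is a generic assumption, while the package names and the expected NVR come from the comments above:

  yum update lvm2 lvm2-cluster
  rpm -q lvm2 lvm2-cluster
  # expected on 6.6.z: lvm2-2.02.111-2.el6_6.6.x86_64 (or later)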