| Summary: | clvmd fails to stop after VG is extended | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Josef Zimek <pzimek> |
| Component: | lvm2 | Assignee: | Peter Rajnoha <prajnoha> |
| lvm2 sub component: | Clustering / clvmd (RHEL6) | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac |
| Version: | 6.6 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-09-22 11:08:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Created attachment 1202482 [details]
blk-availability debug node 05
I've managed to reproduce this. The problem here is that LVM doesn't have an up-to-date view of which VGs are available after PVs are removed from the system.
[root@rhel6-b ~]# rpm -q lvm2
lvm2-2.02.111-2.el6.x86_64
[root@rhel6-b ~]# lvm dumpconfig --type diff
global {
locking_type=3
}
devices {
preferred_names=["^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d"]
filter=["a|/dev/mapper|", "r|.*|"]
}
[root@rhel6-b ~]# lsblk -s
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
mpath_dev1 (dm-2) 253:2 0 4G 0 mpath
|-sda 8:0 0 4G 0 disk
`-sdd 8:48 0 4G 0 disk
vg-lvol0 (dm-4) 253:4 0 4M 0 lvm
`-mpath_dev2 (dm-3) 253:3 0 4G 0 mpath
|-sdb 8:16 0 4G 0 disk
`-sdc 8:32 0 4G 0 disk
[root@rhel6-b ~]# blkdeactivate -u -l wholevg
Deactivating block devices:
[DM]: deactivating mpath device mpath_dev1 (dm-2)... done
[LVM]: deactivating Volume Group vg... done
[DM]: deactivating mpath device mpath_dev2 (dm-3)... done
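For clarity, the blkdeactivate options used here have the following meaning (see the blkdeactivate(8) man page shipped with lvm2):
# -u          unmount any mounted filesystem on a device before deactivating it
# -l wholevg  LVM-specific option: when an LV is hit, deactivate its whole VG
blkdeactivate -u -l wholevg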
[root@rhel6-b ~]# vgs
Couldn't find device with uuid 4ybKz5-93UJ-kNFk-a3iS-NH3j-gLf4-aJxNpf.
VG #PV #LV #SN Attr VSize VFree
vg 2 1 0 wz-pnc 7.99g 7.99g
[root@rhel6-b ~]# vgs
No volume groups found
That also means:
[root@rhel6-b ~]# service blk-availability stop
Stopping block device availability: Deactivating block devices:
[DM]: deactivating mpath device mpath_dev1 (dm-2)... done
[LVM]: deactivating Volume Group vg... done
[DM]: deactivating mpath device mpath_dev2 (dm-3)... done
[ OK ]
[root@rhel6-b ~]# service clvmd stop
Deactivating clustered VG(s): Volume group "vg" not found
Skipping volume group vg
[FAILED]
The "vgs" (or vgdisplay) is used within clvmd init script to collect all clustered VGs to deactivate and then it passed this list of VGs to vgchange -an command. However, the VGs are already deactivated and the underlying PVs are gone too.
As visible from the example above, this is some form of caching issue because first "vgs" still displays the VG (while it shouldn't), though the second "vgs" doesn't see it anymore (because the cache has been updated).
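For context, the deactivation step in the clvmd init script has roughly the following shape (a simplified sketch based on the description above, not the literal RHEL 6 script; the vg_attr check for the clustered flag is an assumption):
# Sketch only: collect clustered VGs (vg_attr ending in 'c'), then deactivate
# each one with "vgchange -an" as described above.
clustered_vgs=$(vgs --noheadings -o vg_name,vg_attr 2>/dev/null | awk '$2 ~ /c$/ {print $1}')
for vg in $clustered_vgs; do
    vgchange -an "$vg"
done
If the initial "vgs" returns a VG that has in fact already disappeared, the subsequent "vgchange -an" fails with "Volume group not found", which is exactly the [FAILED] state shown above.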
========
We've fixed several caching issues in z-stream releases - the last one being lvm2-2.02.111-2.el6_6.6. With this build, the problem is already fixed:
[root@rhel6-b ~]# rpm -q lvm2
lvm2-2.02.111-2.el6_6.6.x86_64
[root@rhel6-b ~]# blkdeactivate -u -l wholevg
Deactivating block devices:
[DM]: deactivating mpath device mpath_dev1 (dm-2)... done
[LVM]: deactivating Volume Group vg... done
[DM]: deactivating mpath device mpath_dev2 (dm-3)... done
[root@rhel6-b ~]# vgs
No volume groups found
That also means:
[root@rhel6-b ~]# service blk-availability stop
Stopping block device availability: Deactivating block devices:
[DM]: deactivating mpath device mpath_dev1 (dm-2)... done
[LVM]: deactivating Volume Group vg... done
[DM]: deactivating mpath device mpath_dev2 (dm-3)... done
[ OK ]
[root@rhel6-b ~]# service clvmd stop
Signaling clvmd to exit [ OK ]
clvmd terminated [ OK ]
Please update to the latest 6.6.z lvm2 release (lvm2-2.02.111-2.el6_6.6) and let me know if this resolves the issue. If yes, we'll close this bug as CURRENTRELEASE.
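For reference, the update itself is just a regular package update, e.g. (assuming the RHEL 6.6.z update repositories are enabled on the nodes):
# On each cluster node (repository availability is an assumption here):
yum update lvm2 lvm2-cluster
rpm -q lvm2 lvm2-cluster    # should report 2.02.111-2.el6_6.6 or newer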
(In reply to Peter Rajnoha from comment #4)
> Please, update to latest 6.6.z lvm2 release (lvm2-2.02.111-2.el6_6.6) and
> let me know if this resolves the issue.
(In the RHEL 6.7 release, this is also available as a z-stream package, lvm2-2.02.118-3.el6_7.3 and higher; in RHEL 6.8 it is already resolved within the main release.)
Created attachment 1202480 [details]
blk-availability debug

Description of problem:
In a 2-node cluster with clvmd, after adding a new LUN to the server, re-scanning SCSI, creating a PV and extending an existing VG onto the newly added PV, the reboot fails - the node fails to stop clvmd, which leaves the cluster stack running while other daemons (including the network service) are stopping. This results in corosync trying to send multicast messages despite the fact that the network is already down, and it ends in the node being fenced instead of rebooting gracefully. Why does clvmd fail to stop after extending the VG? All subsequent reboots (after the 1st failed reboot) work as expected.

Observations from testing:
* If the order of the init scripts is changed so that K75blk-availability runs before K76clvmd, the reboot works fine even after extending the VG.
* When the cluster is stopped manually (service <rgmanager, gfs2, clvmd, cman> stop) before rebooting, the reboot works fine even after extending the VG (see the sketch after this description).

So it looks like extending the VG creates some condition which then affects the behaviour of K75blk-availability.

Version-Release number of selected component (if applicable):
RHEL 6.6
lvm2-2.02.111-2.el6.x86_64
lvm2-cluster-2.02.111-2.el6.x86_64

How reproducible:
Always in the customer's PROD cluster. In the customer's test cluster, which runs the same package versions as the PROD cluster, this behaviour is not reproducible.

Steps to Reproduce:
1) extend vg
2) service blk-availability stop
3) vgs
4) sleep 180
5) vgs
6) service clvmd stop [FAILED]

Actual results:
The server is fenced during a graceful reboot because the cluster fails to stop.

Expected results:
The cluster stops without issues and the server reboots gracefully.

Additional info:
Attaching debug output of a manual blk-availability script run.
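For completeness, the manual stop sequence mentioned in the observations above is roughly the following (a sketch only - the exact set of services depends on what is configured on the node):
# Stop the cluster stack by hand before rebooting (sketch; adjust to the
# services actually enabled on the node):
service rgmanager stop
service gfs2 stop
service clvmd stop
service cman stop
reboot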