Bug 1377342

Summary: clvmd fails to stop after VG is extended
Product: Red Hat Enterprise Linux 6
Reporter: Josef Zimek <pzimek>
Component: lvm2
lvm2 sub component: Clustering / clvmd (RHEL6)
Assignee: Peter Rajnoha <prajnoha>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
CC: agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac
Version: 6.6
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2016-09-22 11:08:36 UTC

Attachments:
  blk-availability debug (flags: none)
  blk-availability debug node 05 (flags: none)

Description Josef Zimek 2016-09-19 13:14:28 UTC
Created attachment 1202480 [details]
blk-availability debug

Description of problem:


In a 2-node cluster with clvmd, after adding a new LUN to the server, re-scanning SCSI, creating a PV and extending an existing VG onto the newly added PV, the reboot fails: the node fails to stop clvmd, which leaves the cluster stack running while other daemons are stopping (including the network service). This results in corosync trying to send multicast messages despite the fact that the network is already down, and the node ends up being fenced instead of rebooting gracefully.
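
For reference, the trigger sequence might look roughly like this (a sketch only; the SCSI host number, multipath device name and VG name are assumptions, not values from the customer's environment):

  # rescan the SCSI bus so the newly presented LUN shows up (host number is an assumption)
  echo "- - -" > /sys/class/scsi_host/host0/scan

  # create a PV on the new multipath device and extend the existing VG onto it
  pvcreate /dev/mapper/mpath_new
  vgextend vg /dev/mapper/mpath_new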


Why does clvmd fail to stop after extending the VG? All subsequent reboots (after the 1st failed reboot) work as expected. 

Observations from testing:

* If the order of the init scripts is changed so that K75blk-availability runs before K76clvmd, the reboot works fine even after extending the VG (a way to inspect this ordering is sketched after these observations).

* When the cluster is stopped manually (service <rgmanager, gfs2, clvmd, cman> stop) before rebooting, the reboot works fine even after extending the VG.
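
For reference, the shutdown ordering of these init scripts can be checked like this (a minimal sketch; runlevel 6 is assumed for reboot):

  # list the relevant kill scripts in shutdown order (runlevel 6 = reboot)
  ls -1 /etc/rc6.d/ | grep -E 'blk-availability|clvmd'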


So it looks like extending the VG creates conditions that then affect the behaviour of K75blk-availability.




Version-Release number of selected component (if applicable):

RHEL 6.6
lvm2-2.02.111-2.el6.x86_64   
lvm2-cluster-2.02.111-2.el6.x86_64               



How reproducible:
Always in the customer's PROD cluster.
In the customer's test cluster, which runs the same package versions as the PROD cluster, this behaviour is not reproducible.


Steps to Reproduce:
1) extend vg
2) service blk-availability stop
3) vgs
4) sleep 180
5) vgs
6) service clvmd stop [FAILED]
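
Put together as a shell session, the reproduction looks roughly like this (a sketch; the VG name "vg" and the device name are assumptions based on the example output in comment 4):

  vgextend vg /dev/mapper/mpath_new   # 1) extend the VG onto the newly added PV
  service blk-availability stop       # 2) deactivates the VG and the mpath devices
  vgs                                 # 3) may still report the VG from stale cached metadata
  sleep 180                           # 4)
  vgs                                 # 5) cache refreshed, VG no longer reported
  service clvmd stop                  # 6) [FAILED] - Volume group "vg" not found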


Actual results:
server is fenced during graceful reboot because the cluster fails to stop

Expected results:
the cluster stops without issues and the server reboots gracefully


Additional info:

Attaching debug output from a manual run of the blk-availability script

Comment 1 Josef Zimek 2016-09-19 13:29:28 UTC
Created attachment 1202482 [details]
blk-availability debug node 05

Comment 4 Peter Rajnoha 2016-09-20 09:30:52 UTC
I've managed to reproduce this. The problem here is that LVM doesn't have an up-to-date view of which VGs are available after PVs are removed from the system.

[root@rhel6-b ~]# rpm -q lvm2
lvm2-2.02.111-2.el6.x86_64

[root@rhel6-b ~]# lvm dumpconfig --type diff
global {
	locking_type=3
}
devices {
	preferred_names=["^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d"]
	filter=["a|/dev/mapper|", "r|.*|"]
}

[root@rhel6-b ~]# lsblk -s
NAME                    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
mpath_dev1 (dm-2)       253:2    0    4G  0 mpath 
|-sda                     8:0    0    4G  0 disk  
`-sdd                     8:48   0    4G  0 disk  
vg-lvol0 (dm-4)         253:4    0    4M  0 lvm   
`-mpath_dev2 (dm-3)     253:3    0    4G  0 mpath 
  |-sdb                   8:16   0    4G  0 disk  
  `-sdc                   8:32   0    4G  0 disk  

[root@rhel6-b ~]# blkdeactivate -u -l wholevg
Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done

[root@rhel6-b ~]# vgs
  Couldn't find device with uuid 4ybKz5-93UJ-kNFk-a3iS-NH3j-gLf4-aJxNpf.
  VG   #PV #LV #SN Attr   VSize VFree
  vg     2   1   0 wz-pnc 7.99g 7.99g

[root@rhel6-b ~]# vgs   
  No volume groups found


That also means:

[root@rhel6-b ~]# service blk-availability stop
Stopping block device availability: Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done
                                                           [  OK  ]
[root@rhel6-b ~]# service clvmd stop
Deactivating clustered VG(s):   Volume group "vg" not found
  Skipping volume group vg
                                                           [FAILED]

The "vgs" (or vgdisplay) command is used within the clvmd init script to collect all clustered VGs to deactivate, and this list of VGs is then passed to the vgchange -an command. However, the VGs are already deactivated and the underlying PVs are gone too.
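
As an illustration of that pattern (this is a sketch, not the literal init script code), the clustered VGs can be collected via the trailing "c" in the VG attribute field and then passed to vgchange:

  # collect the names of clustered VGs (trailing "c" in vg_attr, e.g. "wz-pnc") ...
  clustered_vgs=$(vgs --noheadings -o vg_name,vg_attr 2>/dev/null | awk '$2 ~ /c$/ {print $1}')

  # ... and deactivate them; this fails if the VG metadata can no longer be found
  vgchange -an $clustered_vgs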

As visible in the example above, this is some form of caching issue: the first "vgs" still displays the VG (while it shouldn't), whereas the second "vgs" no longer sees it (because the cache has been updated by then).

========

We've fixed several caching issues in z-stream releases; the latest one is lvm2-2.02.111-2.el6_6.6, and with this build the problem is already fixed:

[root@rhel6-b ~]# rpm -q lvm2
lvm2-2.02.111-2.el6_6.6.x86_64

[root@rhel6-b ~]# blkdeactivate -u -l wholevg
Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done

[root@rhel6-b ~]# vgs
  No volume groups found


That also means:

[root@rhel6-b ~]# service blk-availability stop
Stopping block device availability: Deactivating block devices:
  [DM]: deactivating mpath device mpath_dev1 (dm-2)... done
  [LVM]: deactivating Volume Group vg... done
  [DM]: deactivating mpath device mpath_dev2 (dm-3)... done
                                                           [  OK  ]
[root@rhel6-b ~]# service clvmd stop
Signaling clvmd to exit                                    [  OK  ]
clvmd terminated                                           [  OK  ]


Please update to the latest 6.6.z lvm2 release (lvm2-2.02.111-2.el6_6.6) and let me know if this resolves the issue. If yes, we'll close this bug as CURRENTRELEASE.
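
For reference, the update and a quick check of the installed version might look like this (assuming the 6.6.z update repositories are available):

  yum update lvm2 lvm2-cluster
  rpm -q lvm2 lvm2-cluster    # expect lvm2-2.02.111-2.el6_6.6 or later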

Comment 5 Peter Rajnoha 2016-09-20 09:41:22 UTC
(In reply to Peter Rajnoha from comment #4)
> Please update to the latest 6.6.z lvm2 release (lvm2-2.02.111-2.el6_6.6)
> and let me know if this resolves the issue.

(In the RHEL 6.7 release, the fix is also available via the z-stream package lvm2-2.02.118-3.el6_7.3 and higher; in RHEL 6.8 it's already resolved in the main release.)