Bug 1531465 - HA LVM duplicate activation

Product:          Red Hat Enterprise Linux 7
Component:        resource-agents
Version:          7.4
Status:           CLOSED DUPLICATE
Severity:         high
Priority:         unspecified
Reporter:         Christoph <c.handel>
Assignee:         Oyvind Albrigtsen <oalbrigt>
QA Contact:       cluster-qe <cluster-qe>
CC:               agk, c.handel, cluster-maint, fdinitto, jruemker
Target Milestone: rc
Hardware:         Unspecified
OS:               Unspecified
Type:             Bug
Last Closed:      2018-01-08 14:43:32 UTC

Description Christoph 2018-01-05 08:23:35 UTC
---++ Description of problem:

We have an HA LVM volume group with clvmd locking (not tagged). Every time a node
joins the cluster, the volume group cluster resource has to run a recovery, which
restarts all dependent resources.


---++ Version-Release number of selected component (if applicable):

resource-agents-3.9.5-105.el7_4.3.x86_64


---++ How reproducible:

always


---++ Steps to Reproduce:
1. set up a cluster

2. add dlm and clvmd resources

      pcs resource create dlm ocf:pacemaker:controld clone on-fail=fence interleave=true ordered=true
      pcs resource create clvmd ocf:heartbeat:clvm clone on-fail=fence interleave=true ordered=true
      pcs constraint order start dlm-clone then clvmd-clone

3. add an HA LVM resource

      vgcreate -Ay -cy vg_data /dev/mapper/mpath-data
      pcs resource create vg_data ocf:heartbeat:LVM exclusive=yes volgrpname="vg_data"
      pcs constraint order start clvmd-clone then vg_data
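
(The log excerpt in step 5 also shows a filesystem resource, fs_data, layered on
vg_data. It is not part of the steps listed here, but a definition along the
following lines is presumably in place; the device path, mount point and fstype
below are placeholders, not taken from this report.)

      pcs resource create fs_data ocf:heartbeat:Filesystem \
          device="/dev/vg_data/lv_data" directory="/data" fstype="xfs"
      pcs constraint order start vg_data then fs_data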

4. reboot a node; the cluster status reports

   * vg_data_monitor_0 on wsl007 'unknown error' (1): call=49, status=complete,
     exitreason='LVM Volume vg_data is not available',

5. check the log files; resources are restarted on nodeA

   notice: Initiating monitor operation vg_data_monitor_0 on nodeB
   warning: Action 15 (vg_data_monitor_0) on nodeB failed (target: 7 vs. rc: 1): Error
   warning: Processing failed op monitor for vg_data on nodeB: unknown error (1)
   error: Resource vg_data (ocf::LVM) is active on 2 nodes attempting recovery
   notice:  * Start      clvmd:1          ( nodeB )
   notice:  * Recover    vg_data          ( nodeA )
   notice:  * Restart    fs_data          ( nodeA )   due to required vg_data start

---++ Actual results:

Node B joins the cluster and the state of all resources on Node B is discovered
(probed). clvmd is not yet running, so none of the clustered volume groups are
available. For this reason the LVM resource agent returns from its status check
(LVM_status, lines 348-349) with:

                ocf_exit_reason "LVM Volume $1 is not available"
                return $OCF_ERR_GENERIC

This is not the result Pacemaker expects from a probe on a node where the resource
is not running (it expects OCF_NOT_RUNNING). Pacemaker therefore concludes the
resource is active on two nodes and starts a recovery.
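
For reference, the OCF monitor return codes involved here (the "target: 7 vs. rc: 1"
log line in step 5 shows exactly this mismatch):

      # 0  OCF_SUCCESS      resource is running on this node
      # 7  OCF_NOT_RUNNING  resource is cleanly stopped here (the result Pacemaker
      #                     expects from a probe on a node that does not run it)
      # 1  OCF_ERR_GENERIC  generic failure; Pacemaker treats the resource as
      #                     failed-but-active on this node, hence "active on 2 nodes"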

All dependent resources are stopped and the volume group is stopped on all nodes.
The volume group is then started on one node, and all dependent resources are
started again.


---++ Expected results:

The agent should detect that the volume group is not available on the joining node
and report it as not running (OCF_NOT_RUNNING), so that no recovery is triggered.
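
A minimal sketch of that behaviour (this is not the shipped agent code; it assumes
the usual resource-agents shell helpers ocf_is_probe and ocf_exit_reason and the
OCF_* return codes from ocf-shellfuncs, and LVM_status_sketch is a hypothetical name):

      # Hypothetical variation on the status check: during a probe on a node where
      # the clustered VG has no active LVs (e.g. clvmd not started yet), report
      # "not running" instead of a generic error.
      LVM_status_sketch() {
          local vg="$1"
          local active
          # lv_attr column: the 5th character is 'a' when the LV is active.
          active=$(lvs --noheadings -o lv_attr "$vg" 2>/dev/null \
                   | awk 'substr($1,5,1) == "a"' | wc -l)
          if [ "$active" -eq 0 ]; then
              if ocf_is_probe; then
                  return $OCF_NOT_RUNNING
              fi
              ocf_exit_reason "LVM Volume $vg is not available"
              return $OCF_ERR_GENERIC
          fi
          return $OCF_SUCCESS
      }

Removing all LVM commands from the monitor action, as upstream has done (see the
additional info below), is another way to get the same probe behaviour.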


---++ Additional info:

See BZ#1454699, which introduced a patch adding vgscan to the resource agent.
Upstream has since removed all LVM commands from the monitor action. See also
BZ#1507013, which reports another issue with this patch.

We use clvmd because we also have clusters using GFS2, where clvmd/dlm are required,
and it is simpler to run a similar setup for HA-LVM.

We have systems running an NFS server on the HA-LVM volume; each recovery triggers an
NFS lock recovery grace period and a 90 second block.

We also have Java services running on the HA-LVM volume with startup times of several minutes.

Comment 2 John Ruemker 2018-01-08 14:43:32 UTC
Hello,
Thank you for reporting this.  This issue is being investigated and a solution pursued in Bug #1486888, so I am marking this a duplicate.  

That bug is Red Hat-internal, but if you need assistance, or would like ongoing updates about the state of this investigation, please feel free to open a case with Red Hat Support at https://access.redhat.com and we can help you there. 

Thanks!
John Ruemker
Principal Software Maintenance Engineer

*** This bug has been marked as a duplicate of bug 1486888 ***

Comment 3 Christoph 2018-01-08 14:45:42 UTC
The bug is private. Can I get access (or a subscription) to bug #1486888?