Bug 1531465 - HA LVM duplicate activation

Product:          Red Hat Enterprise Linux 7
Component:        resource-agents
Version:          7.4
Status:           CLOSED DUPLICATE
Severity:         high
Priority:         unspecified
Reporter:         Christoph <c.handel>
Assignee:         Oyvind Albrigtsen <oalbrigt>
QA Contact:       cluster-qe <cluster-qe>
CC:               agk, c.handel, cluster-maint, fdinitto, jruemker
Target Milestone: rc
Hardware:         Unspecified
OS:               Unspecified
Type:             Bug
Last Closed:      2018-01-08 14:43:32 UTC

Description Christoph 2018-01-05 08:23:35 UTC
---++ Description of problem:

We have an HA LVM volume group with clvmd locking (not tagged). Every time a node
joins the cluster, the volume group cluster resource has to run a recovery, which
restarts all dependent resources.


---++ Version-Release number of selected component (if applicable):

resource-agents-3.9.5-105.el7_4.3.x86_64


---++ How reproducible:

always


---++ Steps to Reproduce:
1. set up a cluster

2. add dlm and clvmd resources

      pcs resource create dlm ocf:pacemaker:controld clone on-fail=fence interleave=true ordered=true
      pcs resource create clvmd ocf:heartbeat:clvm clone on-fail=fence interleave=true ordered=true
      pcs constraint order start dlm-clone then clvmd-clone

3. add an HA LVM resource

      vgcreate -Ay -cy vg_data /dev/mapper/mpath-data
      pcs resource create vg_data ocf:heartbeat:LVM exclusive=yes volgrpname="vg_data"
      pcs constraint order start clvmd-clone then vg_data
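
(The log excerpt in step 5 also shows a filesystem resource, fs_data, layered on
vg_data. It is not part of the steps listed here, but a definition along the
following lines is presumably in place; the device path, mount point and fstype
below are placeholders, not taken from this report.)

      pcs resource create fs_data ocf:heartbeat:Filesystem \
          device="/dev/vg_data/lv_data" directory="/data" fstype="xfs"
      pcs constraint order start vg_data then fs_data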

4. reboot a node; the cluster status reports

   * vg_data_monitor_0 on wsl007 'unknown error' (1): call=49, status=complete,
     exitreason='LVM Volume vg_data is not available',

5. check the log files; resources are restarted on nodeA

   notice: Initiating monitor operation vg_data_monitor_0 on nodeB
   warning: Action 15 (vg_data_monitor_0) on nodeB failed (target: 7 vs. rc: 1): Error
   warning: Processing failed op monitor for vg_data on nodeB: unknown error (1)
   error: Resource vg_data (ocf::LVM) is active on 2 nodes attempting recovery
   notice:  * Start      clvmd:1          ( nodeB )
   notice:  * Recover    vg_data          ( nodeA )
   notice:  * Restart    fs_data          ( nodeA )   due to required vg_data start

---++ Actual results:

Node B joins the cluster and the state of all resources on Node B is discovered
(probed). clvmd is not yet running, so none of the clustered volume groups are
available. For this reason the LVM resource agent returns from its status check
(LVM_status, lines 348-349) with:

                ocf_exit_reason "LVM Volume $1 is not available"
                return $OCF_ERR_GENERIC

This is not the result Pacemaker expects from a probe on a node where the resource
is not running (it expects OCF_NOT_RUNNING). Pacemaker therefore concludes the
resource is active on two nodes and starts a recovery.
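
For reference, the OCF monitor return codes involved here (the "target: 7 vs. rc: 1"
log line in step 5 shows exactly this mismatch):

      # 0  OCF_SUCCESS      resource is running on this node
      # 7  OCF_NOT_RUNNING  resource is cleanly stopped here (the result Pacemaker
      #                     expects from a probe on a node that does not run it)
      # 1  OCF_ERR_GENERIC  generic failure; Pacemaker treats the resource as
      #                     failed-but-active on this node, hence "active on 2 nodes"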

All dependent resources are stopped and the volume group is stopped on all nodes.
The volume group is then started on one node, and all dependent resources are
started again.


---++ Expected results:

The agent should detect that the volume group is not available on the joining node
and report it as not running (OCF_NOT_RUNNING), so that no recovery is triggered.
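
A minimal sketch of that behaviour (this is not the shipped agent code; it assumes
the usual resource-agents shell helpers ocf_is_probe and ocf_exit_reason and the
OCF_* return codes from ocf-shellfuncs, and LVM_status_sketch is a hypothetical name):

      # Hypothetical variation on the status check: during a probe on a node where
      # the clustered VG has no active LVs (e.g. clvmd not started yet), report
      # "not running" instead of a generic error.
      LVM_status_sketch() {
          local vg="$1"
          local active
          # lv_attr column: the 5th character is 'a' when the LV is active.
          active=$(lvs --noheadings -o lv_attr "$vg" 2>/dev/null \
                   | awk 'substr($1,5,1) == "a"' | wc -l)
          if [ "$active" -eq 0 ]; then
              if ocf_is_probe; then
                  return $OCF_NOT_RUNNING
              fi
              ocf_exit_reason "LVM Volume $vg is not available"
              return $OCF_ERR_GENERIC
          fi
          return $OCF_SUCCESS
      }

Removing all LVM commands from the monitor action, as upstream has done (see the
additional info below), is another way to get the same probe behaviour.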


---++ Additional info:

See BZ#1454699, which introduced a patch adding vgscan to the resource agent.
Upstream has since removed all LVM commands from the monitor action. See also
BZ#1507013, which reports another issue with this patch.

We use clvmd because we also have clusters using GFS2, where clvmd/dlm are required,
and it is simpler to run a similar setup for HA-LVM.

We have systems running an NFS server on the HA-LVM volume; each recovery triggers an
NFS lock recovery grace period and a 90 second block.

We also have Java services running on the HA-LVM volume with startup times of several minutes.

Comment 2 John Ruemker 2018-01-08 14:43:32 UTC
Hello,
Thank you for reporting this.  This issue is being investigated and a solution pursued in Bug #1486888, so I am marking this a duplicate.  

That bug is Red Hat-internal, but if you need assistance, or would like ongoing updates about the state of this investigation, please feel free to open a case with Red Hat Support at https://access.redhat.com and we can help you there. 

Thanks!
John Ruemker
Principal Software Maintenance Engineer

*** This bug has been marked as a duplicate of bug 1486888 ***

Comment 3 Christoph 2018-01-08 14:45:42 UTC
The bug is private. Can I get access (or a subscription) to bug #1486888?