Bug 1658664

Summary: LVM-Activate should not require locking_type
Product: Red Hat Enterprise Linux 8 Reporter: Nate Straz <nstraz>
Component: resource-agents    Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED CURRENTRELEASE QA Contact: cluster-qe <cluster-qe>
Severity: high Docs Contact:
Priority: unspecified    
Version: 8.0    CC: abeekhof, agk, cfeist, cluster-maint, fdinitto, gfs2-maint, oalbrigt, rhandlin, teigland
Target Milestone: rc    Keywords: Regression
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: resource-agents-4.1.1-17.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-14 01:22:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Test Logs w/ pe-input
none
Logs from dash-03 w/ resource-agents-4.1.1-16.el8.x86_64 none

Description Nate Straz 2018-12-12 16:07:24 UTC
Description of problem:

When a node fails with a GFS2 Filesystem resource, we expect it to be mounted after the node reboots and rejoins the cluster.  This did not happen, instead pacemaker tried to stop the GFS2 Filesystem resource, killing the load in the process.


Version-Release number of selected component (if applicable):
kernel-4.18.0-50.el8.x86_64
pacemaker-2.0.0-11.el8.x86_64
corosync-2.99.4-2.el8.x86_64
gfs2-utils-3.2.0-3.el8.x86_64
resource-agents-4.1.1-15.el8.x86_64

How reproducible:
Easily

Steps to Reproduce:
1. Run file system recovery test (revolver)
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Nate Straz 2018-12-13 15:27:55 UTC
Created attachment 1514093 [details]
Test Logs w/ pe-input

Comment 2 Ken Gaillot 2018-12-19 00:35:56 UTC
When dash-01 rejoined, the ocf:heartbeat:LVM-activate agent returned OCF_ERR_CONFIGURED for the probes of lv_dash[012] on dash-01.

This is a problem that comes up repeatedly: the OCF standard, resource agent developers, and pacemaker each interpret the OCF return codes differently. Long-term, the OCF standard needs to be updated to be more explicit about these interpretations, and both resource agents and pacemaker need to be brought into conformance with it. Unfortunately that is not on the immediate horizon.

Currently, pacemaker interprets OCF_ERR_CONFIGURED as meaning the resource's configuration *in pacemaker* is invalid, and thus it is a fatal error that renders the resource unable to run anywhere. This is why the resource was stopped on all nodes.

I would guess the LVM-activate agent does not use OCF_ERR_CONFIGURED for that purpose, but rather to indicate that something in the local node's environment is not correct. Pacemaker would expect OCF_ERR_ARGS or OCF_ERR_INSTALLED for that case, which would just prevent the resource from running on that one node.

A second, related issue is covered by Bug 1506372. The question there centers around how probe errors should be reported. Currently, pacemaker considers anything but OCF_NOT_RUNNING to be a failure, whereas there is a plausible argument that OCF_ERR_ARGS or OCF_ERR_INSTALLED should be accepted as equivalent to not running in the probe case. The downside is there's no way to distinguish "ARGS/INSTALLED is really OK in this case" from "ARGS/INSTALLED is a real error that the admin should know about". The most common way around that is for individual resource agents to check whether a monitor action is a probe, and return OCF_NOT_RUNNING for such cases so pacemaker treats it correctly.
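
For illustration, that workaround looks roughly like the following in an OCF shell agent's monitor path. This is a minimal sketch, not the actual LVM-activate code: ocf_is_probe, ocf_exit_reason and the OCF_* exit codes come from the standard ocf-shellfuncs library shipped with resource-agents, while the pgrep check simply stands in for whatever validation the real agent performs.

#!/bin/sh
: ${OCF_FUNCTIONS_DIR:=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

monitor() {
        if ! pgrep -x lvmlockd >/dev/null 2>&1; then
                # During a probe, a missing prerequisite just means
                # "not running here (yet)" rather than a hard error.
                if ocf_is_probe; then
                        return $OCF_NOT_RUNNING
                fi
                ocf_exit_reason "lvmlockd daemon is not running!"
                return $OCF_ERR_GENERIC
        fi
        # ...the real status check of the LV would go here...
        return $OCF_SUCCESS
}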

A third issue is that it is plausible that probes of dependent resources should be delayed until the depended-on resource is active. In this case, LVM-activate shouldn't be probed until the locking group is active. However pacemaker currently doesn't order probes in this way, and doing so would be nontrivial to say the least. So again, for the time being, it's up to the resource agent to return OCF_NOT_RUNNING in situations where this is expected.

Since Bug 1506372 already covers the part of this that's directly related to pacemaker, I'm reassigning this to resource-agents so it can be investigated whether the LVM-activate agent should behave differently for probes.

Comment 11 David Teigland 2019-01-23 16:15:30 UTC
The locking_type setting is no longer used by lvm in rhel8, so you shouldn't use it.  If you do set it, lvm will print a warning about it.

Comment 16 Nate Straz 2019-01-23 20:48:49 UTC
(In reply to Ken Gaillot from comment #2)
> A third issue is that it is plausible that probes of dependent resources
> should be delayed until the depended-on resource is active. In this case,
> LVM-activate shouldn't be probed until the locking group is active. However
> pacemaker currently doesn't order probes in this way, and doing so would be
> nontrivial to say the least. So again, for the time being, it's up to the
> resource agent to return OCF_NOT_RUNNING in situations where this is
> expected.

This is still an issue.  You can see in these logs that the probe actions are happening while lvmlockd is trying to start.


Jan 23 14:24:22 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: State transition S_PENDING -> S_NOT_DC                                                                       
Jan 23 14:24:22 dash-03.lab.msp.redhat.com pacemaker-attrd[2972]:  notice: Updating all attributes after cib_refresh_notify event                                                          
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for dash-01-drac on dash-03: 7 (not running)                                       
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for dash-02-drac on dash-03: 7 (not running)                                       
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for dash-03-drac on dash-03: 7 (not running)                                       
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for dlm on dash-03: 7 (not running)                                                
Jan 23 14:24:25 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[3296]: INFO: lvmlockd not running                                                                                            
Jan 23 14:24:25 dash-03.lab.msp.redhat.com Filesystem(gfs2_dash1)[3298]: WARNING: Couldn't find device [/dev/dash/dash1]. Expected /dev/??? to exist                                       
Jan 23 14:24:25 dash-03.lab.msp.redhat.com Filesystem(gfs2_dash2)[3295]: WARNING: Couldn't find device [/dev/dash/dash2]. Expected /dev/??? to exist                                       
Jan 23 14:24:25 dash-03.lab.msp.redhat.com Filesystem(gfs2_dash0)[3297]: WARNING: Couldn't find device [/dev/dash/dash0]. Expected /dev/??? to exist                                       
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for lvmlockd on dash-03: 7 (not running)                                           
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for gfs2_dash2 on dash-03: 7 (not running)                                         
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for gfs2_dash1 on dash-03: 7 (not running)                                         
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for gfs2_dash0 on dash-03: 7 (not running)                                         
Jan 23 14:24:25 dash-03.lab.msp.redhat.com kernel: DLM installed
Jan 23 14:24:26 dash-03.lab.msp.redhat.com LVM-activate(lv_dash1)[3519]: ERROR: lvmlockd daemon is not running!                                                                            
Jan 23 14:24:26 dash-03.lab.msp.redhat.com LVM-activate(lv_dash0)[3520]: ERROR: lvmlockd daemon is not running!                                                                            
Jan 23 14:24:26 dash-03.lab.msp.redhat.com LVM-activate(lv_dash2)[3521]: ERROR: lvmlockd daemon is not running!                                                                            
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-execd[2971]:  notice: lv_dash1_monitor_0:3230:stderr [ ocf-exit-reason:lvmlockd daemon is not running! ]                              
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-execd[2971]:  notice: lv_dash2_monitor_0:3233:stderr [ ocf-exit-reason:lvmlockd daemon is not running! ]                              
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-execd[2971]:  notice: lv_dash0_monitor_0:3191:stderr [ ocf-exit-reason:lvmlockd daemon is not running! ]                              
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for lv_dash1 on dash-03: 6 (not configured)                                        
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: dash-03-lv_dash1_monitor_0:38 [ ocf-exit-reason:lvmlockd daemon is not running!\n ]                          
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for lv_dash2 on dash-03: 6 (not configured)                                        
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: dash-03-lv_dash2_monitor_0:48 [ ocf-exit-reason:lvmlockd daemon is not running!\n ]                          
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of probe operation for lv_dash0 on dash-03: 6 (not configured)                                        
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: dash-03-lv_dash0_monitor_0:28 [ ocf-exit-reason:lvmlockd daemon is not running!\n ]                          
Jan 23 14:24:26 dash-03.lab.msp.redhat.com dlm_controld[3528]: 56 dlm_controld 4.0.6 started
Jan 23 14:24:28 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of start operation for dlm on dash-03: 0 (ok)                                                         
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of start operation for dash-03-drac on dash-03: 0 (ok)                                                
Jan 23 14:24:30 dash-03.lab.msp.redhat.com LVM-activate(lv_dash0)[3998]: INFO: dash/dash0: has already been deactivated.                                                                   
Jan 23 14:24:30 dash-03.lab.msp.redhat.com LVM-activate(lv_dash1)[4001]: INFO: dash/dash1: has already been deactivated.                                                                   
Jan 23 14:24:30 dash-03.lab.msp.redhat.com LVM-activate(lv_dash2)[4009]: INFO: dash/dash2: has already been deactivated.                                                                   
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of stop operation for lv_dash0 on dash-03: 0 (ok)                                                     
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of stop operation for lv_dash1 on dash-03: 0 (ok)                                                     
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of stop operation for lv_dash2 on dash-03: 0 (ok)                                                     
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[4088]: INFO: checking if DLM is started first...                                                                             
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[4121]: INFO: starting cmirrord...                                                                                            
Jan 23 14:24:30 dash-03.lab.msp.redhat.com cmirrord[4197]: Starting cmirrord:
Jan 23 14:24:30 dash-03.lab.msp.redhat.com cmirrord[4197]:  Built: Jan  4 2019 09:49:19
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[4205]: INFO: starting lvmlockd...                                                                                            
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd[4251]: 1548275070 lvmlockd started
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]:  notice: Result of start operation for lvmlockd on dash-03: 0 (ok)
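
For reference, start/stop ordering between the locking infrastructure and the LVs is normally expressed with constraints along these lines (a sketch only; the clone names for the dash cluster are assumptions based on the resources visible in the logs). The point is that such constraints only order start and stop actions; the initial probes (monitor_0) still run as soon as the node rejoins, which is why the agent itself has to tolerate being probed before lvmlockd is up.

# Assumed layout: a "locking" group (dlm + lvmlockd) cloned across the
# cluster, plus cloned groups holding each LVM-activate + Filesystem pair.
pcs constraint order start locking-clone then lv_dash0-clone
pcs constraint colocation add lv_dash0-clone with locking-clone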

Comment 20 David Teigland 2019-01-24 15:08:00 UTC
For lvm 2.02 (in rhel7) lvmlockd requires locking_type=1; for lvm 2.03 (in rhel8) lvm does not use locking_type, but if you set it, it won't break.

I don't know how resource agents are supposed to handle different versions of the program they are paired with.  (For rhel8 we could also remove all the clvm parts of the agent since clvm doesn't exist in rhel8.)
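
One way an agent could gate this on the installed LVM is sketched below. It relies on the assumption that lvmconfig no longer knows the global/locking_type node once lvm2 2.03 removed the setting (compare the "Configuration node ... not found" output in comment 37); whether the fix that actually shipped uses this exact check is not confirmed here, and check_locking_type_is_1 is a hypothetical helper.

# Only validate/require locking_type when the installed LVM still defines it
# (lvm2 2.02 / rhel7); on lvm2 2.03 (rhel8) the node is gone, so skip the check.
if lvmconfig --type default global/locking_type >/dev/null 2>&1; then
        # 2.02-era LVM: lvmlockd still wants locking_type=1 in lvm.conf
        check_locking_type_is_1    # hypothetical helper
fi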

Comment 21 Chris Feist 2019-01-24 15:59:45 UTC
We'll just make the RHEL 8 version of the LVM-activate resource agent not require locking_type, and the RHEL 7 version will continue to require it.  Would that make sense?

Comment 22 David Teigland 2019-01-24 16:03:27 UTC
> We'll just make the RHEL 8 version of the LVM-activate resource agent not
> require locking_type and the RHEL 7 will continue to require it.  Would that
> make sense?

yes

Comment 23 Nate Straz 2019-01-25 14:31:48 UTC
Created attachment 1523528 [details]
Logs from dash-03 w/ resource-agents-4.1.1-16.el8.x86_64

I have resource-agents-4.1.1-16.el8.x86_64 installed now and I've made a little more progress.  I've had to update the test case to look for different messages.  Some of the file systems are now mounting, but there is still a race between lvmlockd starting the VG lockspaces and LVM-activate activating the logical volumes.

Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR:  VG dash lock skipped: lock start in progress Cannot access VG dash due to failed lock.
Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR: dash/dash2: failed to activate.
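
For what it's worth, the failure can be worked around from outside the agent by retrying activation until the lockspace start completes; a throwaway sketch using the VG/LV names from the logs above (the real fix belongs in lvm2/resource-agents, not in a wrapper like this):

# Retry while lvmlockd is still starting the "dash" VG lockspace.
vg=dash lv=dash2
for try in 1 2 3 4 5 6 7 8 9 10; do
        if lvchange -aey "$vg/$lv"; then
                break
        fi
        sleep 2   # "lock start in progress" normally clears within seconds
done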

Comment 29 David Teigland 2019-01-25 15:31:01 UTC
> Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR:  VG dash lock skipped:
> lock start in progress Cannot access VG dash due to failed lock.
> Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR: dash/dash2: failed to
> activate.

That sounds like the issue we just fixed in the past week in response to this github report
https://github.com/ClusterLabs/resource-agents/issues/1287

we would need a bz (this one?) to pull this fix into 8.0
https://sourceware.org/git/?p=lvm2.git;a=commit;h=e158835a05963c8cb4166100c9d01e7ff3991c82

Comment 31 Oyvind Albrigtsen 2019-01-25 15:37:51 UTC
Nate: can you manually apply that patch and see if it solves the issue, or should I make you a scratch-build for testing?

Comment 32 Nate Straz 2019-01-25 15:43:23 UTC
It's better to have a scratch build for tracking purposes.

Comment 34 David Teigland 2019-01-25 16:03:29 UTC
I created bug 1669550 for the vgchange --lockstart change.

Comment 35 Oyvind Albrigtsen 2019-02-04 16:06:47 UTC
https://github.com/ClusterLabs/resource-agents/pull/1299

Comment 37 Nate Straz 2019-02-12 21:16:30 UTC
Verified with resource-agents-4.1.1-17.el8.x86_64

[root@host-017 ~]# pcs status --full
Cluster name: STSRHTS17827
Stack: corosync
Current DC: host-016 (1) (version 2.0.1-4.el8-0eb7991564) - partition with quorum
Last updated: Tue Feb 12 15:14:45 2019
Last change: Tue Feb 12 15:11:18 2019 by root via cibadmin on host-016

3 nodes configured
15 resources configured

Online: [ host-016 (1) host-017 (2) host-018 (3) ]

Full list of resources:

 fence-host-016 (stonith:fence_xvm):    Started host-016
 fence-host-017 (stonith:fence_xvm):    Started host-017
 fence-host-018 (stonith:fence_xvm):    Started host-018
 Clone Set: locking-clone [locking]
     Resource Group: locking:0
         dlm    (ocf::pacemaker:controld):      Started host-017
         lvmlockd       (ocf::heartbeat:lvmlockd):      Started host-017
     Resource Group: locking:1
         dlm    (ocf::pacemaker:controld):      Started host-018
         lvmlockd       (ocf::heartbeat:lvmlockd):      Started host-018
     Resource Group: locking:2
         dlm    (ocf::pacemaker:controld):      Started host-016
         lvmlockd       (ocf::heartbeat:lvmlockd):      Started host-016
     Started: [ host-016 host-017 host-018 ]
 Clone Set: STSRHTS178270-clone [STSRHTS178270]
     Resource Group: STSRHTS178270:0
         lv_STSRHTS178270       (ocf::heartbeat:LVM-activate):  Started host-017
         gfs2_STSRHTS178270     (ocf::heartbeat:Filesystem):    Started host-017
     Resource Group: STSRHTS178270:1
         lv_STSRHTS178270       (ocf::heartbeat:LVM-activate):  Started host-018
         gfs2_STSRHTS178270     (ocf::heartbeat:Filesystem):    Started host-018
     Resource Group: STSRHTS178270:2
         lv_STSRHTS178270       (ocf::heartbeat:LVM-activate):  Started host-016
         gfs2_STSRHTS178270     (ocf::heartbeat:Filesystem):    Started host-016
     Started: [ host-016 host-017 host-018 ]

...
[root@host-017 ~]# lvmconfig global/locking_type
  Configuration node global/locking_type not found
[root@host-017 ~]# rpm -q lvm2 resource-agents
lvm2-2.03.02-4.el8.x86_64
resource-agents-4.1.1-17.el8.x86_64