Bug 1658664
Summary: LVM-Activate should not require locking_type
Product: Red Hat Enterprise Linux 8
Component: resource-agents
Version: 8.0
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Nate Straz <nstraz>
Assignee: Oyvind Albrigtsen <oalbrigt>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, agk, cfeist, cluster-maint, fdinitto, gfs2-maint, oalbrigt, rhandlin, teigland
Keywords: Regression
Target Milestone: rc
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: resource-agents-4.1.1-17.el8
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-06-14 01:22:44 UTC
Description (Nate Straz, 2018-12-12 16:07:24 UTC)
Created attachment 1514093 [details]
Test Logs w/ pe-input
When dash-01 rejoined, the ocf:heartbeat:LVM-activate agent returned OCF_ERR_CONFIGURED for the probes of lv_dash[012] on dash-01.

This is a problem that comes up repeatedly: a clash between how the OCF standard, resource agent developers, and pacemaker interpret the OCF return codes. Long term, the OCF standard needs to be updated to be more explicit about these interpretations, and both resource agents and pacemaker made to conform to it. Unfortunately that is not on the immediate horizon.

Currently, pacemaker interprets OCF_ERR_CONFIGURED as meaning the resource's configuration *in pacemaker* is invalid, and thus a fatal error that renders the resource unable to run anywhere. This is why the resource was stopped on all nodes. I would guess the LVM-activate agent does not use OCF_ERR_CONFIGURED for that purpose, but rather to indicate that something in the local node's environment is not correct. Pacemaker would expect OCF_ERR_ARGS or OCF_ERR_INSTALLED for that case, which would just prevent the resource from running on that one node.

A second, related issue is covered by Bug 1506372. The question there centers on how probe errors should be reported. Currently, pacemaker considers anything but OCF_NOT_RUNNING to be a failure, whereas there is a plausible argument that OCF_ERR_ARGS or OCF_ERR_INSTALLED should be accepted as equivalent to not running in the probe case. The downside is that there's no way to distinguish "ARGS/INSTALLED is really OK in this case" from "ARGS/INSTALLED is a real error that the admin should know about". The most common way around that is for individual resource agents to check whether a monitor action is a probe, and return OCF_NOT_RUNNING in such cases so pacemaker treats it correctly.

A third issue is that it is plausible that probes of dependent resources should be delayed until the depended-on resource is active. In this case, LVM-activate shouldn't be probed until the locking group is active.
However, pacemaker currently doesn't order probes in this way, and doing so would be nontrivial to say the least. So again, for the time being, it's up to the resource agent to return OCF_NOT_RUNNING in situations where this is expected.

Since Bug 1606372 already covers the part of this that's directly related to pacemaker, I'm reassigning this to resource-agents so it can be investigated whether the LVM-activate agent should behave differently for probes.

The locking_type setting is no longer used by lvm in rhel8, so you shouldn't use it. If you do set it, lvm will print a warning about it.

(In reply to Ken Gaillot from comment #2)
> A third issue is that it is plausible that probes of dependent resources
> should be delayed until the depended-on resource is active. In this case,
> LVM-activate shouldn't be probed until the locking group is active. However
> pacemaker currently doesn't order probes in this way, and doing so would be
> nontrivial to say the least. So again, for the time being, it's up to the
> resource agent to return OCF_NOT_RUNNING in situations where this is
> expected.

This is still an issue. You can see in these logs that the probe actions are happening while lvmlockd is trying to start.
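The workaround discussed above (return OCF_NOT_RUNNING when a monitor is a probe) can be sketched as follows. This is an illustrative approximation, not the actual LVM-activate code; real agents use the `ocf_is_probe` helper from ocf-shellfuncs, which the inline `is_probe` check below stands in for.

```shell
#!/bin/sh
# Sketch of probe-aware monitoring for an OCF agent (illustrative only).

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Pacemaker schedules probes as one-time monitor actions with interval=0;
# this approximates what ocf_is_probe checks.
is_probe() {
    [ "$__OCF_ACTION" = "monitor" ] &&
        [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "0" ]
}

monitor_with_probe_handling() {
    if ! pgrep -x lvmlockd >/dev/null 2>&1; then
        if is_probe; then
            # During a probe, a missing lvmlockd just means "not active on
            # this node yet". A fatal code like OCF_ERR_CONFIGURED would
            # stop the resource cluster-wide instead.
            return $OCF_NOT_RUNNING
        fi
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```

With this shape, the probe that fires while lvmlockd is still starting reports rc 7 (not running) rather than rc 6 (not configured), which pacemaker handles gracefully.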
Jan 23 14:24:22 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: State transition S_PENDING -> S_NOT_DC
Jan 23 14:24:22 dash-03.lab.msp.redhat.com pacemaker-attrd[2972]: notice: Updating all attributes after cib_refresh_notify event
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for dash-01-drac on dash-03: 7 (not running)
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for dash-02-drac on dash-03: 7 (not running)
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for dash-03-drac on dash-03: 7 (not running)
Jan 23 14:24:24 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for dlm on dash-03: 7 (not running)
Jan 23 14:24:25 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[3296]: INFO: lvmlockd not running
Jan 23 14:24:25 dash-03.lab.msp.redhat.com Filesystem(gfs2_dash1)[3298]: WARNING: Couldn't find device [/dev/dash/dash1]. Expected /dev/??? to exist
Jan 23 14:24:25 dash-03.lab.msp.redhat.com Filesystem(gfs2_dash2)[3295]: WARNING: Couldn't find device [/dev/dash/dash2]. Expected /dev/??? to exist
Jan 23 14:24:25 dash-03.lab.msp.redhat.com Filesystem(gfs2_dash0)[3297]: WARNING: Couldn't find device [/dev/dash/dash0]. Expected /dev/??? to exist
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for lvmlockd on dash-03: 7 (not running)
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for gfs2_dash2 on dash-03: 7 (not running)
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for gfs2_dash1 on dash-03: 7 (not running)
Jan 23 14:24:25 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for gfs2_dash0 on dash-03: 7 (not running)
Jan 23 14:24:25 dash-03.lab.msp.redhat.com kernel: DLM installed
Jan 23 14:24:26 dash-03.lab.msp.redhat.com LVM-activate(lv_dash1)[3519]: ERROR: lvmlockd daemon is not running!
Jan 23 14:24:26 dash-03.lab.msp.redhat.com LVM-activate(lv_dash0)[3520]: ERROR: lvmlockd daemon is not running!
Jan 23 14:24:26 dash-03.lab.msp.redhat.com LVM-activate(lv_dash2)[3521]: ERROR: lvmlockd daemon is not running!
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-execd[2971]: notice: lv_dash1_monitor_0:3230:stderr [ ocf-exit-reason:lvmlockd daemon is not running! ]
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-execd[2971]: notice: lv_dash2_monitor_0:3233:stderr [ ocf-exit-reason:lvmlockd daemon is not running! ]
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-execd[2971]: notice: lv_dash0_monitor_0:3191:stderr [ ocf-exit-reason:lvmlockd daemon is not running! ]
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for lv_dash1 on dash-03: 6 (not configured)
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: dash-03-lv_dash1_monitor_0:38 [ ocf-exit-reason:lvmlockd daemon is not running!\n ]
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for lv_dash2 on dash-03: 6 (not configured)
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: dash-03-lv_dash2_monitor_0:48 [ ocf-exit-reason:lvmlockd daemon is not running!\n ]
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of probe operation for lv_dash0 on dash-03: 6 (not configured)
Jan 23 14:24:26 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: dash-03-lv_dash0_monitor_0:28 [ ocf-exit-reason:lvmlockd daemon is not running!\n ]
Jan 23 14:24:26 dash-03.lab.msp.redhat.com dlm_controld[3528]: 56 dlm_controld 4.0.6 started
Jan 23 14:24:28 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of start operation for dlm on dash-03: 0 (ok)
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of start operation for dash-03-drac on dash-03: 0 (ok)
Jan 23 14:24:30 dash-03.lab.msp.redhat.com LVM-activate(lv_dash0)[3998]: INFO: dash/dash0: has already been deactivated.
Jan 23 14:24:30 dash-03.lab.msp.redhat.com LVM-activate(lv_dash1)[4001]: INFO: dash/dash1: has already been deactivated.
Jan 23 14:24:30 dash-03.lab.msp.redhat.com LVM-activate(lv_dash2)[4009]: INFO: dash/dash2: has already been deactivated.
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of stop operation for lv_dash0 on dash-03: 0 (ok)
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of stop operation for lv_dash1 on dash-03: 0 (ok)
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of stop operation for lv_dash2 on dash-03: 0 (ok)
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[4088]: INFO: checking if DLM is started first...
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[4121]: INFO: starting cmirrord...
Jan 23 14:24:30 dash-03.lab.msp.redhat.com cmirrord[4197]: Starting cmirrord:
Jan 23 14:24:30 dash-03.lab.msp.redhat.com cmirrord[4197]: Built: Jan 4 2019 09:49:19
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd(lvmlockd)[4205]: INFO: starting lvmlockd...
Jan 23 14:24:30 dash-03.lab.msp.redhat.com lvmlockd[4251]: 1548275070 lvmlockd started
Jan 23 14:24:30 dash-03.lab.msp.redhat.com pacemaker-controld[2974]: notice: Result of start operation for lvmlockd on dash-03: 0 (ok)

For lvm 2.2 (in rhel7), lvmlockd requires locking_type=1. For lvm 2.3 (in rhel8), lvm does not use locking_type, but if you set it, it won't break. I don't know how resource agents are supposed to handle different versions of the program they are paired with. (For rhel8 we could also remove all the clvm parts of the agent, since clvm doesn't exist in rhel8.)

We'll just make the RHEL 8 version of the LVM-activate resource agent not require locking_type and the RHEL 7 will continue to require it. Would that make sense?

> We'll just make the RHEL 8 version of the LVM-activate resource agent not
> require locking_type and the RHEL 7 will continue to require it. Would that
> make sense?
yes
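One way to implement the version split agreed on above is to key off the LVM release rather than the distro. The sketch below is illustrative only, not the shipped agent code; the function names and the version-string parsing are assumptions.

```shell
#!/bin/sh
# Hedged sketch: enforce locking_type only on LVM 2.02 (RHEL 7) and skip
# the check on LVM 2.03+ (RHEL 8), where the setting is ignored.

# Pull "major.minor" (e.g. "2.03") out of an `lvm version` header line like:
#   LVM version:     2.03.02(2) (2018-12-18)
lvm_major_minor() {
    echo "$1" | sed -n \
        's/.*LVM version:[[:space:]]*\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Return 0 (true) when this LVM release still uses locking_type.
requires_locking_type() {
    case "$(lvm_major_minor "$1")" in
        2.02) return 0 ;;  # lvmlockd on LVM 2.02 needs locking_type=1
        *)    return 1 ;;  # LVM 2.03+ ignores locking_type entirely
    esac
}
```

In the agent, the caller would feed this `lvm version` output and only validate lvm.conf's global/locking_type when the function returns true, so one agent body covers both releases.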
Created attachment 1523528 [details]
Logs from dash-03 w/ resource-agents-4.1.1-16.el8.x86_64
I have resource-agents-4.1.1-16.el8.x86_64 installed now and I've made a little more progress. I've had to update the test case to look for different messages. Some of the file systems are now mounting, but there is still a race between lvmlockd starting lockspaces and activating the logical volumes.
Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR: VG dash lock skipped: lock start in progress Cannot access VG dash due to failed lock.
Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR: dash/dash2: failed to activate.
> Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR: VG dash lock skipped:
> lock start in progress Cannot access VG dash due to failed lock.
> Jan 25 08:25:23 dash-03 LVM-activate(lv_dash2) ERROR: dash/dash2: failed to
> activate.

That sounds like the issue we just fixed in the past week in response to this github report: https://github.com/ClusterLabs/resource-agents/issues/1287

We would need a bz (this one?) to pull this fix into 8.0: https://sourceware.org/git/?p=lvm2.git;a=commit;h=e158835a05963c8cb4166100c9d01e7ff3991c82

Nate: can you manually apply that patch and see if it solves the issue, or should I make you a scratch-build for testing?

It's better to have a scratch build for tracking purposes.

I created bug 1669550 for the vgchange --lockstart change.
Verified with resource-agents-4.1.1-17.el8.x86_64

[root@host-017 ~]# pcs status --full
Cluster name: STSRHTS17827
Stack: corosync
Current DC: host-016 (1) (version 2.0.1-4.el8-0eb7991564) - partition with quorum
Last updated: Tue Feb 12 15:14:45 2019
Last change: Tue Feb 12 15:11:18 2019 by root via cibadmin on host-016

3 nodes configured
15 resources configured

Online: [ host-016 (1) host-017 (2) host-018 (3) ]

Full list of resources:
 fence-host-016 (stonith:fence_xvm): Started host-016
 fence-host-017 (stonith:fence_xvm): Started host-017
 fence-host-018 (stonith:fence_xvm): Started host-018
 Clone Set: locking-clone [locking]
   Resource Group: locking:0
     dlm (ocf::pacemaker:controld): Started host-017
     lvmlockd (ocf::heartbeat:lvmlockd): Started host-017
   Resource Group: locking:1
     dlm (ocf::pacemaker:controld): Started host-018
     lvmlockd (ocf::heartbeat:lvmlockd): Started host-018
   Resource Group: locking:2
     dlm (ocf::pacemaker:controld): Started host-016
     lvmlockd (ocf::heartbeat:lvmlockd): Started host-016
   Started: [ host-016 host-017 host-018 ]
 Clone Set: STSRHTS178270-clone [STSRHTS178270]
   Resource Group: STSRHTS178270:0
     lv_STSRHTS178270 (ocf::heartbeat:LVM-activate): Started host-017
     gfs2_STSRHTS178270 (ocf::heartbeat:Filesystem): Started host-017
   Resource Group: STSRHTS178270:1
     lv_STSRHTS178270 (ocf::heartbeat:LVM-activate): Started host-018
     gfs2_STSRHTS178270 (ocf::heartbeat:Filesystem): Started host-018
   Resource Group: STSRHTS178270:2
     lv_STSRHTS178270 (ocf::heartbeat:LVM-activate): Started host-016
     gfs2_STSRHTS178270 (ocf::heartbeat:Filesystem): Started host-016
   Started: [ host-016 host-017 host-018 ]
...

[root@host-017 ~]# lvmconfig global/locking_type
Configuration node global/locking_type not found

[root@host-017 ~]# rpm -q lvm2 resource-agents
lvm2-2.03.02-4.el8.x86_64
resource-agents-4.1.1-17.el8.x86_64
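For reference, a cluster layout like the one verified above can be built with pcs commands along these lines. This is a hedged sketch: the resource, group, volume group, and mount point names (locking, shared, shared_vg, shared_lv, /mnt/shared) are illustrative placeholders, not taken from the test cluster.

```shell
# Cloned locking infrastructure: dlm, then lvmlockd, on every node.
pcs resource create dlm ocf:pacemaker:controld --group locking
pcs resource create lvmlockd ocf:heartbeat:lvmlockd --group locking
pcs resource clone locking interleave=true

# Shared-VG activation via LVM-activate, with the GFS2 mount on top.
pcs resource create lv_shared ocf:heartbeat:LVM-activate \
    vgname=shared_vg lvname=shared_lv vg_access_mode=lvmlockd \
    --group shared
pcs resource create gfs2_shared ocf:heartbeat:Filesystem \
    device=/dev/shared_vg/shared_lv directory=/mnt/shared fstype=gfs2 \
    --group shared
pcs resource clone shared interleave=true

# Start activation only after locking is up on the node. Note that, per
# this bug, these constraints order starts but not probes.
pcs constraint order start locking-clone then shared-clone
pcs constraint colocation add shared-clone with locking-clone
```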