Bug 1628659
| Summary: | LVM-activate: needs to run lvm_validate before a stop action as well | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.6 | CC: | agk, cfeist, cluster-maint, fdinitto, lmiksik, mlisik, oalbrigt, teigland |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | resource-agents-4.1.1-11.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-30 11:40:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | UNTESTED diff (attachment 1483124) | | |
vg_access_mode is a required input parameter. lvm_validate() checks that the VGs are consistent with vg_access_mode, and it also converts the VG_access_mode variable from the string "lvmlockd" to the required integer 1, so that the stop path calls the proper ${access_mode}_deactivate() function. If the variable is still "lvmlockd" instead of one of 1, 2, 3, or 4, you end up with the "is not properly configured in cluster. It's unsafe!" error.
```sh
# lvm_validate (excerpt from the agent; unrelated parts elided with [...]):
lvm_validate() {
	[...]
	case ${VG_access_mode} in
	lvmlockd)
		VG_access_mode=1
		;;
	[...]
	esac
}
```
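To make the failure concrete, here is a minimal standalone sketch (not the agent code itself; the real stop-side dispatch is quoted in the description below) of what happens when stop runs without this conversion:

```sh
#!/bin/sh
# Sketch: stop never ran lvm_validate, so the access mode is still a string.
VG_access_mode="lvmlockd"

case ${VG_access_mode} in
1) echo "would call lvmlockd_deactivate" ;;
2) echo "would call clvmd_deactivate" ;;
3) echo "would call systemid_deactivate" ;;
4) echo "would call tagging_deactivate" ;;
*) echo "ERROR: VG is not properly configured in cluster. It's unsafe!" ;;
esac
# Prints the ERROR line: "lvmlockd" matches none of 1-4, exactly as in the logs.
```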
More info on how the resources were set up, with vg_access_mode provided:

```sh
pcs resource create lvm1 --group HA_LVM1 ocf:heartbeat:LVM-activate lvname="ha" vgname="HARDING1" activation_mode=exclusive vg_access_mode=lvmlockd
pcs resource create ha1 --group HA_LVM1 Filesystem device="/dev/HARDING1/ha" directory="/mnt/ha1" fstype="xfs" "options=noatime" op monitor interval=10s
pcs constraint order start lvmlockd-clone then HA_LVM1

pcs resource create lvm2 --group HA_LVM2 ocf:heartbeat:LVM-activate lvname="ha" vgname="HARDING2" activation_mode=exclusive vg_access_mode=lvmlockd
pcs resource create ha2 --group HA_LVM2 Filesystem device="/dev/HARDING2/ha" directory="/mnt/ha2" fstype="xfs" "options=noatime" op monitor interval=10s
pcs constraint order start lvmlockd-clone then HA_LVM2
```
```
[root@harding-02 ~]# pcs config
Cluster Name: HARDING
Corosync Nodes:
 harding-02 harding-03
Pacemaker Nodes:
 harding-02 harding-03

Resources:
 Clone: dlm_for_lvmlockd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm_for_lvmlockd (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s (dlm_for_lvmlockd-monitor-interval-30s)
               start interval=0s timeout=90 (dlm_for_lvmlockd-start-interval-0s)
               stop interval=0s timeout=100 (dlm_for_lvmlockd-stop-interval-0s)
 Clone: lvmlockd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: lvmlockd (class=ocf provider=heartbeat type=lvmlockd)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s (lvmlockd-monitor-interval-30s)
               start interval=0s timeout=90s (lvmlockd-start-interval-0s)
               stop interval=0s timeout=90s (lvmlockd-stop-interval-0s)
 Group: HA_LVM1
  Meta Attrs: target-role=Stopped
  Resource: lvm1 (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive lvname=ha vg_access_mode=lvmlockd vgname=HARDING1
   Operations: monitor interval=30s timeout=90s (lvm1-monitor-interval-30s)
               start interval=0s timeout=90s (lvm1-start-interval-0s)
               stop interval=0s timeout=90s (lvm1-stop-interval-0s)
  Resource: ha1 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/HARDING1/ha directory=/mnt/ha1 fstype=xfs options=noatime
   Operations: monitor interval=10s (ha1-monitor-interval-10s)
               notify interval=0s timeout=60s (ha1-notify-interval-0s)
               start interval=0s timeout=60s (ha1-start-interval-0s)
               stop interval=0s timeout=60s (ha1-stop-interval-0s)
 Group: HA_LVM2
  Meta Attrs: target-role=Stopped
  Resource: lvm2 (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive lvname=ha vg_access_mode=lvmlockd vgname=HARDING2
   Operations: monitor interval=30s timeout=90s (lvm2-monitor-interval-30s)
               start interval=0s timeout=90s (lvm2-start-interval-0s)
               stop interval=0s timeout=90s (lvm2-stop-interval-0s)
  Resource: ha2 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/HARDING2/ha directory=/mnt/ha2 fstype=xfs options=noatime
   Operations: monitor interval=10s (ha2-monitor-interval-10s)
               notify interval=0s timeout=60s (ha2-notify-interval-0s)
               start interval=0s timeout=60s (ha2-start-interval-0s)
               stop interval=0s timeout=60s (ha2-stop-interval-0s)

Stonith Devices:
 Resource: smoke-apc (class=stonith type=fence_apc)
  Attributes: delay=5 ipaddr=smoke-apc login=apc passwd=apc pcmk_host_check=static-list pcmk_host_list=harding-02,harding-03 pcmk_host_map=harding-02:3;harding-03:4 switch=1
  Operations: monitor interval=60s (smoke-apc-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start lvmlockd-clone then start HA_LVM1 (kind:Mandatory)
  start lvmlockd-clone then start HA_LVM2 (kind:Mandatory)
  Resource Sets:
    set dlm_for_lvmlockd-clone lvmlockd-clone sequential=true
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: HARDING
 dc-version: 1.1.19-7.el7-c3c624ea3d
 have-watchdog: false
 last-lrm-refresh: 1536851845
 no-quorum-policy: freeze

Quorum:
  Options:
```
It's hard to understand how this agent wasn't sanity checked when it was originally written. The code for handling the access mode is also rather sloppy, overwriting the string value with a numeric value in the same variable. The processing of input parameters should be done outside of lvm_validate, since stop/status shouldn't be doing the rest of the validate function.

Created attachment 1483124 [details]
UNTESTED diff

This diff illustrates the kind of changes I think make sense here. It's untested, so it will need some verification.
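The attachment itself is not reproduced here. As an illustration only, the suggested direction (moving the input-parameter processing out of lvm_validate so that every action, including stop, gets the string-to-int conversion) could look roughly like this; the function name lvm_process_params is hypothetical, and the numeric mapping follows the deactivate dispatch quoted in the description below:

```sh
# Hypothetical sketch only -- the real change is in attachment 1483124.
# Convert vg_access_mode once, for every action, before dispatching.
lvm_process_params() {
	case ${VG_access_mode} in
	lvmlockd) VG_access_mode=1 ;;
	clvmd)    VG_access_mode=2 ;;
	systemid) VG_access_mode=3 ;;
	tagging)  VG_access_mode=4 ;;
	*)
		ocf_exit_reason "Invalid vg_access_mode: ${VG_access_mode}"
		exit $OCF_ERR_CONFIGURED
		;;
	esac
}

lvm_process_params    # runs for start, stop, monitor, ...
case $__OCF_ACTION in
start)
	lvm_validate    # full consistency checks only where they are needed
	lvm_start
	;;
stop)
	lvm_stop        # now sees the numeric access mode
	;;
esac
```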
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3278
Description of problem:
I'm not sure if I'm missing a reason why lvm_validate is not run in stop mode, but without the lvm_validate run, the disable/stop action fails with a "not properly configured in cluster" error when in lvmlockd mode. The reality is that the access mode just never gets set from "lvmlockd" to "1" for the proper case selection. The old LVM agent also doesn't run a validate when stopping, but it didn't call different deactivate functions based on access mode either.

I added extra debugging to see what was causing the script to think my config was invalid.

Current: pcs resource enable HA_LVM1

```
Sep 13 11:01:52 harding-03 LVM-activate(lvm1)[41037]: ERROR: IN get_VG_access_mode RETURNING MODE: 1
Sep 13 11:01:53 harding-03 LVM-activate(lvm1)[41037]: ERROR: WE ARE about to run 1 _CHECK
Sep 13 11:01:53 harding-03 LVM-activate(lvm1)[41037]: INFO: Activating HARDING1/ha ACCESS: 1
Sep 13 11:01:53 harding-03 LVM-activate(lvm1)[41037]: ERROR: WE ARE IN lvmlockd_Activate!!!!
[88247.745118] dlm: Using TCP for communications
Sep 13 11:01:53 harding-03 kernel: dlm: Using TCP for communications
```

Current: pcs resource disable HA_LVM1

The stop path selects the deactivate function based on the access mode:

```sh
case ${VG_access_mode} in
1)
	lvmlockd_deactivate
	;;
2)
	clvmd_deactivate
	;;
3)
	systemid_deactivate
	;;
4)
	tagging_deactivate
	;;
*)
	ocf_log err "VG [${VG}] is not properly configured in cluster. It's unsafe! ACCESS MODE:${VG_access_mode}"
	exit $OCF_SUCCESS
	;;
esac
```

```
Sep 13 11:02:35 harding-03 kernel: dlm: got connection from 1
Sep 13 11:03:15 harding-03 Filesystem(ha1)[41976]: INFO: Running stop for /dev/HARDING1/ha on /mnt/ha1
Sep 13 11:03:15 harding-03 Filesystem(ha1)[41976]: INFO: Trying to unmount /mnt/ha1
[88330.295675] XFS (dm-23): Unmounting Filesystem
Sep 13 11:03:15 harding-03 kernel: XFS (dm-23): Unmounting Filesystem
Sep 13 11:03:15 harding-03 Filesystem(ha1)[41976]: INFO: unmounted /mnt/ha1 successfully
Sep 13 11:03:15 harding-03 crmd[39755]: notice: Result of stop operation for ha1 on harding-03: 0 (ok)
Sep 13 11:03:15 harding-03 LVM-activate(lvm1)[42055]: INFO: Deactivating HARDING1/ha ACCESS:lvmlockd
Sep 13 11:03:15 harding-03 LVM-activate(lvm1)[42055]: ERROR: VG [HARDING1] is not properly configured in cluster. It's unsafe! ACCESS MODE:lvmlockd
```

Adding a lvm_validate before the stop:

```sh
stop)
	lvm_validate
	lvm_stop
	;;
```

```
Sep 13 11:10:06 harding-03 Filesystem(ha1)[44044]: INFO: Running stop for /dev/HARDING1/ha on /mnt/ha1
Sep 13 11:10:06 harding-03 Filesystem(ha1)[44044]: INFO: Trying to unmount /mnt/ha1
[88741.144433] XFS (dm-23): Unmounting Filesystem
Sep 13 11:10:06 harding-03 kernel: XFS (dm-23): Unmounting Filesystem
Sep 13 11:10:06 harding-03 Filesystem(ha1)[44044]: INFO: unmounted /mnt/ha1 successfully
Sep 13 11:10:06 harding-03 crmd[39755]: notice: Result of stop operation for ha1 on harding-03: 0 (ok)
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: ERROR: IN get_VG_access_mode RETURNING MODE: 1
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: ERROR: WE ARE about to run 1 _CHECK
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: INFO: Deactivating HARDING1/ha ACCESS:1
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: ERROR: WE ARE IN lvmlockd_deactivate!!!!
Sep 13 11:10:07 harding-03 dmeventd[25300]: No longer monitoring RAID device HARDING1-ha for events.
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: INFO: HARDING1/ha: deactivated successfully.
```

Version-Release number of selected component (if applicable):
resource-agents-4.1.1-10.el7

How reproducible:
Every time
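With the fixed build (resource-agents-4.1.1-11.el7, per "Fixed In Version" above) installed, one quick way to confirm the stop path against the resources configured above; this verification snippet is a suggestion, not from the original report:

```sh
pcs resource enable HA_LVM1
pcs resource disable HA_LVM1
# The disable should now reach lvmlockd_deactivate cleanly instead of the
# "not properly configured in cluster" error; the LV should report inactive:
lvs -o lv_name,lv_active HARDING1
```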