Bug 1628659
| Summary: | LVM-activate: needs to run lvm_validate before a stop action as well | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.6 | CC: | agk, cfeist, cluster-maint, fdinitto, lmiksik, mlisik, oalbrigt, teigland |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | resource-agents-4.1.1-11.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-30 11:40:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | UNTESTED diff (attachment 1483124) | | |
vg_access_mode is a required input parameter. lvm_validate() checks that the VGs are consistent with vg_access_mode, and it also converts the VG_access_mode variable from the string "lvmlockd" to the required integer 1, so that the stop path calls the proper ${access_mode}_deactivate() function. If the variable is still "lvmlockd" instead of one of 1, 2, 3, or 4, you end up with the "is not properly configured in cluster. It's unsafe!" error.
```sh
# lvm_validate (excerpt from the agent; unrelated parts elided with [...]):
lvm_validate() {
	[...]
	case ${VG_access_mode} in
	lvmlockd)
		VG_access_mode=1
		;;
	[...]
	esac
}
```
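To make the failure concrete, here is a minimal standalone sketch (not the agent code itself; the real stop-side dispatch is quoted in the description below) of what happens when stop runs without this conversion:

```sh
#!/bin/sh
# Sketch: stop never ran lvm_validate, so the access mode is still a string.
VG_access_mode="lvmlockd"

case ${VG_access_mode} in
1) echo "would call lvmlockd_deactivate" ;;
2) echo "would call clvmd_deactivate" ;;
3) echo "would call systemid_deactivate" ;;
4) echo "would call tagging_deactivate" ;;
*) echo "ERROR: VG is not properly configured in cluster. It's unsafe!" ;;
esac
# Prints the ERROR line: "lvmlockd" matches none of 1-4, exactly as in the logs.
```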
More info on how the resources were set up, with vg_access_mode provided:

```sh
pcs resource create lvm1 --group HA_LVM1 ocf:heartbeat:LVM-activate lvname="ha" vgname="HARDING1" activation_mode=exclusive vg_access_mode=lvmlockd
pcs resource create ha1 --group HA_LVM1 Filesystem device="/dev/HARDING1/ha" directory="/mnt/ha1" fstype="xfs" "options=noatime" op monitor interval=10s
pcs constraint order start lvmlockd-clone then HA_LVM1

pcs resource create lvm2 --group HA_LVM2 ocf:heartbeat:LVM-activate lvname="ha" vgname="HARDING2" activation_mode=exclusive vg_access_mode=lvmlockd
pcs resource create ha2 --group HA_LVM2 Filesystem device="/dev/HARDING2/ha" directory="/mnt/ha2" fstype="xfs" "options=noatime" op monitor interval=10s
pcs constraint order start lvmlockd-clone then HA_LVM2
```
```
[root@harding-02 ~]# pcs config
Cluster Name: HARDING
Corosync Nodes:
 harding-02 harding-03
Pacemaker Nodes:
 harding-02 harding-03

Resources:
 Clone: dlm_for_lvmlockd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm_for_lvmlockd (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s (dlm_for_lvmlockd-monitor-interval-30s)
               start interval=0s timeout=90 (dlm_for_lvmlockd-start-interval-0s)
               stop interval=0s timeout=100 (dlm_for_lvmlockd-stop-interval-0s)
 Clone: lvmlockd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: lvmlockd (class=ocf provider=heartbeat type=lvmlockd)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s (lvmlockd-monitor-interval-30s)
               start interval=0s timeout=90s (lvmlockd-start-interval-0s)
               stop interval=0s timeout=90s (lvmlockd-stop-interval-0s)
 Group: HA_LVM1
  Meta Attrs: target-role=Stopped
  Resource: lvm1 (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive lvname=ha vg_access_mode=lvmlockd vgname=HARDING1
   Operations: monitor interval=30s timeout=90s (lvm1-monitor-interval-30s)
               start interval=0s timeout=90s (lvm1-start-interval-0s)
               stop interval=0s timeout=90s (lvm1-stop-interval-0s)
  Resource: ha1 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/HARDING1/ha directory=/mnt/ha1 fstype=xfs options=noatime
   Operations: monitor interval=10s (ha1-monitor-interval-10s)
               notify interval=0s timeout=60s (ha1-notify-interval-0s)
               start interval=0s timeout=60s (ha1-start-interval-0s)
               stop interval=0s timeout=60s (ha1-stop-interval-0s)
 Group: HA_LVM2
  Meta Attrs: target-role=Stopped
  Resource: lvm2 (class=ocf provider=heartbeat type=LVM-activate)
   Attributes: activation_mode=exclusive lvname=ha vg_access_mode=lvmlockd vgname=HARDING2
   Operations: monitor interval=30s timeout=90s (lvm2-monitor-interval-30s)
               start interval=0s timeout=90s (lvm2-start-interval-0s)
               stop interval=0s timeout=90s (lvm2-stop-interval-0s)
  Resource: ha2 (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/HARDING2/ha directory=/mnt/ha2 fstype=xfs options=noatime
   Operations: monitor interval=10s (ha2-monitor-interval-10s)
               notify interval=0s timeout=60s (ha2-notify-interval-0s)
               start interval=0s timeout=60s (ha2-start-interval-0s)
               stop interval=0s timeout=60s (ha2-stop-interval-0s)

Stonith Devices:
 Resource: smoke-apc (class=stonith type=fence_apc)
  Attributes: delay=5 ipaddr=smoke-apc login=apc passwd=apc pcmk_host_check=static-list pcmk_host_list=harding-02,harding-03 pcmk_host_map=harding-02:3;harding-03:4 switch=1
  Operations: monitor interval=60s (smoke-apc-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start lvmlockd-clone then start HA_LVM1 (kind:Mandatory)
  start lvmlockd-clone then start HA_LVM2 (kind:Mandatory)
  Resource Sets:
    set dlm_for_lvmlockd-clone lvmlockd-clone sequential=true
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: HARDING
 dc-version: 1.1.19-7.el7-c3c624ea3d
 have-watchdog: false
 last-lrm-refresh: 1536851845
 no-quorum-policy: freeze

Quorum:
  Options:
```
It's hard to understand how this agent wasn't sanity checked when it was originally written. The code for handling the access mode is also rather sloppy, overwriting the string value with a numeric value in the same variable. The processing of input parameters should be done outside of lvm_validate, since stop/status shouldn't be doing the rest of the validate function.

Created attachment 1483124 [details]
UNTESTED diff

This diff illustrates the kind of changes I think make sense here. It's untested, so it will need some verification.
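The attachment itself is not reproduced here. As an illustration only, the suggested direction (moving the input-parameter processing out of lvm_validate so that every action, including stop, gets the string-to-int conversion) could look roughly like this; the function name lvm_process_params is hypothetical, and the numeric mapping follows the deactivate dispatch quoted in the description below:

```sh
# Hypothetical sketch only -- the real change is in attachment 1483124.
# Convert vg_access_mode once, for every action, before dispatching.
lvm_process_params() {
	case ${VG_access_mode} in
	lvmlockd) VG_access_mode=1 ;;
	clvmd)    VG_access_mode=2 ;;
	systemid) VG_access_mode=3 ;;
	tagging)  VG_access_mode=4 ;;
	*)
		ocf_exit_reason "Invalid vg_access_mode: ${VG_access_mode}"
		exit $OCF_ERR_CONFIGURED
		;;
	esac
}

lvm_process_params    # runs for start, stop, monitor, ...
case $__OCF_ACTION in
start)
	lvm_validate    # full consistency checks only where they are needed
	lvm_start
	;;
stop)
	lvm_stop        # now sees the numeric access mode
	;;
esac
```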
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3278
Description of problem:
I'm not sure if I'm missing a reason why lvm_validate is not run in stop mode, but without the lvm_validate run, the disable/stop action fails with a "not properly configured in cluster" error when in lvmlockd mode. The reality is that the access mode just never gets set from "lvmlockd" to "1" for the proper case selection. The old LVM agent also doesn't run a validate when stopping, but it didn't call different deactivate functions based on access mode either.

I added extra debugging to see what was causing the script to think my config was invalid.

Current: pcs resource enable HA_LVM1

```
Sep 13 11:01:52 harding-03 LVM-activate(lvm1)[41037]: ERROR: IN get_VG_access_mode RETURNING MODE: 1
Sep 13 11:01:53 harding-03 LVM-activate(lvm1)[41037]: ERROR: WE ARE about to run 1 _CHECK
Sep 13 11:01:53 harding-03 LVM-activate(lvm1)[41037]: INFO: Activating HARDING1/ha ACCESS: 1
Sep 13 11:01:53 harding-03 LVM-activate(lvm1)[41037]: ERROR: WE ARE IN lvmlockd_Activate!!!!
[88247.745118] dlm: Using TCP for communications
Sep 13 11:01:53 harding-03 kernel: dlm: Using TCP for communications
```

Current: pcs resource disable HA_LVM1

The stop path selects the deactivate function based on the access mode:

```sh
case ${VG_access_mode} in
1)
	lvmlockd_deactivate
	;;
2)
	clvmd_deactivate
	;;
3)
	systemid_deactivate
	;;
4)
	tagging_deactivate
	;;
*)
	ocf_log err "VG [${VG}] is not properly configured in cluster. It's unsafe! ACCESS MODE:${VG_access_mode}"
	exit $OCF_SUCCESS
	;;
esac
```

```
Sep 13 11:02:35 harding-03 kernel: dlm: got connection from 1
Sep 13 11:03:15 harding-03 Filesystem(ha1)[41976]: INFO: Running stop for /dev/HARDING1/ha on /mnt/ha1
Sep 13 11:03:15 harding-03 Filesystem(ha1)[41976]: INFO: Trying to unmount /mnt/ha1
[88330.295675] XFS (dm-23): Unmounting Filesystem
Sep 13 11:03:15 harding-03 kernel: XFS (dm-23): Unmounting Filesystem
Sep 13 11:03:15 harding-03 Filesystem(ha1)[41976]: INFO: unmounted /mnt/ha1 successfully
Sep 13 11:03:15 harding-03 crmd[39755]: notice: Result of stop operation for ha1 on harding-03: 0 (ok)
Sep 13 11:03:15 harding-03 LVM-activate(lvm1)[42055]: INFO: Deactivating HARDING1/ha ACCESS:lvmlockd
Sep 13 11:03:15 harding-03 LVM-activate(lvm1)[42055]: ERROR: VG [HARDING1] is not properly configured in cluster. It's unsafe! ACCESS MODE:lvmlockd
```

Adding a lvm_validate before the stop:

```sh
stop)
	lvm_validate
	lvm_stop
	;;
```

```
Sep 13 11:10:06 harding-03 Filesystem(ha1)[44044]: INFO: Running stop for /dev/HARDING1/ha on /mnt/ha1
Sep 13 11:10:06 harding-03 Filesystem(ha1)[44044]: INFO: Trying to unmount /mnt/ha1
[88741.144433] XFS (dm-23): Unmounting Filesystem
Sep 13 11:10:06 harding-03 kernel: XFS (dm-23): Unmounting Filesystem
Sep 13 11:10:06 harding-03 Filesystem(ha1)[44044]: INFO: unmounted /mnt/ha1 successfully
Sep 13 11:10:06 harding-03 crmd[39755]: notice: Result of stop operation for ha1 on harding-03: 0 (ok)
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: ERROR: IN get_VG_access_mode RETURNING MODE: 1
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: ERROR: WE ARE about to run 1 _CHECK
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: INFO: Deactivating HARDING1/ha ACCESS:1
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: ERROR: WE ARE IN lvmlockd_deactivate!!!!
Sep 13 11:10:07 harding-03 dmeventd[25300]: No longer monitoring RAID device HARDING1-ha for events.
Sep 13 11:10:07 harding-03 LVM-activate(lvm1)[44123]: INFO: HARDING1/ha: deactivated successfully.
```

Version-Release number of selected component (if applicable):
resource-agents-4.1.1-10.el7

How reproducible:
Every time
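With the fixed build (resource-agents-4.1.1-11.el7, per "Fixed In Version" above) installed, one quick way to confirm the stop path against the resources configured above; this verification snippet is a suggestion, not from the original report:

```sh
pcs resource enable HA_LVM1
pcs resource disable HA_LVM1
# The disable should now reach lvmlockd_deactivate cleanly instead of the
# "not properly configured in cluster" error; the LV should report inactive:
lvs -o lv_name,lv_active HARDING1
```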