Bug 1320520

Summary: LVM status checks for clustered LVM in Red Hat Cluster Suite give false results.
Product: Red Hat Enterprise Linux 6
Reporter: Vedran Zivicnjak <zvedran>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Priority: medium
Version: 6.7
CC: agk, cluster-maint, fdinitto, jruemker, mnovacek, prajnoha
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: resource-agents-3.9.5-37.el6
Last Closed: 2017-03-21 09:27:29 UTC
Type: Bug
Attachments:
Output of bash -x lvm.sh status command

Description Vedran Zivicnjak 2016-03-23 12:11:28 UTC
Created attachment 1139529 [details]
Output of bash -x lvm.sh status command

Description of problem:

 In Red Hat Cluster Suite on RHEL 6.7 with a clustered (CLVM) HA-LVM setup, volume
groups are created with the vgcreate -cy <vgname> <pvpath> command and
locking_type 3 is set in /etc/lvm/lvm.conf. In such a setup, each LV that is part of
a cluster resource is listed in cluster.conf so that rgmanager checks it periodically.

Rgmanager invokes the /usr/share/cluster/lvm.sh status command with environment
variables set to check the status of each cluster-managed logical volume. If the
status check is called with the $OCF_RESKEY_lv_name variable set, the
/usr/share/cluster/lvm.sh status command calls the lv_status function found in
the /usr/share/cluster/lvm_by_lv.sh script.

status|monitor)
        ocf_log notice "Getting status"

        if [ -z "$OCF_RESKEY_lv_name" ]; then
                vg_status
                exit $?
        else
                lv_status
                exit $?
        fi
        ;;

The lv_status function then checks whether the volume group named by the
$OCF_RESKEY_vg_name environment variable has the "c" (clustered) attribute flag
set, and if so calls the lv_status_clustered function within the same script.

function lv_status
{
        # We pass in the VG name to see if the logical volume is clustered
        if [[ $(vgs -o attr --noheadings $OCF_RESKEY_vg_name) =~ .....c ]];
then
                lv_status_clustered
        else
                lv_status_single
        fi
}


The lv_status_clustered function should check whether the given logical volume in
the volume group is active, as defined in the function:

lv_status_clustered()
{
        #
        # Check if device is active
        #
        if [[ ! "$(lvs -o attr --noheadings $lv_path)" =~ ....a. ]]; then
                return $OCF_NOT_RUNNING
        fi

        return $OCF_SUCCESS
}

The problem is that the variable $lv_path is never defined in the
/usr/share/cluster/lvm.sh status code path for this environment, which
results in executing:

if [[ ! "$(lvs -o attr --noheadings)" =~ ....a. ]]; 

a conditional statement that prints the attributes of ALL logical volumes on the
system and checks whether ANY of them has the active flag set.
This behaviour invalidates rgmanager's intended LV status check.
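
As a rough illustration (using the vg_testnim4/lv_testnim4 names from the
reproduce steps below), the difference between the broken and the intended
invocation is:

# Without a VG/LV argument, lvs prints the attr column for every LV on the
# system, so the regex matches as soon as ANY LV anywhere is active:
lvs -o attr --noheadings

# With an explicit VG/LV argument, only the LV being checked is examined:
lvs -o attr --noheadings vg_testnim4/lv_testnim4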

Proposed fix:

lv_status_clustered()
{
+	declare lv_path="$OCF_RESKEY_vg_name/$OCF_RESKEY_lv_name"
        #
        # Check if device is active
        #
        if [[ ! "$(lvs -o attr --noheadings $lv_path)" =~ ....a. ]]; then
                return $OCF_NOT_RUNNING
        fi

        return $OCF_SUCCESS
}


Version-Release number of selected component (if applicable):

resource-agents-3.9.5-24.el6_7.1.x86_64

How reproducible:
Always

Steps to Reproduce:
  The simplest way is to set up some number of LVs:
# lvs
  LV              VG              Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_testnim1     vg_testnim1     -wi-ao----  96.00m
  lv_testnim2     vg_testnim2     -wi-ao----  96.00m
  lv_testnim3     vg_testnim3     -wi-------  96.00m
  lv_testnim4     vg_testnim4     -wi-------  96.00m
  lv_testnim5     vg_testnim5     -wi-------  96.00m
# vgs

  VG              #PV #LV #SN Attr   VSize    VFree
  vg_testnim1       1   1   0 wz--nc   96.00m      0
  vg_testnim2       1   1   0 wz--nc   96.00m      0
  vg_testnim3       1   1   0 wz--nc   96.00m      0
  vg_testnim4       1   1   0 wz--nc   96.00m      0
  vg_testnim5       1   1   0 wz--nc   96.00m      0

and observe the output of the following commands:
export OCF_RESKEY_lv_name=lv_testnim4; export OCF_RESKEY_vg_name=vg_testnim4 ; bash -x /usr/share/cluster/lvm.sh status

Check the attachment for the given example.


Actual results:
 The attributes of every logical volume are printed in the conditional statement, and status returns 0 for a logical volume that is not active.

Expected results:
 The attributes of only the specified logical volume are printed in the conditional statement, and status returns non-zero for a logical volume that is not active.

Additional info:

This bug possibly puts an additional burden on the DLM component, as cluster locks may be communicated n*n times for each LVM status check instead of only n times, depending on DLM internals.

Comment 2 Alasdair Kergon 2016-03-23 13:11:43 UTC
Also consider variants like
lvs -o active
(instead of using the attr column)
or
lvs -o+active -S active!=""
(doing the matching internally)
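
As a rough, untested sketch of the first variant (not the shipped fix), the
check could read the lv_active field directly and treat a blank value as
"not active":

# A blank lv_active field means the LV is not active; lvs pads empty fields
# with whitespace, so strip it before testing. Exit status 0 here means active.
[ -n "$(lvs --noheadings -o active $OCF_RESKEY_vg_name/$OCF_RESKEY_lv_name | tr -d '[:space:]')" ]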

Comment 3 Peter Rajnoha 2016-03-23 13:29:49 UTC
For example, this should do it as well:

test -n "$(lvs --noheadings -o active -S active!='' vgname/lvname)"
  (device is active)

-S active!='' will select only LVs for which the "active" field is not empty and hence the LV is active. Similarly the other way round. See also lvs -S help for more info on the various fields and selection operators you can use. The -S|--select option is available since lvm2 v2.02.107.
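
Plugged into the resource agent, that check could look roughly like the
following untested sketch (not the shipped fix; it requires lvm2 v2.02.107 or
later for -S):

lv_status_clustered()
{
        declare lv_path="$OCF_RESKEY_vg_name/$OCF_RESKEY_lv_name"

        # With -S active!='' the LV is only listed at all when it is active,
        # so an empty result means the LV is not running.
        if [ -z "$(lvs --noheadings -o active -S active!='' $lv_path)" ]; then
                return $OCF_NOT_RUNNING
        fi

        return $OCF_SUCCESS
}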

Comment 4 Peter Rajnoha 2016-03-23 13:41:30 UTC
The "lv_active" field displays string values like "active", "local exclusive", "remote exclusive", "remotely", "locally".

But we have also separate binary fields:
  - lv_active_locally
  - lv_active_remotely
  - lv_active_exclusively

These can display "" (a blank string) or a string representation of the value meaning "yes". If you use the new --binary option, you get the result directly as 0 or 1.

So for example:

$ lvs vg -o+active_locally
  LV    VG Attr       LSize ActLocal      
  lvol0 vg -wi------- 4.00m               
  lvol1 vg -wi-a----- 4.00m active locally

$ lvs vg -o+active_locally --binary
  LV    VG Attr       LSize ActLocal  
  lvol0 vg -wi------- 4.00m          0
  lvol1 vg -wi-a----- 4.00m          1

$ lvs vg -o+active_locally -S 'active_locally=1'
  LV    VG Attr       LSize ActLocal      
  lvol1 vg -wi-a----- 4.00m active locally

$ lvs vg -o+active_locally -S 'active_locally=1' --binary
  LV    VG Attr       LSize ActLocal  
  lvol1 vg -wi-a----- 4.00m          1

$ lvs vg -o+active_locally -S 'active_locally="active locally"'         
  LV    VG Attr       LSize ActLocal      
  lvol1 vg -wi-a----- 4.00m active locally

$ lvs vg -o+active_locally -S 'active_locally=0'
  LV    VG Attr       LSize ActLocal
  lvol0 vg -wi------- 4.00m 

So it's worth a try, as it can simplify scripts a bit and make them more compatible with future versions of lvm2 (e.g. the number of lv_attr bits may change).
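
Applied to the agent's check, this could look roughly like the sketch below
(hedged: it assumes the agent only cares whether the LV is active on the local
node, a distinction the original lv_attr test does not make):

# --binary makes lvs print 0/1 for active_locally; strip the column padding
# before comparing.
lv_path="$OCF_RESKEY_vg_name/$OCF_RESKEY_lv_name"
if [ "$(lvs --noheadings -o active_locally --binary $lv_path | tr -d '[:space:]')" = "1" ]; then
        echo "$lv_path is active locally"
fi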

Comment 5 Peter Rajnoha 2016-03-23 13:48:18 UTC
Also, if needed, any combinations are possible, for example:

$ lvs -S 'active_locally=1 && vg_name=vg && (name=lvol0 || name=lvol1)'

$ lvs -S 'active_locally=1 || active_remotely=1'

So any field listed in lvs -S help can be used. Give it a few tries first and you'll get a more complete picture and can choose what best suits your script.

Comment 7 Oyvind Albrigtsen 2016-09-30 10:33:16 UTC
https://github.com/ClusterLabs/resource-agents/pull/853

Comment 9 michal novacek 2017-01-25 11:31:59 UTC
I have verified that lvm.sh is able to correctly report status for multiple vgs
with resource-agents-3.9.5-43.

----

common setup: 
    * cluster running on all nodes
    * clvmd running on all nodes
    * four cluster vgs created but only two active

[root@virt-009 ~]# vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  vg5          1   1   0 wz--nc 500,00m    0 
  vg6          1   1   0 wz--nc 500,00m    0 
  vg7          1   1   0 wz--nc 500,00m    0 
  vg8          1   1   0 wz--nc 500,00m    0 
  vg_virt009   1   2   0 wz--n-   7,83g    0

[root@virt-009 ~]# lvs
  LV      VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv5     vg5        -wi------- 500,00m                                                    
  lv6     vg6        -wi------- 500,00m                                                    
  lv7     vg7        -wi-a----- 500,00m                                                    
  lv8     vg8        -wi-a----- 500,00m                                                    
  lv_root vg_virt009 -wi-ao----   7,00g                                                    
  lv_swap vg_virt009 -wi-ao---- 852,00m                                                    


before the fix (resource-agents-3.9.5-24.el6)
=============================================

[root@virt-009 ~]# export OCF_RESKEY_lv_name=lv5; 
[root@virt-009 ~]# export OCF_RESKEY_vg_name=vg5 
[root@virt-009 ~]# bash -x /usr/share/cluster/lvm.sh status
...
[lvm.sh] Getting status
+ '[' -z lv5 ']'
+ lv_status
++ vgs -o attr --noheadings vg5
+ [[   wz--nc =~ .....c ]]
+ lv_status_clustered
++ lvs -o attr --noheadings
+ [[ !   -wi-------
  -wi-------
  -wi-a-----
  -wi-a-----
  -wi-ao----
  -wi-ao---- =~ ....a. ]]
>> + return 0
>> + exit 0

after the fix (resource-agents-3.9.5-40.el6)
============================================

[root@virt-009 ~]# export OCF_RESKEY_lv_name=lv5; 
[root@virt-009 ~]# export OCF_RESKEY_vg_name=vg5 
[root@virt-009 ~]# bash -x /usr/share/cluster/lvm.sh status
...
[lvm.sh] Getting status
+ '[' -z lv5 ']'
+ lv_status
++ vgs -o attr --noheadings vg5
+ [[   wz--nc =~ .....c ]]
+ lv_status_clustered
+ declare lv_path=vg5/lv5
++ lvs -o attr --noheadings vg5/lv5
+ [[ !   -wi------- =~ ....a. ]]
>> + return 7
>> + exit 7

Comment 11 errata-xmlrpc 2017-03-21 09:27:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0602.html