Bug 1902433 - LVM-activate: Monitor operation succeeds when PV has been removed
Summary: LVM-activate: Monitor operation succeeds when PV has been removed
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: resource-agents
Version: 8.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-28 22:31 UTC by Reid Wahl
Modified: 2024-10-01 17:08 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 19:02:41 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
test script based on resource-agents-4.1.1-68.el8 plus commits b325f97e and be5e92e1 (24.86 KB, application/x-shellscript)
2020-12-04 02:25 UTC, Reid Wahl


Links
System ID Private Priority Status Summary Last Updated
Github ClusterLabs resource-agents pull 1600 0 None open LVM-activate: Improve return codes and handling of missing PVs 2021-01-29 18:12:51 UTC
Red Hat Issue Tracker   RHEL-7638 0 None Migrated None 2023-09-22 19:02:35 UTC
Red Hat Knowledge Base (Solution) 5605971 0 None None None 2020-11-28 23:00:47 UTC

Description Reid Wahl 2020-11-28 22:31:41 UTC
Description of problem:

If the managed volume group's underlying PV has been removed, the LVM-activate monitor operation still succeeds.

    [root@fastvm-rhel-8-0-24 ~]# pcs resource config lvm
     Resource: lvm (class=ocf provider=heartbeat type=LVM-activate)
      Attributes: activation_mode=exclusive vg_access_mode=system_id vgname=test_vg1
    
    [root@fastvm-rhel-8-0-24 ~]# pcs resource debug-start lvm
    Operation start for lvm (ocf:heartbeat:LVM-activate) returned: 'ok' (0)
    
    [root@fastvm-rhel-8-0-24 ~]# vgs test_vg1
      VG       #PV #LV #SN Attr   VSize   VFree  
      test_vg1   1   1   0 wz--n- 992.00m 696.00m
    
    [root@fastvm-rhel-8-0-24 ~]# /usr/sbin/iscsiadm -m node --logoutall=all
    Logging out of session [sid: 1, target: iqn.2003-01.org.linux-iscsi.fastvm-rhel-7-6-51.x8664:sn.9677dbd9a870, portal: 192.168.22.51,3260]
    Logout of [sid: 1, target: iqn.2003-01.org.linux-iscsi.fastvm-rhel-7-6-51.x8664:sn.9677dbd9a870, portal: 192.168.22.51,3260] successful.
    
    [root@fastvm-rhel-8-0-24 ~]# vgs test_vg1
      Volume group "test_vg1" not found.
      Cannot process volume group test_vg1
    
    [root@fastvm-rhel-8-0-24 ~]# pcs resource debug-monitor lvm
    Operation monitor for lvm (ocf:heartbeat:LVM-activate) returned: 'ok' (0)


This is because the monitor operation checks `dmsetup info` for the VG. When the PV is abruptly removed, the mapping does not get removed from device-mapper, so the VG still looks active in the dmsetup output.

~~~
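        # Tail end of the agent's status function, quoted for context: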
        else
                dm_count=$(dmsetup info --noheadings --noflush -c -S "vg_name=${VG}" | grep -c -v '^No devices found')
        fi

        if [ $dm_count -eq 0 ]; then
                return $OCF_NOT_RUNNING
        fi

        return $OCF_SUCCESS
}
~~~

We need to add a more reliable test for the existence of the VG, either in addition to or in place of `dmsetup info`.

Alternatively, if this is considered a bug in dmsetup info, then we should get that fixed. (I assume it's behaving as expected and that there's no mechanism to remove the mapping in this case.)

-----

Version-Release number of selected component (if applicable):

resource-agents-4.1.1-68.el8.x86_64

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create and start an LVM-activate resource.
2. Remove the managed VG's underlying PV (e.g., log out of the iSCSI session if it's presented via iSCSI).
3. Run the resource's monitor operation.

-----

Actual results:

Monitor operation succeeds.

-----

Expected results:

Monitor operation fails.

-----

Additional info:

If we make the monitor operation fail, then the stop operation is also likely to fail because the missing VG can't be deactivated (see also BZ 1902208). This probably isn't desirable behavior. If the volume group fails to deactivate **because it doesn't exist**, this should probably be considered a successful stop. However, I could see that point as debatable, since the VG doesn't get to stop cleanly.

Comment 1 Reid Wahl 2020-11-29 02:25:46 UTC
This is sort of a known issue per the discussion in the comments above the lvm_status() function.

This is addressed upstream in https://github.com/ClusterLabs/resource-agents/commit/b325f97e.

To me it seems better to keep it the way it was in commit a299281a so that an explicit OCF_CHECK_LEVEL=10 is not required... I would think that testing whether the device is active should be a default behavior, which could possibly be disabled by an attribute. Either way, this solves the problem.

-----

There is one possible option that doesn't seem to be addressed in the comments or in the commit listed above: check whether any of a VG's block devices are presented/visible to the OS, **without** attempting I/O to one of the LVs. In the case reported in this BZ, simply checking for the existence of the PV would have been sufficient.
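
To make that concrete, here is a minimal sketch of such a presence check (not the agent's code; it assumes the DM table for the LV is still loaded, as it is in this scenario, and that the kernel drops the /sys/dev/block entry when the underlying device disappears, as it does after an iSCSI logout):

~~~
# Hypothetical helper, sketch only: succeed if every device-mapper
# dependency of the given LV still exists as a block device in sysfs.
lv_deps_present() {
        lv_path=$1 found=0

        # "1 dependencies   : (8, 16) (8, 32)"  ->  "8:16 8:32"
        deps=$(dmsetup deps "$lv_path" 2>/dev/null \
                | sed -e 's/^[^:]*://' -e 's/[()]//g' -e 's/, /:/g')

        for majmin in $deps; do
                found=1
                # No I/O here: only ask sysfs whether the device still exists.
                [ -e "/sys/dev/block/$majmin" ] || return 1
        done

        [ "$found" -eq 1 ]   # fail if we could not list any dependencies
}

lv_deps_present /dev/test_vg1/test_lv1 || echo "test_vg1: underlying PV(s) missing"
~~~

This only proves that the dependency device nodes still exist; it says nothing about their health.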

Comment 2 Reid Wahl 2020-11-29 02:26:35 UTC
(In reply to Reid Wahl from comment #1)
> There is one possible option that doesn't seem to be addressed in the
> comments or in the commit listed above: check whether any of a VG's block
> devices are presented/visible to the OS, **without** attempting I/O to one
> of the LVs. In the case reported in this BZ, simply checking for the
> existence of the PV would have been sufficient.

Since LVM commands are off-limits, this would require checking metadata that's saved somewhere like /etc/lvm/backup, if it's a feasible option at all.
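
If that route were ever explored, it might look roughly like this sketch; the `device = "..."` paths recorded in the backup file are only hints and can go stale, which is part of why it may not be feasible:

~~~
# Sketch: pull the PV device hints for a VG out of its metadata backup
# and check that each is still present as a block device.  Purely
# illustrative -- the backup paths are hints, not authoritative.
VG=test_vg1
backup=/etc/lvm/backup/$VG

[ -r "$backup" ] || { echo "no metadata backup for $VG"; exit 1; }

grep 'device = "' "$backup" \
        | sed -e 's/.*device = "//' -e 's/".*//' \
        | while read -r pv; do
                [ -b "$pv" ] || echo "PV $pv (from $backup) is not present"
        done
~~~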

Comment 3 Reid Wahl 2020-11-29 02:51:55 UTC
(In reply to Reid Wahl from comment #2)
> Since LVM commands are off-limits, this would require checking metadata
> that's saved somewhere like /etc/lvm/backup, if it's a feasible option at
> all.

I wonder whether parsing for a blkdevname in `dmsetup deps` is a valid test.

Where test_vg1/test_lv1's PV has been removed and r8vg/root_lv's PVs are present, r8vg/root_lv's output has true blkdevnames, while test_vg1/test_lv1 gives only major/minor numbers since the block device does not exist anymore:

~~~
[root@fastvm-rhel-8-0-24 ~]# dmsetup deps -o blkdevname /dev/test_vg1/test_lv1 
1 dependencies	: (8, 16)

[root@fastvm-rhel-8-0-24 ~]# dmsetup deps -o blkdevname /dev/r8vg/root_lv 
2 dependencies	: (sda3) (sda2)
~~~

Comment 5 Milind 2020-12-02 02:33:05 UTC
By any chance, can you use the blockdev command?


iscsiadm -m node -T <target> --portal <portal> -u
Logging out of session [sid: 4, target: , portal: ]
Logout of [sid: 4, target:, portal:] successful.

dmsetup info --noheadings --noflush -c -S "vg_name=vg_failover"
vg_failover-lv_failover:253:10:L--w:1:1:0:LVM-R0vbNuh7C3QBnTJqgV7Kqd395jZzIS1khjMPXqgbAmjsqMrxGVct26KsH1S9cXtY

 dmsetup deps -o devname /dev/vg_failover/lv_failover
1 dependencies  : (8, 240)

 blockdev --report /dev/sdp
RO    RA   SSZ   BSZ   StartSec            Size   Device
blockdev: cannot open /dev/sdp: No such file or directory

blockdev -getro /dev/sdp
blockdev: cannot open /dev/sdp: No such file or directory
[root@gie9viaas136148 lib]#

Comment 6 Reid Wahl 2020-12-02 02:47:32 UTC
(In reply to Milind from comment #5)
> By any chance, can you use the blockdev command?

That depends on knowing what the PVs are. In your example, you used /dev/sdp as an argument to the blockdev command. The difficulty here is that the LVM experts who helped develop the resource agent ruled out using LVM commands such as `pvs` as part of the monitor operation.

So I don't see a straightforward way of getting a VG's list of PVs.

My only idea, besides the patch in https://github.com/ClusterLabs/resource-agents/commit/b325f97e, is to parse the output of `dmsetup deps -o blkdevname` and check whether all of the dependencies have commas. The idea is that if a dependency has a comma, the block device is missing; if all the block devices are missing, then all the PVs are missing, and thus the VG is missing.

However, I don't know if this is truly a valid test. I also don't know whether it would break when a volume group contains things like thinpools and mirrors. I've put only limited thought/testing into that approach so far.
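
For what it's worth, here is a rough sketch of that comma heuristic (untested beyond the simple case in this bug, and explicitly not validated against thinpools, mirrors, or other stacked targets):

~~~
# Sketch of the comma heuristic: with "-o blkdevname", a dependency whose
# block device still exists prints as "(sda3)", while a missing one falls
# back to "(8, 16)".  If every dependency of every LV in the VG contains
# a comma, assume all PVs are gone.
VG=test_vg1
all_missing=1

for lv in /dev/"$VG"/*; do
        deps=$(dmsetup deps -o blkdevname "$lv" 2>/dev/null | sed 's/^[^:]*://')
        # A comma-free "(name)" entry means that block device is present.
        if printf '%s\n' "$deps" | grep -q '([^,)]*)'; then
                all_missing=0
        fi
done

[ "$all_missing" -eq 1 ] && echo "all PVs of $VG appear to be missing"
~~~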

It's entirely possible that the existing approach, plus the patch in b325f97e, is the best option despite its limitations.

Comment 7 Milind 2020-12-03 13:53:15 UTC
When do you think we can test it? It may become a showstopper for us, and we need some kind of solution or workaround.

Comment 8 Reid Wahl 2020-12-04 02:24:26 UTC
(In reply to Milind from comment #7)
> When do you think we can test it? It may become a showstopper for us, and
> we need some kind of solution or workaround.

I'll attach a modified resource agent containing commits b325f97e and be5e92e1[1]. You can:
  - Copy it to /usr/lib/ocf/resource.d/custom/LVM-activate.
  - Run `chmod 755 /usr/lib/ocf/resource.d/custom/LVM-activate`.
  - Create a resource of class ocf:custom:LVM-activate with OCF_CHECK_LEVEL=10 for the monitor operation (e.g., `pcs resource create <rsc_name> ocf:custom:LVM-activate <options> op monitor timeout=90s interval=30s OCF_CHECK_LEVEL=10`).
    

[1] https://github.com/ClusterLabs/resource-agents/commit/be5e92e1

Comment 9 Reid Wahl 2020-12-04 02:25:34 UTC
Created attachment 1736286 [details]
test script based on resource-agents-4.1.1-68.el8 plus commits b325f97e and be5e92e1

Place in /usr/lib/ocf/resource.d/custom/LVM-activate.

Comment 10 Milind 2020-12-09 13:34:15 UTC
Hi Reid,
I tested it, but it is not working for me.

Here are the test results:

1) copy the script

[root@gie9viaas136148 custom]# pwd
/usr/lib/ocf/resource.d/custom
[root@gie9viaas136148 custom]# ls -la LVM-activate
-rwxr-xr-x. 1 root root 25456 Dec  8 19:29 LVM-activate
[root@gie9viaas136148 custom]#


 grep OCF_CHECK_LEVEL LVM-activate
Option: OCF_CHECK_LEVEL
If you want deeper tests, set OCF_CHECK_LEVEL to 10:
        case "$OCF_CHECK_LEVEL" in
                        ocf_exit_reason "unsupported monitor level $OCF_CHECK_LEVEL"


2) update resource 

[root@gie9viaas136148 custom]# pcs resource update vg_failover op monitor timeout=90s interval=30s OCF_CHECK_LEVEL=10

[root@gie9viaas136148 custom]# pcs resource show vg_failover
Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.
 Resource: vg_failover (class=ocf provider=heartbeat type=LVM-activate)
  Attributes: vg_access_mode=system_id vgname=vg_failover
  Operations: monitor interval=30s timeout=90s OCF_CHECK_LEVEL=10 (vg_failover-monitor-interval-30s)
              start interval=0s timeout=90s (vg_failover-start-interval-0s)
              stop interval=0s timeout=90s (vg_failover-stop-interval-0s)
[root@gie9viaas136148 custom]#

3) check current status 

[root@gie9viaas136148 custom]# pcs status
Cluster name: NA_DEV_GIELAB
Cluster Summary:
  * Stack: corosync
  * Current DC: gie9viaas136149 (version 2.0.3-5.el8-4b1f869f0f) - partition with quorum
  * Last updated: Wed Dec  9 13:14:36 2020
  * Last change:  Wed Dec  9 13:13:49 2020 by root via cibadmin on gie9viaas136148
  * 2 nodes configured
  * 16 resource instances configured

Node List:
  * Online: [ gie9viaas136148 gie9viaas136149 ]

Full List of Resources:
  * kdump       (stonith:fence_kdump):  Started gie9viaas136149
  * scsi        (stonith:fence_scsi):   Started gie9viaas136149
  * Resource Group: failovergroup:
    * vg_failover       (ocf::heartbeat:LVM-activate):  Started gie9viaas136148
    * www_fs    (ocf::heartbeat:Filesystem):    Started gie9viaas136148
    * VirtualIP (ocf::heartbeat:IPaddr2):       Started gie9viaas136148
    * Website   (ocf::heartbeat:apache):        Started gie9viaas136148
  * Clone Set: clvm-clone [clvm]:
    * Started: [ gie9viaas136148 gie9viaas136149 ]
  * Clone Set: sharedgroup-clone [sharedgroup]:
    * Started: [ gie9viaas136148 gie9viaas136149 ]
  * Resource Group: mirrorgroup:
    * vg_mirror (ocf::heartbeat:LVM-activate):  Started gie9viaas136148
    * mirror_fs (ocf::heartbeat:Filesystem):    Started gie9viaas136148

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@gie9viaas136148 custom]#


4) log out of the device

Logging out of session [sid: 4, target: iqn.2010-01.com.solidfire:n9ai.gie9viaas136148-49-4.1970, portal: 169.78.5.10,3260]
Logout of [sid: 4, target: iqn.2010-01.com.solidfire:n9ai.gie9viaas136148-49-4.1970, portal: 169.78.5.10,3260] successful.

[root@gie9viaas136148 custom]# ls /dev/sdp
ls: cannot access '/dev/sdp': No such file or directory

[root@gie9viaas136148 custom]# dmsetup info /dev/vg_failover
Device vg_failover not found
Command failed.
[root@gie9viaas136148 custom]# dmsetup info /dev/vg_failover/lv_failover
Name:              vg_failover-lv_failover
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-R0vbNuh7C3QBnTJqgV7Kqd395jZzIS1khjMPXqgbAmjsqMrxGVct26KsH1S9cXtY

[root@gie9viaas136148 custom]# dmsetup deps /dev/vg_failover/lv_failover
1 dependencies  : (8, 240)


5) check cluster status 

[root@gie9viaas136148 custom]# pcs status
Cluster name: NA_DEV_GIELAB
Cluster Summary:
  * Stack: corosync
  * Current DC: gie9viaas136149 (version 2.0.3-5.el8-4b1f869f0f) - partition with quorum
  * Last updated: Wed Dec  9 13:20:08 2020
  * Last change:  Wed Dec  9 13:13:49 2020 by root via cibadmin on gie9viaas136148
  * 2 nodes configured
  * 16 resource instances configured

Node List:
  * Online: [ gie9viaas136148 gie9viaas136149 ]

Full List of Resources:
  * kdump       (stonith:fence_kdump):  Started gie9viaas136149
  * scsi        (stonith:fence_scsi):   Started gie9viaas136149
  * Resource Group: failovergroup:
    * vg_failover       (ocf::heartbeat:LVM-activate):  Started gie9viaas136148
    * www_fs    (ocf::heartbeat:Filesystem):    Started gie9viaas136148
    * VirtualIP (ocf::heartbeat:IPaddr2):       Started gie9viaas136148
    * Website   (ocf::heartbeat:apache):        Started gie9viaas136148
  * Clone Set: clvm-clone [clvm]:
    * Started: [ gie9viaas136148 gie9viaas136149 ]
  * Clone Set: sharedgroup-clone [sharedgroup]:
    * Started: [ gie9viaas136148 gie9viaas136149 ]
  * Resource Group: mirrorgroup:
    * vg_mirror (ocf::heartbeat:LVM-activate):  Started gie9viaas136148
    * mirror_fs (ocf::heartbeat:Filesystem):    Started gie9viaas136148

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@gie9viaas136148 custom]# pcs status
Cluster name: NA_DEV_GIELAB
Cluster Summary:
  * Stack: corosync
  * Current DC: gie9viaas136149 (version 2.0.3-5.el8-4b1f869f0f) - partition with quorum
  * Last updated: Wed Dec  9 13:20:10 2020

---- No failure 

7) run debug monitor

 pcs resource debug-monitor vg_failover --full
Operation monitor for vg_failover (ocf:heartbeat:LVM-activate) returned: 'ok' (0)
 >  stderr: +++ 13:22:05: ocf_start_trace:1000: echo
 >  stderr: +++ 13:22:05: ocf_start_trace:1000: printenv
 >  stderr: +++ 13:22:05: ocf_start_trace:1000: sort
 >  stderr: ++ 13:22:05: ocf_start_trace:1000: env='
 >  stderr: COBBLER_SERVER=10.104.194.32
 >  stderr: DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/0/bus
 >  stderr: HA_debug=1
 >  stderr: HA_logfacility=none
 >  stderr: HISTCONTROL=ignoredups
 >  stderr: HISTSIZE=1000
 >  stderr: HOME=/root
 >  stderr: HOSTNAME=gie9viaas136148.gielab.jpmchase.net
 >  stderr: LC_ALL=C
 >  stderr: LESSOPEN=||/usr/bin/lesspipe.sh %s
 >  stderr: LOGNAME=root
 >  stderr: LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:
 >  stderr: MAIL=/var/spool/mail/root
 >  stderr: OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
 >  stderr: OCF_RA_VERSION_MAJOR=1
 >  stderr: OCF_RA_VERSION_MINOR=0
 >  stderr: OCF_RESKEY_CRM_meta_class=ocf
 >  stderr: OCF_RESKEY_CRM_meta_id=vg_failover
 >  stderr: OCF_RESKEY_CRM_meta_migration_threshold=3
 >  stderr: OCF_RESKEY_CRM_meta_provider=heartbeat
 >  stderr: OCF_RESKEY_CRM_meta_resource_stickiness=100
 >  stderr: OCF_RESKEY_CRM_meta_timeout=90000
 >  stderr: OCF_RESKEY_CRM_meta_type=LVM-activate
 >  stderr: OCF_RESKEY_crm_feature_set=3.3.0
 >  stderr: OCF_RESKEY_vg_access_mode=system_id
 >  stderr: OCF_RESKEY_vgname=vg_failover
 >  stderr: OCF_RESOURCE_INSTANCE=vg_failover
 >  stderr: OCF_RESOURCE_PROVIDER=heartbeat
 >  stderr: OCF_RESOURCE_TYPE=LVM-activate
 >  stderr: OCF_ROOT=/usr/lib/ocf
 >  stderr: OCF_TRACE_FILE=/dev/stderr
 >  stderr: OCF_TRACE_RA=1
 >  stderr: OLDPWD=/usr/lib/ocf/resource.d
 >  stderr: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
 >  stderr: PCMK_logfacility=none
 >  stderr: PCMK_service=crm_resource
 >  stderr: PWD=/usr/lib/ocf/resource.d/custom
 >  stderr: SELINUX_LEVEL_REQUESTED=
 >  stderr: SELINUX_ROLE_REQUESTED=
 >  stderr: SELINUX_USE_CURRENT_RANGE=
 >  stderr: SHELL=/bin/bash
 >  stderr: SHLVL=2
 >  stderr: SSH_CLIENT=10.108.10.155 36766 22
 >  stderr: SSH_CONNECTION=10.108.10.155 36766 10.104.136.148 22
 >  stderr: SSH_TTY=/dev/pts/0
 >  stderr: S_COLORS=auto
 >  stderr: TERM=xterm
 >  stderr: USER=root
 >  stderr: XDG_RUNTIME_DIR=/run/user/0
 >  stderr: XDG_SESSION_ID=3
 >  stderr: _=/usr/bin/printenv
 >  stderr: __OCF_TRC_DEST=/dev/stderr
 >  stderr: __OCF_TRC_MANAGE='
 >  stderr: ++ 13:22:05: source:1054: ocf_is_true ''
 >  stderr: ++ 13:22:05: ocf_is_true:103: case "$1" in
 >  stderr: ++ 13:22:05: ocf_is_true:103: case "$1" in
 >  stderr: ++ 13:22:05: ocf_is_true:105: false
 >  stderr: + 13:22:05: main:46: OCF_RESKEY_partial_activation_default=false
 >  stderr: + 13:22:05: main:48: : false
 >  stderr: + 13:22:05: main:52: VG=vg_failover
 >  stderr: + 13:22:05: main:53: LV=
 >  stderr: + 13:22:05: main:63: VG_access_mode=system_id
 >  stderr: + 13:22:05: main:64: VG_access_mode_num=0
 >  stderr: + 13:22:05: main:68: LV_activation_mode=exclusive
 >  stderr: + 13:22:05: main:71: SYSTEM_ID=
 >  stderr: + 13:22:05: main:74: OUR_TAG=pacemaker
 >  stderr: + 13:22:05: main:873: case $__OCF_ACTION in
 >  stderr: + 13:22:05: main:883: lvm_status
 >  stderr: + 13:22:05: lvm_status:770: local dm_count
 >  stderr: + 13:22:05: lvm_status:772: '[' -n '' ']'
 >  stderr: ++ 13:22:05: lvm_status:777: dmsetup info --noheadings --noflush -c -S vg_name=vg_failover
 >  stderr: ++ 13:22:05: lvm_status:777: grep -c -v '^No devices found'
 >  stderr: + 13:22:05: lvm_status:777: dm_count=1
 >  stderr: + 13:22:05: lvm_status:780: '[' 1 -eq 0 ']'
 >  stderr: + 13:22:05: lvm_status:784: return 0
 >  stderr: + 13:22:05: main:899: rc=0
 >  stderr: + 13:22:05: main:901: ocf_log debug 'vg_failover monitor : 0'
 >  stderr: + 13:22:05: ocf_log:323: '[' 2 -lt 2 ']'
 >  stderr: + 13:22:05: ocf_log:327: __OCF_PRIO=debug
 >  stderr: + 13:22:05: ocf_log:328: shift
 >  stderr: + 13:22:05: ocf_log:329: __OCF_MSG='vg_failover monitor : 0'
 >  stderr: + 13:22:05: ocf_log:331: case "${__OCF_PRIO}" in
 >  stderr: + 13:22:05: ocf_log:336: __OCF_PRIO=DEBUG
 >  stderr: + 13:22:05: ocf_log:340: '[' DEBUG = DEBUG ']'
 >  stderr: + 13:22:05: ocf_log:341: ha_debug 'DEBUG: vg_failover monitor : 0'
 >  stderr: + 13:22:05: ha_debug:260: '[' x1 = x0 ']'
 >  stderr: + 13:22:05: ha_debug:260: '[' -z 1 ']'
 >  stderr: + 13:22:05: ha_debug:263: tty
 >  stderr: + 13:22:05: ha_debug:272: set_logtag
 >  stderr: + 13:22:05: set_logtag:177: '[' -z '' ']'
 >  stderr: + 13:22:05: set_logtag:178: '[' -n vg_failover ']'
 >  stderr: + 13:22:05: set_logtag:179: HA_LOGTAG='LVM-activate(vg_failover)[2207441]'
 >  stderr: + 13:22:05: ha_debug:274: '[' x = xyes ']'
 >  stderr: + 13:22:05: ha_debug:281: '[' none = '' ']'
 >  stderr: + 13:22:05: ha_debug:284: '[' -n '' ']'
 >  stderr: + 13:22:05: ha_debug:290: '[' -n /dev/null ']'
 >  stderr: + 13:22:05: ha_debug:292: : appending to /dev/null
 >  stderr: ++ 13:22:05: ha_debug:293: hadate
 >  stderr: ++ 13:22:05: hadate:173: date '+%b %d %T '
 >  stderr: + 13:22:05: ha_debug:293: echo 'LVM-activate(vg_failover)[2207441]: Dec' 09 13:22:05 'DEBUG: vg_failover monitor : 0'
 >  stderr: + 13:22:05: ha_debug:296: '[' -z '' -a -z /dev/null ']'
 >  stderr: + 13:22:05: main:902: exit 0
[root@gie9viaas136148 custom]#


8) do dd test; the FS is read-only but still available

[root@gie9viaas136148 custom]# dd if=/dev/zero of=/var/www/xxx oflag=direct count=1 bs=1
dd: failed to open '/var/www/xxx': Read-only file system


 dd if=/var/www/xxx of=/dev/null count=1 bs=1
0+0 records in
0+0 records out
0 bytes copied, 4.7114e-05 s, 0.0 kB/s

Comment 11 Milind 2020-12-09 15:17:57 UTC
I am not sure if this is a valid test because:
1) When the device is lost, the filesystem is still available and becomes read-only, so a read-only dd will always pass.
2) Even if for a second I assume this test will work, how will it behave if I deliberately mount an FS with read-only options?

Comment 12 Reid Wahl 2020-12-14 05:09:11 UTC
(In reply to Milind from comment #10)
> Hi Reid,
> I tested it, but it is not working for me.
> 
> Here are the test results:
> ...
> 2) update resource 
> 
> [root@gie9viaas136148 custom]# pcs resource update vg_failover op monitor
> timeout=90s interval=30s OCF_CHECK_LEVEL=10
> 
> [root@gie9viaas136148 custom]# pcs resource show vg_failover
> Warning: This command is deprecated and will be removed. Please use 'pcs
> resource config' instead.
>  Resource: vg_failover (class=ocf provider=heartbeat type=LVM-activate)
>   Attributes: vg_access_mode=system_id vgname=vg_failover
>   Operations: monitor interval=30s timeout=90s OCF_CHECK_LEVEL=10
> (vg_failover-monitor-interval-30s)
>               start interval=0s timeout=90s (vg_failover-start-interval-0s)
>               stop interval=0s timeout=90s (vg_failover-stop-interval-0s)
> [root@gie9viaas136148 custom]#

There's the problem. Step 3 was "Create a resource of class ocf:custom:LVM-activate with OCF_CHECK_LEVEL=10 for the monitor operation (e.g., `pcs resource create <rsc_name> ocf:custom:LVM-activate <options> op monitor timeout=90s interval=30s OCF_CHECK_LEVEL=10`)."

Instead, you updated a resource of class ocf:heartbeat:LVM-activate. The `pcs resource update` command doesn't convert vg_failover into an ocf:custom:LVM-activate resource.

So it's still using the old script.

-----

(In reply to Milind from comment #11)
> I am not sure if this is a valid test because:
> 1) When the device is lost, the filesystem is still available and becomes
> read-only, so a read-only dd will always pass.
> 2) Even if for a second I assume this test will work, how will it behave if
> I deliberately mount an FS with read-only options?

The filesystem is not involved in this test at all. The test performs a read operation directly from the logical volume.
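
For reference, the OCF_CHECK_LEVEL=10 read test in the patched agent boils down to roughly the following (paraphrased from the lvm_status traces further down in this bug, not copied verbatim from the commit):

~~~
# Standalone paraphrase of the deeper check seen in the traces below:
# read one byte straight from an LV device node, bypassing any
# filesystem.  A failed read would map to "not running".
VG=vg_failover
dm_name="/dev/${VG}/$(ls -1 "/dev/${VG}" | head -n 1)"

if ! dd if="$dm_name" of=/dev/null bs=1 count=1 >/dev/null 2>&1; then
        echo "$VG: read test failed (would map to OCF_NOT_RUNNING)"
else
        echo "$VG: read test passed (would map to OCF_SUCCESS)"
fi
~~~

As the tests in the next comment show, even this read can still return success in some missing-PV scenarios, which is why the discussion continues.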

Comment 13 Milind 2020-12-18 23:20:56 UTC
Hi Reid,
I fixed the configuration and created a new VG and LVs with the correct agent:
  pcs resource config vg_test
 Resource: vg_test (class=ocf provider=custom type=LVM-activate)
  Attributes: vg_access_mode=system_id vgname=vg_test
  Operations: monitor interval=30s timeout=90s OCF_CHECK_LEVEL=10 (vg_test-monitor-interval-30s)
              start interval=0s timeout=90s (vg_test-start-interval-0s)
              stop interval=0s timeout=90s (vg_test-stop-interval-0s)
[root@gie9viaas136149 custom]#


I modified the code a little bit to randomly pick a volume under check level 10:

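                        # Pick one of the VG's LVs at random rather than always the first one: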
                        dm_num=$( shuf -i 1-${dm_count} -n 1 -r)
                        echo ${dm_num}
                        dm_name="/dev/${VG}/$(ls -1 /dev/${VG} | head -n 1)"
                        dm_name_r="/dev/${VG}/$(ls -1 /dev/${VG} | head -n ${dm_num} | tail -n 1)"
                        echo ${dm_name_r}

                        # read 1 byte to check the dev is alive
                        #dd if=${dm_name} of=/dev/null bs=1 count=1 >/dev/null \
                        dd if=${dm_name_r} of=/dev/null bs=1 count=1 >/dev/null \
                                2>&1
                        if [ $? -ne 0 ]; then
                                return $OCF_NOT_RUNNING
                        else
                                return $OCF_SUCCESS
                        fi
                        ;;

Check the status of the resources (run outside the agent, not in the code) for testing:

[root@gie9viaas136149 custom]# lvs --noheading -olv_name /dev/vg_test
  lv_test1
  lv_test2
  lv_test3
  lv_test4
[root@gie9viaas136149 custom]# lvs --noheading -olv_name,devices /dev/vg_test
  lv_test1 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af(0)
  lv_test2 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af(100)
  lv_test3 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0(0)
  lv_test4 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b1(0)
  lv_test4 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0(100)
  lv_test4 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af(200)

[root@gie9viaas136149 custom]# pvs --noheading -opv_name,lv_name,vg_name
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000005a9
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000005aa
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000005ab
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af lv_test1         vg_test
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af lv_test2         vg_test
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af lv_test4         vg_test
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0 lv_test3         vg_test
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0 lv_test4         vg_test
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b1 lv_test4         vg_test


Do dd test 

# ls -1 /dev/vg_test
lv_test1
lv_test2
lv_test3
lv_test4
[root@gie9viaas136149 custom]# dd if=/dev/vg_test/lv_test1 bs=1 count=1
1+0 records in
1+0 records out
1 byte copied, 2.8183e-05 s, 35.5 kB/s
[root@gie9viaas136149 custom]# dd if=/dev/vg_test/lv_test2 bs=1 count=1
1+0 records in
1+0 records out
1 byte copied, 2.7589e-05 s, 36.2 kB/s
[root@gie9viaas136149 custom]# dd if=/dev/vg_test/lv_test3 bs=1 count=1
1+0 records in
1+0 records out
1 byte copied, 2.6817e-05 s, 37.3 kB/s
[root@gie9viaas136149 custom]# dd if=/dev/vg_test/lv_test4 bs=1 count=1
1+0 records in
1+0 records out
1 byte copied, 3.6667e-05 s, 27.3 kB/s

Log out of the LUN; pvs and lvs show the LUN is gone:

iscsiadm -m node -T iqn.2010-01.com.solidfire:n9ai.gie9viaas136148-49-1.1967 -P 169.78.5.10 -u
Logging out of session [sid: 1, target: iqn.2010-01.com.solidfire:n9ai.gie9viaas136148-49-1.1967, portal: 169.78.5.10,3260]
Logout of [sid: 1, target: iqn.2010-01.com.solidfire:n9ai.gie9viaas136148-49-1.1967, portal: 169.78.5.10,3260] successful.
[root@gie9viaas136149 custom]# lvs --noheading -olv_name,devices /dev/vg_test
  WARNING: Couldn't find device with uuid cEEzpV-Cyca-R2nk-IynY-eaI6-gMxz-SEkobd.
  WARNING: VG vg_test is missing PV cEEzpV-Cyca-R2nk-IynY-eaI6-gMxz-SEkobd (last written to /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af).
  lv_test1 [unknown](0)
  lv_test2 [unknown](100)
  lv_test3 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0(0)
  lv_test4 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b1(0)
  lv_test4 /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0(100)
  lv_test4 [unknown](200)
[root@gie9viaas136149 custom]# pvs --noheading -opv_name,lv_name,vg_name
  WARNING: Couldn't find device with uuid cEEzpV-Cyca-R2nk-IynY-eaI6-gMxz-SEkobd.
  WARNING: VG vg_test is missing PV cEEzpV-Cyca-R2nk-IynY-eaI6-gMxz-SEkobd (last written to /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007af).
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000005a9
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000005aa
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000005ab
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0 lv_test3         vg_test
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b0 lv_test4         vg_test
  /dev/disk/by-id/wwn-0x6f47acc1000000006e396169000007b1 lv_test4         vg_test

Run the check; the check is still passing and the resource is still up:


[root@gie9viaas136149 custom]#  OCF_CHECK_LEVEL=10 pcs resource debug-monitor vg_test  --full | grep -A4 -B4 dd
 >  stderr: ++ 22:21:58: lvm_status:792: ls -1 /dev/vg_test
 >  stderr: ++ 22:21:58: lvm_status:792: tail -n 1
 >  stderr: + 22:21:58: lvm_status:792: dm_name_r=/dev/vg_test/lv_test1
 >  stderr: + 22:21:58: lvm_status:793: echo /dev/vg_test/lv_test1
 >  stderr: + 22:21:58: lvm_status:797: dd if=/dev/vg_test/lv_test1 of=/dev/null bs=1 count=1
 >  stderr: + 22:21:58: lvm_status:799: '[' 0 -ne 0 ']'
 >  stderr: + 22:21:58: lvm_status:802: return 0
 >  stderr: + 22:21:58: main:941: rc=0
 >  stderr: + 22:21:58: main:943: ocf_log debug 'vg_test monitor : 0'
[root@gie9viaas136149 custom]#  OCF_CHECK_LEVEL=10 pcs resource debug-monitor vg_test  --full | grep -A4 -B4 dd
 >  stderr: ++ 22:22:01: lvm_status:792: tail -n 1
 >  stderr: ++ 22:22:01: lvm_status:792: head -n 2
 >  stderr: + 22:22:01: lvm_status:792: dm_name_r=/dev/vg_test/lv_test2
 >  stderr: + 22:22:01: lvm_status:793: echo /dev/vg_test/lv_test2
 >  stderr: + 22:22:01: lvm_status:797: dd if=/dev/vg_test/lv_test2 of=/dev/null bs=1 count=1
 >  stderr: + 22:22:01: lvm_status:799: '[' 0 -ne 0 ']'
 >  stderr: + 22:22:01: lvm_status:802: return 0
 >  stderr: + 22:22:01: main:941: rc=0
 >  stderr: + 22:22:01: main:943: ocf_log debug 'vg_test monitor : 0'
[root@gie9viaas136149 custom]#



Am I doing something wrong in the configuration? The group is still not failing over.

Comment 17 Milind 2021-04-13 11:58:12 UTC
What is the update on this bug?

Comment 23 Jonathan Earl Brassow 2021-05-04 13:23:53 UTC
hmmm,

I don't think LVM developers ruled out using LVM commands - they recommend not using them.  The reason is that they are slower and more prone to hangs than their simplified, more direct 'dmsetup' counterparts.  LVM commands need to read the labels from the disks and perform protective locking - dmsetup simply queries the kernel.  So ideally, you don't want to use a command that could get hung up on I/O or wait on a problem with another command (i.e. locking issues).

In /this/ case, however, you /want/ something to go to disk and check the label to see if the device is alive.  That's what you miss by using the dmsetup command.  You can still avoid using heavyweight LVM commands by choosing something else to read the device.

Corey is also right - not all failed devices are a reason to shut down an LVM resource.  RAID LVs are a good example.  Once a problem is identified, you are almost forced to use an LVM command to determine if an LV is affected - checking the 'health_status' field to determine if the device is 'partial', 'degraded', or good.

Comment 24 Jonathan Earl Brassow 2021-05-04 13:33:51 UTC
While I'm at it...  If you are doing a 'monitor', it would be nice if transient errors were also identified.  See 'lvs' man page:
       9  Volume Health, where there are currently three groups of attributes identified:

          Common ones for all Logical Volumes: (p)artial, (X) unknown.
          (p)artial signifies that one or more of the Physical Volumes this Logical Volume uses  is  missing  from  the
          system. (X) unknown signifies the status is unknown.

          Related to RAID Logical Volumes: (r)efresh needed, (m)ismatches exist, (w)ritemostly.
          (r)efresh  signifies  that  one  or more of the Physical Volumes this RAID Logical Volume uses had suffered a
          write error. The write error could be due to a temporary failure of that Physical  Volume  or  an  indication
          that it is failing.  The device should be refreshed or replaced. (m)ismatches signifies that the RAID logical
          volume has portions of the array that are not coherent.  Inconsistencies are detected by initiating a "check"
on a RAID logical volume.  (The scrubbing operations, "check" and "repair", can be performed on a RAID logical
volume via the 'lvchange' command.)  (w)ritemostly signifies the devices in a RAID 1 logical volume that
          have been marked write-mostly.  (R)emove after reshape signifies freed striped raid images to be removed.

          Related to Thin pool Logical Volumes: (F)ailed, out of (D)ata space, (M)etadata read only.
          (F)ailed  is  set  if thin pool encounters serious failures and hence no further I/O is permitted at all. The
          out of (D)ata space is set if thin pool has run out of data space. (M)etadata read only signifies  that  thin
          pool  encounters  certain  types  of  failures  but it's still possible to do reads at least, but no metadata
          changes are allowed.

          Related to Thin Logical Volumes: (F)ailed.
          (F)ailed is set when related thin pool enters Failed state and no further I/O is permitted at all.

... but be careful.  Don't just go adding LVM commands if there is a simpler, more responsive way.  (And a lot of this gets complex - we probably don't want to overdo it.)
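
Purely as an illustration of reading that ninth attribute character (with the same caveat: this is an lvs call, so it carries exactly the hang concerns discussed elsewhere in this bug):

~~~
# Illustration only: report per-LV volume health from the ninth lv_attr
# character described above.  lv_name and lv_attr are standard lvs
# report fields.
VG=vg_failover

lvs --noheadings -o lv_name,lv_attr "$VG" | while read -r lv attr; do
        health=$(printf '%s' "$attr" | cut -c9)
        case "$health" in
                p) echo "$VG/$lv: (p)artial -- one or more PVs missing" ;;
                X) echo "$VG/$lv: health unknown" ;;
                r|m|w|F|D|M) echo "$VG/$lv: health flag '$health' set" ;;
                *) : ;;   # '-' means healthy as far as this field goes
        esac
done
~~~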

Comment 25 Reid Wahl 2021-05-04 17:38:34 UTC
Thanks for all of your input! Responses below.

(In reply to Jonathan Earl Brassow from comment #23)
> hmmm,
> 
> I don't think LVM devels ruled out using LVM commands - they recommend not
> using them.

IIRC my main source for my statement is:

~~~
# Eric:
...
# It works, but we cannot afford to use LVM command in lvm_status. LVM command is expensive
# because it may potencially scan all disks on the system, update the metadata even using
# lvs/vgs when the metadata is somehow inconsistent.
~~~

Ref: https://github.com/ClusterLabs/resource-agents/blame/v4.8.0/heartbeat/LVM-activate#L757-L759



> In /this/ case, however, you /want/ something to go to disk and check the
> label to see if the device is alive.  That's what you miss by using the
> dmsetup command.  You still can avoid using heavy weighted LVM commands by
> choosing something else to read the device.

I'm wary of adding any LVM commands that would be run during the monitor operation, heavyweight or not.

You may recall that a volume_group_check_only option was added to the legacy LVM resource agent (the only supported agent on RHEL 7). When this option is enabled, it skips several LVM commands to avoid the possibility of a hang. It basically only tests for the /dev/<vg> directory.

However, there is a fatal flaw. It still runs `vgchange --version` to get the version and runs `vgs -o attr --noheadings $OCF_RESKEY_volgrpname` to check whether the clustered bit is set. I work in support, and I have seen both of these commands hang and cause resource timeouts in customer environments where volume_group_check_only was set to true in order to try to avoid issues like this.

I would expect that `vgchange --version` is a pretty lightweight command (although maybe I'm wrong). Yet even it has hung in our experience.

So my feeling is that introducing any LVM command into the LVM-activate status operation would be a regression in reliability.

I'm far from an LVM expert, so take everything I said with a grain of salt and know that I'll defer to you folks.



> Corey is also right - not all failed devices are a reason to shutdown an LVM
> resource.  RAID LVs are a good example.  Once a problem is identified, you
> are almost forced to use an LVM command to determine if an LV is affected -
> checking the 'health_status' field to determine if the device is 'partial',
> 'degraded', or good.

This is where I would (currently) use the existing partial_activation resource option. It might be both feasible and desirable to make these checks more granular (e.g., your transient error example also).

My reluctance to add LVM commands remains, however. We don't want the resource agent to hang on an LVM command and throw a timeout error when there's nothing actually wrong from the perspective of whatever is using the volumes.

Comment 31 RHEL Program Management 2023-09-22 19:00:44 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 32 RHEL Program Management 2023-09-22 19:02:41 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

