Bug 2214350

Summary: HA LVM raid resource with systemid mode is unable to fail over with a device failure
Product: Red Hat Enterprise Linux 9
Component: resource-agents
Version: 9.3
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED DUPLICATE
Reporter: Corey Marthaler <cmarthal>
Assignee: Oyvind Albrigtsen <oalbrigt>
QA Contact: cluster-qe <cluster-qe>
CC: agk, cluster-maint, fdinitto, teigland
Target Milestone: rc
Type: Bug
Last Closed: 2023-07-12 09:49:47 UTC

Description Corey Marthaler 2023-06-12 17:56:21 UTC
Description of problem:
[root@virt-494 ~]# grep system_id_source /etc/lvm/lvm.conf 
        system_id_source = uname     # edited by QA Fri Jun  9 18:51:05 2023
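
With system_id_source = uname, LVM derives the local system ID from the node's hostname, and a VG created on a node is stamped with that ID so only the matching node can activate it. As a quick sanity check (illustrative, not part of the original run), each node can print its local ID:

        lvm systemid

On the owner this should report the same ID that shows up on the VG below: virt-494.cluster-qe.lab.eng.brq.redhat.com.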

Creating single VG STSRHTS23945 out of /dev/sdf1 /dev/sde1 /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sdc1
virt-495: lvmdevices --adddev /dev/sdf1
virt-495: lvmdevices --adddev /dev/sde1
virt-495: lvmdevices --adddev /dev/sda1
virt-495: lvmdevices --adddev /dev/sdb1
virt-495: lvmdevices --adddev /dev/sdd1
virt-495: lvmdevices --adddev /dev/sdc1
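
These lvmdevices calls add the PVs to the LVM devices file on the peer node virt-495 so it can see and later activate the VG; every node that may run the resource needs the same entries. As a hedged verification step (not in the original transcript), the devices file can be listed per node:

        virt-495: lvmdevices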

Creating HA raid1 LV(s) and ext4 filesystems on VG STSRHTS23945
        lvcreate --yes --activate y --type raid1 --nosync -L 8G -n lv1 STSRHTS23945
        Verify STSRHTS23945/lv1 systemid:  virt-494.cluster-qe.lab.eng.brq.redhat.com
Creating ext4 filesystem
        mkfs.ext4 /dev/STSRHTS23945/lv1
mke2fs 1.46.5 (30-Dec-2021)

pcs resource create STSRHTS23945 --group HA_STSRHTS23945 ocf:heartbeat:LVM-activate vgname="STSRHTS23945" activation_mode=exclusive vg_access_mode=system_id
pcs resource create fs1 --group HA_STSRHTS23945 ocf:heartbeat:Filesystem device="/dev/STSRHTS23945/lv1" directory="/mnt/fs1" fstype="ext4" "options=noatime" op monitor interval=10s
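
The LVM-activate resource and the filesystem on top of it are in one group, so they always start in order and fail over together. A hedged way to double-check the resulting configuration (not captured in the original transcript):

        pcs resource config HA_STSRHTS23945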

Running cleanup to fix any potential timing issues during setup
pcs resource cleanup
Cleaned up all resources on all nodes

Checking status of resources on all nodes
Filesystem:fs1 on LVM-activate:STSRHTS23945 (in group: HA_STSRHTS23945)
        Current owner for fs1 is virt-494

Enabling automatic startup
pcs cluster enable --all
virt-494: Cluster Enabled
virt-495: Cluster Enabled


[root@virt-494 ~]# pcs status
Cluster name: STSRHTS23945
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: virt-495 (version 2.1.6-2.el9-6fdc9deea29) - partition with quorum
  * Last updated: Mon Jun 12 18:20:52 2023 on virt-494
  * Last change:  Mon Jun 12 18:18:08 2023 by hacluster via crmd on virt-495
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ virt-494 virt-495 ]

Full List of Resources:
  * fence-virt-494      (stonith:fence_xvm):     Started virt-495
  * fence-virt-495      (stonith:fence_xvm):     Started virt-495
  * Resource Group: HA_STSRHTS23945:
    * STSRHTS23945      (ocf:heartbeat:LVM-activate):    Started virt-494
    * fs1       (ocf:heartbeat:Filesystem):      Started virt-494

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-494 ~]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/STSRHTS23945-lv1     7.8G   24K  7.4G   1% /mnt/fs1

[root@virt-494 ~]# lvs -a -o +devices
  LV             VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                        
  lv1            STSRHTS23945  Rwi-aor---   8.00g                                    100.00           lv1_rimage_0(0),lv1_rimage_1(0)
  [lv1_rimage_0] STSRHTS23945  iwi-aor---   8.00g                                                     /dev/sdc1(1)                   
  [lv1_rimage_1] STSRHTS23945  iwi-aor---   8.00g                                                     /dev/sdf1(1)                   
  [lv1_rmeta_0]  STSRHTS23945  ewi-aor---   4.00m                                                     /dev/sdc1(0)                   
  [lv1_rmeta_1]  STSRHTS23945  ewi-aor---   4.00m                                                     /dev/sdf1(0)                   

[root@virt-494 ~]# vgs -a -o +vg_systemid
  VG            #PV #LV #SN Attr   VSize   VFree    System ID                                 
  STSRHTS23945    6   1   0 wz--n- 449.95g <433.95g virt-494.cluster-qe.lab.eng.brq.redhat.com
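
The VG carries virt-494's system ID, so on failover the LVM-activate agent on the takeover node has to rewrite that ID to its own before it can activate anything. As a rough sketch only (the agent's actual implementation differs in detail, and it first has to grant itself access to the foreign ID, e.g. via an extra_system_ids override):

        vgchange --yes --systemid virt-495.cluster-qe.lab.eng.brq.redhat.com STSRHTS23945
        vgchange -ay STSRHTS23945

Changing a VG's system ID normally requires all of its PVs to be present, which is exactly what the device failure injected below takes away.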

# virt-495
[root@virt-495 ~]# echo offline > /sys/block/sdf/device/state
[root@virt-495 ~]# 

# virt-494
[root@virt-494 ~]# reboot -fin
Rebooting.
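
This injects a double failure: /dev/sdf, one leg of the raid1 LV and a PV of the VG, is offlined on the standby node, and the resource owner is then hard-rebooted, so Pacemaker must fence virt-494 and attempt a takeover on virt-495 with a PV missing. For later cleanup (an assumption about the recovery path, not shown in the original run), a SCSI device offlined this way can usually be revived with:

        echo running > /sys/block/sdf/device/state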

[root@virt-495 ~]# pcs status
Cluster name: STSRHTS23945
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: virt-495 (version 2.1.6-2.el9-6fdc9deea29) - partition with quorum
  * Last updated: Mon Jun 12 18:27:07 2023 on virt-495
  * Last change:  Mon Jun 12 18:18:08 2023 by hacluster via crmd on virt-495
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ virt-495 ]
  * OFFLINE: [ virt-494 ]

Full List of Resources:
  * fence-virt-494      (stonith:fence_xvm):     Started virt-495
  * fence-virt-495      (stonith:fence_xvm):     Started virt-495
  * Resource Group: HA_STSRHTS23945:
    * STSRHTS23945      (ocf:heartbeat:LVM-activate):    Stopped
    * fs1       (ocf:heartbeat:Filesystem):      Stopped

Failed Resource Actions:
  * STSRHTS23945 start on virt-495 returned 'error' (STSRHTS23945: failed to activate.) at Mon Jun 12 18:26:28 2023 after 502ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Jun 12 18:26:20 virt-495 pacemaker-attrd[272313]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-attrd[272313]: notice: Removing all virt-494 attributes for peer loss
Jun 12 18:26:20 virt-495 pacemaker-attrd[272313]: notice: Purged 1 peer with id=1 and/or uname=virt-494 from the membership cache
Jun 12 18:26:20 virt-495 pacemaker-fenced[272311]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-fenced[272311]: notice: Purged 1 peer with id=1 and/or uname=virt-494 from the membership cache
Jun 12 18:26:20 virt-495 pacemaker-based[272310]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-based[272310]: notice: Purged 1 peer with id=1 and/or uname=virt-494 from the membership cache
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: warning: Stonith/shutdown of node virt-494 was not expected
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: warning: Stonith/shutdown of node virt-494 was not expected
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: Cluster node virt-494 will be fenced: peer is no longer part of the cluster
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: virt-494 is unclean
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: STSRHTS23945_stop_0 on virt-494 is unrunnable (node is offline)
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: fs1_stop_0 on virt-494 is unrunnable (node is offline)
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: Scheduling node virt-494 for fencing
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Fence (reboot) virt-494 'peer is no longer part of the cluster'
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Move       STSRHTS23945       ( virt-494 -> virt-495 )
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Move       fs1                ( virt-494 -> virt-495 )
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: Calculated transition 298 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1.bz2
Jun 12 18:26:21 virt-495 pacemaker-controld[272315]: notice: Requesting fencing (reboot) targeting node virt-494
Jun 12 18:26:21 virt-495 pacemaker-fenced[272311]: notice: Client pacemaker-controld.272315 wants to fence (reboot) virt-494 using any device
Jun 12 18:26:21 virt-495 pacemaker-fenced[272311]: notice: Requesting peer fencing (reboot) targeting virt-494
Jun 12 18:26:21 virt-495 pacemaker-fenced[272311]: notice: Requesting that virt-495 perform 'reboot' action targeting virt-494
Jun 12 18:26:28 virt-495 fence_xvm[506038]: Domain "virt-494.cluster-qe.lab.eng.brq.redhat.com" is ON
Jun 12 18:26:28 virt-495 pacemaker-fenced[272311]: notice: Operation 'reboot' [506038] targeting virt-494 using fence-virt-494 returned 0
Jun 12 18:26:28 virt-495 pacemaker-fenced[272311]: notice: Operation 'reboot' targeting virt-494 by virt-495 for pacemaker-controld.272315@virt-495: OK (complete)
Jun 12 18:26:28 virt-495 pacemaker-controld[272315]: notice: Initiating start operation STSRHTS23945_start_0 locally on virt-495
Jun 12 18:26:28 virt-495 pacemaker-controld[272315]: notice: Peer virt-494 was terminated (reboot) by virt-495 on behalf of pacemaker-controld.272315@virt-495: OK
Jun 12 18:26:28 virt-495 pacemaker-controld[272315]: notice: Requesting local execution of start operation for STSRHTS23945 on virt-495
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506047]: INFO: Activating STSRHTS23945
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506047]: ERROR:  Cannot access VG STSRHTS23945 with system ID virt-494.cluster-qe.lab.eng.brq.redhat.com with local system ID virt-495.cluster-qe.lab.eng.brq.redhat.com.
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506047]: ERROR: STSRHTS23945: failed to activate.
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Result of start operation for STSRHTS23945 on virt-495: error (STSRHTS23945: failed to activate.)
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: STSRHTS23945_start_0@virt-495 output [ /usr/lib/ocf/resource.d/heartbeat/LVM-activate: line 556: [: -gt: unary operator expected\n  WARNING: VG STSRHTS23945 is missing PV 6GZKn9-T1F0-zr19-5nrk-76Ji-5MGl-MepO2W (last written to /dev/sdf1).\n  C...
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 298 aborted by operation STSRHTS23945_start_0 'modify' on virt-495: Event failed
Jun 12 18:26:29 virt-495 pacemaker-attrd[272313]: notice: Setting last-failure-STSRHTS23945#start_0[virt-495] in instance_attributes: (unset) -> 1686587189
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 298 action 9 (STSRHTS23945_start_0 on virt-495): expected 'ok' but got 'error'
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 298 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=4, Source=/var/lib/pacemaker/pengine/pe-warn-1.bz2): Complete
Jun 12 18:26:29 virt-495 pacemaker-attrd[272313]: notice: Setting fail-count-STSRHTS23945#start_0[virt-495] in instance_attributes: (unset) -> INFINITY
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Recover    STSRHTS23945       (             virt-495 )
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Start      fs1                (             virt-495 )
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Calculated transition 299, saving inputs in /var/lib/pacemaker/pengine/pe-input-12.bz2
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: STSRHTS23945 cannot run on virt-495 due to reaching migration threshold (clean up resource to allow again)
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Stop       STSRHTS23945       (             virt-495 )  due to node availability
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Calculated transition 300, saving inputs in /var/lib/pacemaker/pengine/pe-input-13.bz2
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Initiating stop operation STSRHTS23945_stop_0 locally on virt-495
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Requesting local execution of stop operation for STSRHTS23945 on virt-495
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506123]: INFO: STSRHTS23945: has already been deactivated.
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Result of stop operation for STSRHTS23945 on virt-495: ok
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 300 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-13.bz2): Complete
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jun 12 18:27:36 virt-495 corosync[272295]:  [KNET  ] rx: host: 1 link: 0 is up
Jun 12 18:27:36 virt-495 corosync[272295]:  [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jun 12 18:27:36 virt-495 corosync[272295]:  [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 12 18:27:36 virt-495 corosync[272295]:  [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 12 18:27:37 virt-495 corosync[272295]:  [QUORUM] Sync members[2]: 1 2
Jun 12 18:27:37 virt-495 corosync[272295]:  [QUORUM] Sync joined[1]: 1
Jun 12 18:27:37 virt-495 corosync[272295]:  [TOTEM ] A new membership (1.1b) was formed. Members joined: 1
Jun 12 18:27:37 virt-495 corosync[272295]:  [QUORUM] Members[2]: 1 2
Jun 12 18:27:37 virt-495 corosync[272295]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 12 18:27:37 virt-495 pacemaker-controld[272315]: notice: Node virt-494 state is now member
Jun 12 18:27:40 virt-495 pacemaker-fenced[272311]: notice: Node virt-494 state is now member
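
Two problems are stacked in the log above. First, the agent itself trips over a shell error at /usr/lib/ocf/resource.d/heartbeat/LVM-activate line 556 ("[: -gt: unary operator expected"), the classic symptom of an empty, unquoted variable in a numeric test; with /dev/sdf1 missing, whatever PV count the agent computed there came back empty. A minimal sketch of that failure pattern (not the agent's actual code):

        count=""                      # e.g. a pvs/vgs query that returned nothing
        [ $count -gt 0 ]              # -> [: -gt: unary operator expected
        [ "${count:-0}" -gt 0 ]       # quoting with a default avoids the error

Second, vgchange refuses to touch the VG because it still carries the dead node's system ID ("Cannot access VG STSRHTS23945 with system ID virt-494... with local system ID virt-495..."), so the takeover never gets as far as rewriting the ID, and the start fails on virt-495 until a resource cleanup.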

# Normal relocate once the cluster is back to good health

[root@virt-494 ~]#  pcs resource move HA_STSRHTS23945 virt-495
Location constraint to move resource 'HA_STSRHTS23945' has been created
Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'HA_STSRHTS23945' has been removed
Waiting for the cluster to apply configuration changes...
resource 'HA_STSRHTS23945' is running on node 'virt-495'

Version-Release number of selected component (if applicable):
kernel-5.14.0-322.el9    BUILT: Fri Jun  2 10:00:53 AM CEST 2023
lvm2-2.03.21-1.el9    BUILT: Fri Apr 21 02:33:33 PM CEST 2023
lvm2-libs-2.03.21-1.el9    BUILT: Fri Apr 21 02:33:33 PM CEST 2023
resource-agents-4.10.0-38.el9.x86_64  BUILT: Mon 22 May 2023 02:11:38 PM CEST

Comment 1 David Teigland 2023-06-12 19:14:53 UTC
It looks like this should be used as the RHEL9 equivalent of bug 2066156 (for the LVM-activate feature).

We already have RHEL9 bug 2098182 for the vgchange --majoritypvs feature.
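
A hedged sketch of the takeover that feature is meant to enable (syntax assumed from the description in bug 2098182; the real interface may differ): with only a majority of the VG's PVs visible, the surviving node could still rewrite the system ID and then activate the raid LV degraded:

        vgchange --majoritypvs --systemid virt-495.cluster-qe.lab.eng.brq.redhat.com STSRHTS23945
        vgchange -ay --activationmode degraded STSRHTS23945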

Comment 2 Oyvind Albrigtsen 2023-07-12 09:49:47 UTC

*** This bug has been marked as a duplicate of bug 2174911 ***