Bug 2214350
| Summary: | HA LVM raid resource with systemid mode is unable to fail over with a device failure | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED DUPLICATE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.3 | CC: | agk, cluster-maint, fdinitto, teigland |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-07-12 09:49:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
It looks like this should be used as the RHEL9 equivalent for bug 2066156 (for the LVM-activate feature). We already have RHEL9 bug 2098182 for the vgchange --majoritypvs feature.

*** This bug has been marked as a duplicate of bug 2174911 ***
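The `vgchange --majoritypvs` feature referenced above is what would let a node take over a `system_id`-protected VG even when a PV is missing, which is exactly the failover scenario in this report. Below is a minimal sketch of the assumed decision rule ("more than half of the VG's PVs must be visible"); the variable names are illustrative and the rule itself is an assumption about the feature's semantics, not the actual lvm2 implementation:

```shell
#!/bin/sh
# Hedged sketch: assumed majority rule behind `vgchange --majoritypvs`.
# In the reproducer below, the VG has 6 PVs and one (/dev/sdf1) is offline.
pvs_total=6
pvs_present=5

# Takeover is permitted only when strictly more than half the PVs are visible.
if [ $((pvs_present * 2)) -gt "$pvs_total" ]; then
    decision="takeover allowed"
else
    decision="refuse takeover"
fi
echo "$decision"
```

With 5 of 6 PVs still present the majority holds, so under this rule the failover in this report would be allowed to proceed.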
Description of problem:

```
[root@virt-494 ~]# grep system_id_source /etc/lvm/lvm.conf
system_id_source = uname   # edited by QA Fri Jun 9 18:51:05 2023
```

Creating single VG STSRHTS23945 out of /dev/sdf1 /dev/sde1 /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sdc1

```
virt-495: lvmdevices --adddev /dev/sdf1
virt-495: lvmdevices --adddev /dev/sde1
virt-495: lvmdevices --adddev /dev/sda1
virt-495: lvmdevices --adddev /dev/sdb1
virt-495: lvmdevices --adddev /dev/sdd1
virt-495: lvmdevices --adddev /dev/sdc1
```

Creating HA raid1 LV(s) and ext4 filesystems on VG STSRHTS23945

```
lvcreate --yes --activate y --type raid1 --nosync -L 8G -n lv1 STSRHTS23945
```

Verify STSRHTS23945/lv1 systemid: virt-494.cluster-qe.lab.eng.brq.redhat.com

Creating ext4 filesystem

```
mkfs.ext4 /dev/STSRHTS23945/lv1
mke2fs 1.46.5 (30-Dec-2021)

pcs resource create STSRHTS23945 --group HA_STSRHTS23945 ocf:heartbeat:LVM-activate vgname="STSRHTS23945" activation_mode=exclusive vg_access_mode=system_id
pcs resource create fs1 --group HA_STSRHTS23945 ocf:heartbeat:Filesystem device="/dev/STSRHTS23945/lv1" directory="/mnt/fs1" fstype="ext4" "options=noatime" op monitor interval=10s
```

Running cleanup to fix any potential timing issues during setup

```
pcs resource cleanup
Cleaned up all resources on all nodes
```

Checking status of resources on all nodes
Filesystem:fs1 on LVM-activate:STSRHTS23945 (in group: HA_STSRHTS23945)
Current owner for fs1 is virt-494

Enabling automatic startup

```
pcs cluster enable --all
virt-494: Cluster Enabled
virt-495: Cluster Enabled
```

```
[root@virt-494 ~]# pcs status
Cluster name: STSRHTS23945
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: virt-495 (version 2.1.6-2.el9-6fdc9deea29) - partition with quorum
  * Last updated: Mon Jun 12 18:20:52 2023 on virt-494
  * Last change: Mon Jun 12 18:18:08 2023 by hacluster via crmd on virt-495
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ virt-494 virt-495 ]

Full List of Resources:
  * fence-virt-494    (stonith:fence_xvm):             Started virt-495
  * fence-virt-495    (stonith:fence_xvm):             Started virt-495
  * Resource Group: HA_STSRHTS23945:
    * STSRHTS23945    (ocf:heartbeat:LVM-activate):    Started virt-494
    * fs1             (ocf:heartbeat:Filesystem):      Started virt-494

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-494 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/STSRHTS23945-lv1  7.8G   24K  7.4G   1% /mnt/fs1

[root@virt-494 ~]# lvs -a -o +devices
  LV             VG           Attr       LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
  lv1            STSRHTS23945 Rwi-aor--- 8.00g                                  100.00           lv1_rimage_0(0),lv1_rimage_1(0)
  [lv1_rimage_0] STSRHTS23945 iwi-aor--- 8.00g                                                   /dev/sdc1(1)
  [lv1_rimage_1] STSRHTS23945 iwi-aor--- 8.00g                                                   /dev/sdf1(1)
  [lv1_rmeta_0]  STSRHTS23945 ewi-aor--- 4.00m                                                   /dev/sdc1(0)
  [lv1_rmeta_1]  STSRHTS23945 ewi-aor--- 4.00m                                                   /dev/sdf1(0)

[root@virt-494 ~]# vgs -a -o +vg_systemid
  VG           #PV #LV #SN Attr   VSize   VFree    System ID
  STSRHTS23945   6   1   0 wz--n- 449.95g <433.95g virt-494.cluster-qe.lab.eng.brq.redhat.com
```

Fail a device under one raid image and forcibly reboot the VG's current owner:

```
# virt-495
[root@virt-495 ~]# echo offline > /sys/block/sdf/device/state

# virt-494
[root@virt-494 ~]# reboot -fin
Rebooting.
```

```
[root@virt-495 ~]# pcs status
Cluster name: STSRHTS23945
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: virt-495 (version 2.1.6-2.el9-6fdc9deea29) - partition with quorum
  * Last updated: Mon Jun 12 18:27:07 2023 on virt-495
  * Last change: Mon Jun 12 18:18:08 2023 by hacluster via crmd on virt-495
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ virt-495 ]
  * OFFLINE: [ virt-494 ]

Full List of Resources:
  * fence-virt-494    (stonith:fence_xvm):             Started virt-495
  * fence-virt-495    (stonith:fence_xvm):             Started virt-495
  * Resource Group: HA_STSRHTS23945:
    * STSRHTS23945    (ocf:heartbeat:LVM-activate):    Stopped
    * fs1             (ocf:heartbeat:Filesystem):      Stopped

Failed Resource Actions:
  * STSRHTS23945 start on virt-495 returned 'error' (STSRHTS23945: failed to activate.) at Mon Jun 12 18:26:28 2023 after 502ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```

Log from virt-495 during the attempted failover:

```
Jun 12 18:26:20 virt-495 pacemaker-attrd[272313]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-attrd[272313]: notice: Removing all virt-494 attributes for peer loss
Jun 12 18:26:20 virt-495 pacemaker-attrd[272313]: notice: Purged 1 peer with id=1 and/or uname=virt-494 from the membership cache
Jun 12 18:26:20 virt-495 pacemaker-fenced[272311]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-fenced[272311]: notice: Purged 1 peer with id=1 and/or uname=virt-494 from the membership cache
Jun 12 18:26:20 virt-495 pacemaker-based[272310]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-based[272310]: notice: Purged 1 peer with id=1 and/or uname=virt-494 from the membership cache
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: warning: Stonith/shutdown of node virt-494 was not expected
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: notice: Node virt-494 state is now lost
Jun 12 18:26:20 virt-495 pacemaker-controld[272315]: warning: Stonith/shutdown of node virt-494 was not expected
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: Cluster node virt-494 will be fenced: peer is no longer part of the cluster
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: virt-494 is unclean
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: STSRHTS23945_stop_0 on virt-494 is unrunnable (node is offline)
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: fs1_stop_0 on virt-494 is unrunnable (node is offline)
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: Scheduling node virt-494 for fencing
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Fence (reboot) virt-494 'peer is no longer part of the cluster'
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Move STSRHTS23945 ( virt-494 -> virt-495 )
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Move fs1 ( virt-494 -> virt-495 )
Jun 12 18:26:21 virt-495 pacemaker-schedulerd[272314]: warning: Calculated transition 298 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1.bz2
Jun 12 18:26:21 virt-495 pacemaker-controld[272315]: notice: Requesting fencing (reboot) targeting node virt-494
Jun 12 18:26:21 virt-495 pacemaker-fenced[272311]: notice: Client pacemaker-controld.272315 wants to fence (reboot) virt-494 using any device
Jun 12 18:26:21 virt-495 pacemaker-fenced[272311]: notice: Requesting peer fencing (reboot) targeting virt-494
Jun 12 18:26:21 virt-495 pacemaker-fenced[272311]: notice: Requesting that virt-495 perform 'reboot' action targeting virt-494
Jun 12 18:26:28 virt-495 fence_xvm[506038]: Domain "virt-494.cluster-qe.lab.eng.brq.redhat.com" is ON
Jun 12 18:26:28 virt-495 pacemaker-fenced[272311]: notice: Operation 'reboot' [506038] targeting virt-494 using fence-virt-494 returned 0
Jun 12 18:26:28 virt-495 pacemaker-fenced[272311]: notice: Operation 'reboot' targeting virt-494 by virt-495 for pacemaker-controld.272315@virt-495: OK (complete)
Jun 12 18:26:28 virt-495 pacemaker-controld[272315]: notice: Initiating start operation STSRHTS23945_start_0 locally on virt-495
Jun 12 18:26:28 virt-495 pacemaker-controld[272315]: notice: Peer virt-494 was terminated (reboot) by virt-495 on behalf of pacemaker-controld.272315@virt-495: OK
Jun 12 18:26:28 virt-495 pacemaker-controld[272315]: notice: Requesting local execution of start operation for STSRHTS23945 on virt-495
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506047]: INFO: Activating STSRHTS23945
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506047]: ERROR: Cannot access VG STSRHTS23945 with system ID virt-494.cluster-qe.lab.eng.brq.redhat.com with local system ID virt-495.cluster-qe.lab.eng.brq.redhat.com.
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506047]: ERROR: STSRHTS23945: failed to activate.
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Result of start operation for STSRHTS23945 on virt-495: error (STSRHTS23945: failed to activate.)
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: STSRHTS23945_start_0@virt-495 output [ /usr/lib/ocf/resource.d/heartbeat/LVM-activate: line 556: [: -gt: unary operator expected\n WARNING: VG STSRHTS23945 is missing PV 6GZKn9-T1F0-zr19-5nrk-76Ji-5MGl-MepO2W (last written to /dev/sdf1).\n C...
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 298 aborted by operation STSRHTS23945_start_0 'modify' on virt-495: Event failed
Jun 12 18:26:29 virt-495 pacemaker-attrd[272313]: notice: Setting last-failure-STSRHTS23945#start_0[virt-495] in instance_attributes: (unset) -> 1686587189
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 298 action 9 (STSRHTS23945_start_0 on virt-495): expected 'ok' but got 'error'
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 298 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=4, Source=/var/lib/pacemaker/pengine/pe-warn-1.bz2): Complete
Jun 12 18:26:29 virt-495 pacemaker-attrd[272313]: notice: Setting fail-count-STSRHTS23945#start_0[virt-495] in instance_attributes: (unset) -> INFINITY
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Recover STSRHTS23945 ( virt-495 )
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Start fs1 ( virt-495 )
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Calculated transition 299, saving inputs in /var/lib/pacemaker/pengine/pe-input-12.bz2
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: Unexpected result (error: STSRHTS23945: failed to activate.) was recorded for start of STSRHTS23945 on virt-495 at Jun 12 18:26:28 2023
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: warning: STSRHTS23945 cannot run on virt-495 due to reaching migration threshold (clean up resource to allow again)
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Actions: Stop STSRHTS23945 ( virt-495 ) due to node availability
Jun 12 18:26:29 virt-495 pacemaker-schedulerd[272314]: notice: Calculated transition 300, saving inputs in /var/lib/pacemaker/pengine/pe-input-13.bz2
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Initiating stop operation STSRHTS23945_stop_0 locally on virt-495
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Requesting local execution of stop operation for STSRHTS23945 on virt-495
Jun 12 18:26:29 virt-495 LVM-activate(STSRHTS23945)[506123]: INFO: STSRHTS23945: has already been deactivated.
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Result of stop operation for STSRHTS23945 on virt-495: ok
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: Transition 300 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-13.bz2): Complete
Jun 12 18:26:29 virt-495 pacemaker-controld[272315]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jun 12 18:27:36 virt-495 corosync[272295]: [KNET  ] rx: host: 1 link: 0 is up
Jun 12 18:27:36 virt-495 corosync[272295]: [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jun 12 18:27:36 virt-495 corosync[272295]: [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 12 18:27:36 virt-495 corosync[272295]: [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 12 18:27:37 virt-495 corosync[272295]: [QUORUM] Sync members[2]: 1 2
Jun 12 18:27:37 virt-495 corosync[272295]: [QUORUM] Sync joined[1]: 1
Jun 12 18:27:37 virt-495 corosync[272295]: [TOTEM ] A new membership (1.1b) was formed. Members joined: 1
Jun 12 18:27:37 virt-495 corosync[272295]: [QUORUM] Members[2]: 1 2
Jun 12 18:27:37 virt-495 corosync[272295]: [MAIN  ] Completed service synchronization, ready to provide service.
Jun 12 18:27:37 virt-495 pacemaker-controld[272315]: notice: Node virt-494 state is now member
Jun 12 18:27:40 virt-495 pacemaker-fenced[272311]: notice: Node virt-494 state is now member
```

A normal relocate works once the cluster is back to good health:

```
[root@virt-494 ~]# pcs resource move HA_STSRHTS23945 virt-495
Location constraint to move resource 'HA_STSRHTS23945' has been created
Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'HA_STSRHTS23945' has been removed
Waiting for the cluster to apply configuration changes...
resource 'HA_STSRHTS23945' is running on node 'virt-495'
```

Version-Release number of selected component (if applicable):

```
kernel-5.14.0-322.el9                    BUILT: Fri Jun  2 10:00:53 AM CEST 2023
lvm2-2.03.21-1.el9                       BUILT: Fri Apr 21 02:33:33 PM CEST 2023
lvm2-libs-2.03.21-1.el9                  BUILT: Fri Apr 21 02:33:33 PM CEST 2023
resource-agents-4.10.0-38.el9.x86_64     BUILT: Mon 22 May 2023 02:11:38 PM CEST
```
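Besides the expected system-ID access refusal, the agent output in the log also shows a shell error: `/usr/lib/ocf/resource.d/heartbeat/LVM-activate: line 556: [: -gt: unary operator expected`. That message is the classic symptom of an unquoted variable that is empty inside a `[ ... -gt ... ]` test. A minimal sketch of the failure mode and the defensive form (the variable name is illustrative, not the agent's actual code):

```shell
#!/bin/sh
# Illustrative only: `count` stands in for whatever numeric value the agent
# tests around line 556; with a PV missing it is presumably empty.
count=""

# Broken form: with $count empty and unquoted, the test expands to
# `[ -gt 0 ]`, which yields "[: -gt: unary operator expected".
#   if [ $count -gt 0 ]; then ...

# Defensive form: quote the expansion and supply a numeric default,
# so the test stays well-formed even when the value is empty.
if [ "${count:-0}" -gt 0 ]; then
    state="have PVs"
else
    state="no PVs"
fi
echo "$state"
```

With the defensive form the empty value is treated as 0 and the test fails cleanly instead of erroring, which is likely what the agent intends here.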