Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2126727

Summary:	HA testing caused failed fencing action
Product:	Red Hat OpenStack	Reporter:	Udi Shkalim <ushkalim>
Component:	osp-director-operator-container	Assignee:	Martin Schuppert <mschuppe>
Status:	CLOSED ERRATA	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	17.0 (Wallaby)	CC:	jschluet, mschuppe
Target Milestone:	---	Keywords:	AutomationBlocker, TestBlocker, Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	osp-director-operator-container-1.3.0-2	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-12-13 11:11:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Udi Shkalim 2022-09-14 10:55:46 UTC

Description of problem:

[root@controller-2 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: controller-2 (version 2.1.2-4.el9-ada5c3b36e2) - partition with quorum
  * Last updated: Wed Sep 14 10:54:46 2022
  * Last change:  Tue Sep 13 20:37:09 2022 by hacluster via crmd on controller-2
  * 9 nodes configured
  * 30 resource instances configured

Node List:
  * Online: [ controller-0 controller-1 controller-2 ]
  * GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 ]

Full List of Resources:
  * ip-172.22.0.110	(ocf:heartbeat:IPaddr2):	 Started controller-1
  * stonith-fence_kubevirt-02540200000e	(stonith:fence_kubevirt):	 Started controller-2
  * stonith-fence_kubevirt-025402000000	(stonith:fence_kubevirt):	 Started controller-1
  * ip-10.0.0.10	(ocf:heartbeat:IPaddr2):	 Started controller-2
  * ip-172.17.0.10	(ocf:heartbeat:IPaddr2):	 Started controller-1
  * ip-172.18.0.10	(ocf:heartbeat:IPaddr2):	 Started controller-2
  * ip-172.19.0.10	(ocf:heartbeat:IPaddr2):	 Started controller-1
  * Container bundle set: haproxy-bundle [cluster.common.tag/haproxy:pcmklatest]:
    * haproxy-bundle-podman-0	(ocf:heartbeat:podman):	 Started controller-0
    * haproxy-bundle-podman-1	(ocf:heartbeat:podman):	 Started controller-1
    * haproxy-bundle-podman-2	(ocf:heartbeat:podman):	 Started controller-2
  * Container bundle set: galera-bundle [cluster.common.tag/mariadb:pcmklatest]:
    * galera-bundle-0	(ocf:heartbeat:galera):	 Promoted controller-1
    * galera-bundle-1	(ocf:heartbeat:galera):	 Promoted controller-2
    * galera-bundle-2	(ocf:heartbeat:galera):	 Promoted controller-0
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0	(ocf:heartbeat:rabbitmq-cluster):	 Started controller-1
    * rabbitmq-bundle-1	(ocf:heartbeat:rabbitmq-cluster):	 Started controller-2
    * rabbitmq-bundle-2	(ocf:heartbeat:rabbitmq-cluster):	 Started controller-0
  * stonith-fence_kubevirt-025402000007	(stonith:fence_kubevirt):	 Started controller-2
  * Container bundle: openstack-cinder-volume [cluster.common.tag/cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0	(ocf:heartbeat:podman):	 Started controller-2

Failed Fencing Actions:
  * reboot of controller-0 failed (Fence agent did not complete in time): delegate=controller-2, client=pacemaker-controld.2577, origin=controller-2, last-failed='2022-09-13 21:33:01Z'
  * reboot of controller-1 failed (Fence agent did not complete in time): delegate=controller-2, client=pacemaker-controld.2577, origin=controller-2, last-failed='2022-09-13 20:59:42Z'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Version-Release number of selected component (if applicable):
rhos-release 17.0 -p RHOS-17.0-RHEL-9-20220909.n.0 -r 9.0


How reproducible:
50%

Steps to Reproduce:
1. Run ansible-sts test suite
2.
3.

Actual results:
Test should pass

Expected results:
test failed

Additional info:

Comment 3 Martin Schuppert 2022-09-22 06:37:13 UTC

Details on the issue which is now fixed in latest operator version:
Fencing can fail with current RunStrategy RerunOnFailure because fencing is not a graceful shutdown and kubevirt will jump in and restart immediately the VM. In such situation fence_kubevirt is waiting for the successful shutdown state of the VM before triggering the start and the fencing action will time out.

Comment 8 Udi Shkalim 2022-12-12 13:47:18 UTC

Verified

Comment 12 errata-xmlrpc 2022-12-13 11:11:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of containers for Red Hat OpenStack Platform 16.2.4 director operator), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8952