Bug 2126727
| Summary: | HA testing caused failed fencing action | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Udi Shkalim <ushkalim> |
| Component: | osp-director-operator-container | Assignee: | Martin Schuppert <mschuppe> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 17.0 (Wallaby) | CC: | jschluet, mschuppe |
| Target Milestone: | --- | Keywords: | AutomationBlocker, TestBlocker, Triaged |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | osp-director-operator-container-1.3.0-2 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-12-13 11:11:47 UTC | Type: | Bug |
Details on the issue, which is now fixed in the latest operator version: Fencing can fail with the current RunStrategy RerunOnFailure because fencing is not a graceful shutdown, so KubeVirt steps in and immediately restarts the VM. In that situation fence_kubevirt is still waiting for the VM to reach a successful shutdown state before triggering the start, and the fencing action times out.

Verified

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of containers for Red Hat OpenStack Platform 16.2.4 director operator), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8952
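The timing problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code of fence_kubevirt or KubeVirt: `ToyVM`, `fence_reboot`, the tick-based clock, and the poll interval are all hypothetical, and the `Manual` run strategy is used purely as a contrast case (the report does not say which setting the fixed operator version uses).

```python
from dataclasses import dataclass


@dataclass
class ToyVM:
    """Hypothetical stand-in for a KubeVirt VM; not a real API object."""
    run_strategy: str      # e.g. "RerunOnFailure" or "Manual"
    running: bool = True

    def hard_stop(self) -> None:
        # Fencing powers the VM off non-gracefully.
        self.running = False

    def start(self) -> None:
        self.running = True

    def tick(self) -> None:
        # Toy platform controller: under RerunOnFailure a non-graceful stop
        # is treated as a failure and the VM is restarted right away.
        if not self.running and self.run_strategy == "RerunOnFailure":
            self.running = True


def fence_reboot(vm: ToyVM, timeout: int = 20, poll_every: int = 2) -> str:
    """Toy reboot: power off, wait until 'off' is observed, then power on."""
    vm.hard_stop()
    for t in range(1, timeout + 1):
        vm.tick()                          # the platform reacts first
        if t % poll_every == 0 and not vm.running:
            vm.start()                     # 'off' was observed, finish the reboot
            return "ok"
    return "timed out"                     # 'off' was never observed


print(fence_reboot(ToyVM("RerunOnFailure")))
print(fence_reboot(ToyVM("Manual")))
```

In this model the agent never observes the powered-off state when the platform restarts the VM between polls, which mirrors the "Fence agent did not complete in time" failure reported in this bug.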
Description of problem:

    [root@controller-2 ~]# pcs status
    Cluster name: tripleo_cluster
    Cluster Summary:
      * Stack: corosync
      * Current DC: controller-2 (version 2.1.2-4.el9-ada5c3b36e2) - partition with quorum
      * Last updated: Wed Sep 14 10:54:46 2022
      * Last change: Tue Sep 13 20:37:09 2022 by hacluster via crmd on controller-2
      * 9 nodes configured
      * 30 resource instances configured

    Node List:
      * Online: [ controller-0 controller-1 controller-2 ]
      * GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 ]

    Full List of Resources:
      * ip-172.22.0.110 (ocf:heartbeat:IPaddr2): Started controller-1
      * stonith-fence_kubevirt-02540200000e (stonith:fence_kubevirt): Started controller-2
      * stonith-fence_kubevirt-025402000000 (stonith:fence_kubevirt): Started controller-1
      * ip-10.0.0.10 (ocf:heartbeat:IPaddr2): Started controller-2
      * ip-172.17.0.10 (ocf:heartbeat:IPaddr2): Started controller-1
      * ip-172.18.0.10 (ocf:heartbeat:IPaddr2): Started controller-2
      * ip-172.19.0.10 (ocf:heartbeat:IPaddr2): Started controller-1
      * Container bundle set: haproxy-bundle [cluster.common.tag/haproxy:pcmklatest]:
        * haproxy-bundle-podman-0 (ocf:heartbeat:podman): Started controller-0
        * haproxy-bundle-podman-1 (ocf:heartbeat:podman): Started controller-1
        * haproxy-bundle-podman-2 (ocf:heartbeat:podman): Started controller-2
      * Container bundle set: galera-bundle [cluster.common.tag/mariadb:pcmklatest]:
        * galera-bundle-0 (ocf:heartbeat:galera): Promoted controller-1
        * galera-bundle-1 (ocf:heartbeat:galera): Promoted controller-2
        * galera-bundle-2 (ocf:heartbeat:galera): Promoted controller-0
      * Container bundle set: rabbitmq-bundle [cluster.common.tag/rabbitmq:pcmklatest]:
        * rabbitmq-bundle-0 (ocf:heartbeat:rabbitmq-cluster): Started controller-1
        * rabbitmq-bundle-1 (ocf:heartbeat:rabbitmq-cluster): Started controller-2
        * rabbitmq-bundle-2 (ocf:heartbeat:rabbitmq-cluster): Started controller-0
      * stonith-fence_kubevirt-025402000007 (stonith:fence_kubevirt): Started controller-2
      * Container bundle: openstack-cinder-volume [cluster.common.tag/cinder-volume:pcmklatest]:
        * openstack-cinder-volume-podman-0 (ocf:heartbeat:podman): Started controller-2

    Failed Fencing Actions:
      * reboot of controller-0 failed (Fence agent did not complete in time): delegate=controller-2, client=pacemaker-controld.2577, origin=controller-2, last-failed='2022-09-13 21:33:01Z'
      * reboot of controller-1 failed (Fence agent did not complete in time): delegate=controller-2, client=pacemaker-controld.2577, origin=controller-2, last-failed='2022-09-13 20:59:42Z'

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

Version-Release number of selected component (if applicable):
rhos-release 17.0 -p RHOS-17.0-RHEL-9-20220909.n.0 -r 9.0

How reproducible:
50%

Steps to Reproduce:
1. Run ansible-sts test suite

Actual results:
Test failed

Expected results:
Test should pass

Additional info:
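For automated HA runs, the "Failed Fencing Actions" entries in the `pcs status` output above could be picked out with a short script. The following is a minimal sketch, assuming the line format is exactly as it appears in this report; the regular expression and the `failed_fencing_actions` helper are hypothetical, and other Pacemaker versions may print these lines differently.

```python
import re

# Pattern derived only from the two failure lines quoted in this report.
FAILED_RE = re.compile(
    r"\* (?P<action>\w+) of (?P<node>\S+) failed \((?P<reason>[^)]*)\): "
    r"delegate=(?P<delegate>[^,]+), client=(?P<client>[^,]+), "
    r"origin=(?P<origin>[^,]+), last-failed='(?P<when>[^']+)'"
)


def failed_fencing_actions(pcs_status_text: str) -> list[dict[str, str]]:
    """Return one dict per failed fencing action found in the text."""
    return [m.groupdict() for m in FAILED_RE.finditer(pcs_status_text)]


# Sample input copied from the pcs status output in this report.
sample = """Failed Fencing Actions:
  * reboot of controller-0 failed (Fence agent did not complete in time): delegate=controller-2, client=pacemaker-controld.2577, origin=controller-2, last-failed='2022-09-13 21:33:01Z'
  * reboot of controller-1 failed (Fence agent did not complete in time): delegate=controller-2, client=pacemaker-controld.2577, origin=controller-2, last-failed='2022-09-13 20:59:42Z'
"""

for entry in failed_fencing_actions(sample):
    print(entry["node"], entry["action"], entry["reason"], entry["when"])
```

A test harness could fail the run whenever this list is non-empty after a fencing scenario.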