Bug 1943772
| Summary: | [16.2] Controller replacement can fail at step 5 in a composable IHA env | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Michele Baldessari <michele> |
| Component: | puppet-tripleo | Assignee: | Michele Baldessari <michele> |
| Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16.2 (Train) | CC: | dciabrin, jjoyce, jschluet, lmiccini, mburns, slinaber, tvignaud |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | puppet-tripleo-11.6.2-2.20210428172107.5c76ddc.el8ost.2 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-15 07:13:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I can reproduce this one quite often. The galera bundle being unmanaged is something done here: https://github.com/rhos-infra/cloud-config/blob/master/post_tasks/roles/replace-controller/tasks/manual_preparation.yml#L48 so I think we can ignore that for now. I am not sure yet why the restart fails during the deploy.

In a cluster with the following pending actions:

Transition Summary:
* Fence (on) compute-1 'required by compute-unfence-trigger:1 start'
* Fence (on) compute-0 'required by compute-unfence-trigger:0 start'
* Restart compute-unfence-trigger:0 ( compute-0 ) due to required stonith
* Restart compute-unfence-trigger:1 ( compute-1 ) due to required stonith

the following manual command still seems to work:

[root@controller-1 ~]# date; pcs resource restart --wait=300 openstack-cinder-volume ; date
Mon Mar 29 07:41:24 UTC 2021
openstack-cinder-volume successfully restarted
Mon Mar 29 07:41:58 UTC 2021

The error is:

2021-03-29 07:19:12.935 219900 DEBUG paunch [ ] b'Mon Mar 29 06:59:09 UTC 2021: Restarting openstack-cinder-volume globally. Stopping:\nMon Mar 29 07:09:11 UTC 2021: Restarting openstack-cinder-volume globally. Starting:\n'
2021-03-29 07:19:12.935 219900 DEBUG paunch [ ] b'time="2021-03-29T06:59:08Z" level=error msg="Error loading CNI config list file /etc/cni/net.d/87-podman-bridge.conflist: error parsing configuration list: no name"\nError: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 303: compute-unfence-trigger:1_start_0\ton compute-1\n\tAction 302: compute-unfence-trigger:1_stop_0\ton compute-1\n\tAction 301: compute-unfence-trigger:0_start_0\ton compute-0\n\tAction 300: compute-unfence-trigger:0_stop_0\ton compute-0\n\tAction 87: stonith-compute-0-on\ton compute-0\n\tAction 86: stonith-compute-1-on\ton compute-1\n\tAction 48: compute-unfence-trigger:1_monitor_10000\ton compute-1\n\tAction 47: compute-unfence-trigger:0_monitor_10000\ton compute-0\nError: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 304: compute-unfence-trigger:1_start_0\ton compute-1\n\tAction 303: compute-unfence-trigger:1_stop_0\ton compute-1\n\tAction 302: compute-unfence-trigger:0_start_0\ton compute-0\n\tAction 301: compute-unfence-trigger:0_stop_0\ton compute-0\n\tAction 88: stonith-compute-0-on\ton compute-0\n\tAction 87: stonith-compute-1-on\ton compute-1\n\tAction 48: compute-unfence-trigger:1_monitor_10000\ton compute-1\n\tAction 47: compute-unfence-trigger:0_monitor_10000\ton compute-0\n'
2021-03-29 07:19:12.936 219900 ERROR paunch [ ] Error running ['podman', 'run', '--name', 'cinder_volume_restart_bundle', '--label', 'config_id=tripleo_step5', '--label', 'container_name=cinder_volume_restart_bundle', '--label', 'managed_by=tripleo-ControllerOpenstack', '--label', 'config_data={"command": "/var/lib/container-config-scripts/pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started", "config_volume": "cinder", "detach": false, "environment": {"TRIPLEO_MINOR_UPDATE": "", "TRIPLEO_CONFIG_HASH": "67bb174a47812a3a4e05a13992175087"}, "image": "undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:16.2_20210323.1-hotfixupdate2", "ipc": "host", "net": "host", "start_order": 2, "user": "root", "volumes": ["/etc/hosts:/etc/hosts:ro", "/etc/localtime:/etc/localtime:ro", "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro", "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro", "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro", "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro", "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro", "/dev/log:/dev/log", "/etc/ipa/ca.crt:/etc/ipa/ca.crt:ro", "/var/lib/container-config-scripts:/var/lib/container-config-scripts:ro", "/dev/shm:/dev/shm:rw", "/etc/puppet:/etc/puppet:ro", "/var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro"]}', '--conmon-pidfile=/var/run/cinder_volume_restart_bundle.pid', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/cinder_volume_restart_bundle.log', '--env=TRIPLEO_CONFIG_HASH=67bb174a47812a3a4e05a13992175087', '--env=TRIPLEO_MINOR_UPDATE', '--net=host', '--ipc=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ipa/ca.crt:/etc/ipa/ca.crt:ro', '--volume=/var/lib/container-config-scripts:/var/lib/container-config-scripts:ro', '--volume=/dev/shm:/dev/shm:rw', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro', 'undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:16.2_20210323.1-hotfixupdate2', '/var/lib/container-config-scripts/pacemaker_restart_bundle.sh', 'cinder_volume', 'openstack-cinder-volume', 'openstack-cinder-volume', '_', 'Started']. [1]

But this also means that pcs-on-host has not yet landed on the 16.2 composes, which is also quite concerning.

As somewhat expected, this also happens with pcs on host:
2021-04-11 06:04:42.686199 | 5254001f-5135-53b1-9a16-00000000ccb2 | FATAL | Run pacemaker restart if the config file for the service changed | controller-1 | error={"changed": false, "error": "Failed running command", "msg": "Error running /var/lib/container-config-scripts/pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started. rc: 1, stdout: Sun Apr 11 05:44:39 UTC 2021: Restarting openstack-cinder-volume globally. Stopping:\nSun Apr 11 05:54:41 UTC 2021: Restarting openstack-cinder-volume globally. Starting:\n, stderr: Error: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 305: compute-unfence-trigger:3_monitor_10000\ton compute-1\n\tAction 304: compute-unfence-trigger:3_start_0\ton compute-1\n\tAction 303: compute-unfence-trigger:2_monitor_10000\ton compute-0\n\tAction 302: compute-unfence-trigger:2_start_0\ton compute-0\n\tAction 85: stonith-compute-0-on\ton compute-0\n\tAction 84: stonith-compute-1-on\ton compute-1\nError: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 307: compute-unfence-trigger:3_monitor_10000\ton compute-0\n\tAction 306: compute-unfence-trigger:3_start_0\ton compute-0\n\tAction 301: compute-unfence-trigger:0_start_0\ton compute-1\n\tAction 300: compute-unfence-trigger:0_stop_0\ton compute-1\n\tAction 87: stonith-compute-0-on\ton compute-0\n\tAction 86: stonith-compute-1-on\ton compute-1\n\tAction 47: compute-unfence-trigger:0_monitor_10000\ton compute-1\n"}
Overcloud configuration failed.
2021-04-11 06:04:42.686800 | 5254001f-5135-53b1-9a16-00000000ccb2 | TIMING | tripleo_ha_wrapper : Run pacemaker restart if the config file for the service changed | controller-1 | 1:02:23.064958 | 1204.44s
Running it by hand also fails, likely because there are a number of pending actions:
[root@controller-1 ~]# /var/lib/container-config-scripts/pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started
Sun Apr 11 08:14:17 UTC 2021: Restarting openstack-cinder-volume globally. Stopping:
Error: waiting timeout
crm_resource: Error performing operation: Timer expired
Pending actions:
Action 306: compute-unfence-trigger:3_monitor_10000 on compute-1
Action 305: compute-unfence-trigger:3_start_0 on compute-1
Action 300: compute-unfence-trigger:0_start_0 on compute-0
Action 299: compute-unfence-trigger:0_stop_0 on compute-0
Action 86: stonith-compute-0-on on compute-0
Action 85: stonith-compute-1-on on compute-1
Action 47: compute-unfence-trigger:0_monitor_10000 on compute-0
Sun Apr 11 08:24:20 UTC 2021: Restarting openstack-cinder-volume globally. Starting:
Oops, the paste got cut:
Sun Apr 11 08:24:20 UTC 2021: Restarting openstack-cinder-volume globally. Starting:
Error: waiting timeout
crm_resource: Error performing operation: Timer expired
Pending actions:
Action 307: compute-unfence-trigger:3_monitor_10000 on compute-0
Action 306: compute-unfence-trigger:3_start_0 on compute-0
Action 301: compute-unfence-trigger:0_start_0 on compute-1
Action 300: compute-unfence-trigger:0_stop_0 on compute-1
Action 87: stonith-compute-0-on on compute-0
Action 86: stonith-compute-1-on on compute-1
Action 47: compute-unfence-trigger:0_monitor_10000 on compute-1
[root@controller-1 ~]#
crm_simulate -Ls gives:
Transition Summary:
* Fence (on) compute-1 'required by compute-unfence-trigger:0 start'
* Fence (on) compute-0 'required by compute-unfence-trigger:3 start'
* Restart compute-unfence-trigger:0 ( compute-1 ) due to required stonith
* Start compute-unfence-trigger:3 ( compute-0 )
Even after doing a full "pcs resource cleanup" + "pcs stonith cleanup" (sketched after the summary below), we are still left with:
Transition Summary:
* Fence (on) compute-1 'required by compute-unfence-trigger:3 start'
* Fence (on) compute-0 'required by compute-unfence-trigger:2 start'
* Start compute-unfence-trigger:2 ( compute-0 )
* Start compute-unfence-trigger:3 ( compute-1 )
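For reference, the cleanup sequence referred to above is just the stock pcs commands (minimal sketch):

pcs resource cleanup   # clear failed resource actions and re-probe
pcs stonith cleanup    # clear failed fencing/unfencing history
crm_simulate -Ls       # re-check: the unfence transition is still pending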
Full status being:
[root@controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
* Stack: corosync
* Current DC: messaging-0 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
* Last updated: Sun Apr 11 08:37:40 2021
* Last change: Sun Apr 11 08:35:31 2021 by hacluster via crmd on messaging-1
* 25 nodes configured
* 89 resource instances configured
Node List:
* GuestNode galera-bundle-0@database-0: maintenance
* GuestNode galera-bundle-1@database-1: maintenance
* GuestNode galera-bundle-2@database-2: maintenance
* Online: [ controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
* RemoteOnline: [ compute-0 compute-1 compute-2 compute-3 ]
* GuestOnline: [ ovn-dbs-bundle-0@controller-2 ovn-dbs-bundle-1@controller-3 ovn-dbs-bundle-2@controller-1 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-3 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]
Full List of Resources:
* compute-0 (ocf::pacemaker:remote): Started controller-1
* compute-1 (ocf::pacemaker:remote): Started controller-2
* Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest] (unmanaged):
* galera-bundle-0 (ocf::heartbeat:galera): Master database-0 (unmanaged)
* galera-bundle-1 (ocf::heartbeat:galera): Master database-1 (unmanaged)
* galera-bundle-2 (ocf::heartbeat:galera): Master database-2 (unmanaged)
* Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
* rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started messaging-0
* rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started messaging-1
* rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started messaging-2
* ip-192.168.24.150 (ocf::heartbeat:IPaddr2): Started controller-3
* ip-10.0.0.150 (ocf::heartbeat:IPaddr2): Started controller-1
* ip-172.17.1.151 (ocf::heartbeat:IPaddr2): Started controller-2
* ip-172.17.1.150 (ocf::heartbeat:IPaddr2): Started controller-3
* ip-172.17.3.150 (ocf::heartbeat:IPaddr2): Started controller-1
* ip-172.17.4.150 (ocf::heartbeat:IPaddr2): Started controller-2
* Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
* haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-3
* haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
* haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
* Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
* redis-bundle-0 (ocf::heartbeat:redis): Slave controller-3
* redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
* redis-bundle-2 (ocf::heartbeat:redis): Master controller-2
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-2
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-3
* ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
* ip-172.17.1.124 (ocf::heartbeat:IPaddr2): Started controller-2
* stonith-fence_compute-fence-nova (stonith:fence_compute): Started messaging-2
* Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]:
* Started: [ compute-2 compute-3 ]
* Stopped: [ compute-0 compute-1 controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
* nova-evacuate (ocf::openstack:NovaEvacuate): Started messaging-1
* stonith-fence_ipmilan-525400ae968f (stonith:fence_ipmilan): Started messaging-0
* stonith-fence_ipmilan-52540004ccd5 (stonith:fence_ipmilan): Started database-2
* stonith-fence_ipmilan-52540050b535 (stonith:fence_ipmilan): Started database-0
* stonith-fence_ipmilan-52540069361a (stonith:fence_ipmilan): Started database-1
* stonith-fence_ipmilan-525400f1546f (stonith:fence_ipmilan): Started messaging-1
* stonith-fence_ipmilan-5254005a1b96 (stonith:fence_ipmilan): Started messaging-0
* stonith-fence_ipmilan-525400d0dca3 (stonith:fence_ipmilan): Started messaging-2
* stonith-fence_ipmilan-525400ec2aaa (stonith:fence_ipmilan): Started database-2
* stonith-fence_ipmilan-5254004b0708 (stonith:fence_ipmilan): Started messaging-2
* stonith-fence_ipmilan-525400d72431 (stonith:fence_ipmilan): Started messaging-1
* Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
* openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-3
* stonith-fence_ipmilan-5254004dce18 (stonith:fence_ipmilan): Started messaging-0
* compute-2 (ocf::pacemaker:remote): Started database-0
* compute-3 (ocf::pacemaker:remote): Started database-1
Failed Fencing Actions:
* unfencing of compute-1 failed: delegate=messaging-1, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:40:06Z'
* unfencing of compute-0 failed: delegate=controller-2, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:38:55Z'
* unfencing of compute-2 failed: delegate=, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:30:06Z'
* unfencing of compute-3 failed: delegate=, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:30:05Z'
Pending Fencing Actions:
* unfencing of compute-0 pending: client=pacemaker-controld.102516, origin=messaging-0
* unfencing of compute-1 pending: client=pacemaker-controld.102516, origin=messaging-0
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
The problem, it seems to me, is that we're missing stonith devices for the compute nodes that we gained in the scale-up:
[root@controller-1 ~]# pcs stonith
* stonith-fence_compute-fence-nova (stonith:fence_compute): Started messaging-2
* stonith-fence_ipmilan-525400ae968f (stonith:fence_ipmilan): Started messaging-0
* stonith-fence_ipmilan-52540004ccd5 (stonith:fence_ipmilan): Started database-2
* stonith-fence_ipmilan-52540050b535 (stonith:fence_ipmilan): Started database-0
* stonith-fence_ipmilan-52540069361a (stonith:fence_ipmilan): Started database-1
* stonith-fence_ipmilan-525400f1546f (stonith:fence_ipmilan): Started messaging-1
* stonith-fence_ipmilan-5254005a1b96 (stonith:fence_ipmilan): Started messaging-0
* stonith-fence_ipmilan-525400d0dca3 (stonith:fence_ipmilan): Started messaging-2
* stonith-fence_ipmilan-525400ec2aaa (stonith:fence_ipmilan): Started database-2
* stonith-fence_ipmilan-5254004b0708 (stonith:fence_ipmilan): Started messaging-2
* stonith-fence_ipmilan-525400d72431 (stonith:fence_ipmilan): Started messaging-1
* stonith-fence_ipmilan-5254004dce18 (stonith:fence_ipmilan): Started messaging-0
Target: compute-0
Level 1 - stonith-fence_ipmilan-525400ec2aaa,stonith-fence_compute-fence-nova
Target: compute-1
Level 1 - stonith-fence_ipmilan-52540069361a,stonith-fence_compute-fence-nova
Target: controller-0
Level 1 - stonith-fence_ipmilan-5254004dce18
Target: controller-1
Level 1 - stonith-fence_ipmilan-5254004b0708
Target: controller-2
Level 1 - stonith-fence_ipmilan-525400d72431
Target: database-0
Level 1 - stonith-fence_ipmilan-525400d0dca3
Target: database-1
Level 1 - stonith-fence_ipmilan-5254005a1b96
Target: database-2
Level 1 - stonith-fence_ipmilan-52540050b535
Target: messaging-0
Level 1 - stonith-fence_ipmilan-525400f1546f
Target: messaging-1
Level 1 - stonith-fence_ipmilan-52540004ccd5
Target: messaging-2
Level 1 - stonith-fence_ipmilan-525400ae968f
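Note that compute-2 and compute-3, the nodes added during the scale-up, have neither a fence_ipmilan device nor a fencing topology level. A minimal sketch of what the missing configuration would look like if added by hand; the device name, IPMI address and credentials below are hypothetical placeholders:

# Hypothetical values; the real ones come from the node's IPMI/BMC details.
pcs stonith create stonith-fence_ipmilan-compute2 fence_ipmilan \
    pcmk_host_list=compute-2 ip=192.168.24.99 username=admin password=secret lanplus=1 \
    op monitor interval=60s
pcs stonith level add 1 compute-2 stonith-fence_ipmilan-compute2,stonith-fence_compute-fence-nova

In practice this should not be done by hand; it is what the fencing generation step covered below produces for the whole deployment.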
This is probably more an issue with how we code the scale-up, really.
OK, the root cause is cloud-config. It creates a ~/newnodes.json instead of expanding the existing instackenv.json, and it also never calls 'openstack overcloud generate fencing', so there is no stonith config for the new compute nodes that were scaled up before. This also affects the controller replacement, because there won't be any stonith resource created for the new controller (controller-3 in our case).

There are two aspects to this issue. One is in https://review.opendev.org/c/openstack/puppet-tripleo/+/785863, which has merged upstream in train. The other is purely infrared-related and is here: https://projects.engineering.redhat.com/browse/RHOSINFRA-4003

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483
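For reference, the fencing definitions for all nodes are normally regenerated from the complete node inventory and passed back into the deploy. A minimal sketch of that workflow; exact flags vary by release, and the file names assume the stock layout:

# Regenerate fencing parameters from the full node inventory.
openstack overcloud generate fencing instackenv.json --output fencing.yaml
# Then include fencing.yaml among the usual deploy environment files:
openstack overcloud deploy ... -e fencing.yaml

Because cloud-config wrote the new nodes into a separate ~/newnodes.json and never re-ran this step, the generated fencing environment never covered the scaled-up computes or the replacement controller.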
Description of problem:

At least once so far I observed the controller replacement process failing at step 5 with a timeout:

2021-03-27 06:55:39.990599 | 52540007-3e3c-1bdb-5657-00000000d11e | TIMING | Wait for containers to start for step 5 using paunch | controller-1 | 1:13:59.510591 | 1254.49s

The timeout happened while waiting for the cinder-volume bundle to be restarted. I believe the restart failed because there are pending actions in the cluster:

Transition Summary:
* Fence (on) compute-1 'required by compute-unfence-trigger:1 start'
* Fence (on) compute-0 'required by compute-unfence-trigger:0 start'
* Restart compute-unfence-trigger:0 ( compute-0 ) due to required stonith
* Restart compute-unfence-trigger:1 ( compute-1 ) due to required stonith

The cluster status is the following:

Node List:
* GuestNode galera-bundle-0@database-0: maintenance
* GuestNode galera-bundle-1@database-1: maintenance
* GuestNode galera-bundle-2@database-2: maintenance
* Online: [ controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
* RemoteOnline: [ compute-0 compute-1 compute-2 compute-3 ]
* GuestOnline: [ ovn-dbs-bundle-0@controller-2 ovn-dbs-bundle-1@controller-3 ovn-dbs-bundle-2@controller-1 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-3 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full List of Resources:
* compute-0 (ocf::pacemaker:remote): Started controller-1
* compute-1 (ocf::pacemaker:remote): Started controller-2
* Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest] (unmanaged):
* galera-bundle-0 (ocf::heartbeat:galera): Master database-0 (unmanaged)
* galera-bundle-1 (ocf::heartbeat:galera): Master database-1 (unmanaged)
* galera-bundle-2 (ocf::heartbeat:galera): Master database-2 (unmanaged)
* Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
* rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started messaging-0
* rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started messaging-1
* rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started messaging-2
* Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
* redis-bundle-0 (ocf::heartbeat:redis): Slave controller-3
* redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
* redis-bundle-2 (ocf::heartbeat:redis): Master controller-2
* ip-192.168.24.150 (ocf::heartbeat:IPaddr2): Started controller-3
* ip-10.0.0.150 (ocf::heartbeat:IPaddr2): Started controller-1
* ip-172.17.1.151 (ocf::heartbeat:IPaddr2): Started controller-2
* ip-172.17.1.150 (ocf::heartbeat:IPaddr2): Started controller-3
* ip-172.17.3.150 (ocf::heartbeat:IPaddr2): Started controller-1
* ip-172.17.4.150 (ocf::heartbeat:IPaddr2): Started controller-2
* Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
* haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-3
* haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
* haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-2
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-3
* ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
* ip-172.17.1.65 (ocf::heartbeat:IPaddr2): Started controller-2
* stonith-fence_compute-fence-nova (stonith:fence_compute): Started messaging-1
* Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]:
* Started: [ compute-0 compute-1 compute-2 compute-3 ]
* Stopped: [ controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
* nova-evacuate (ocf::openstack:NovaEvacuate): Started messaging-0
* stonith-fence_ipmilan-52540034fed2 (stonith:fence_ipmilan): Started database-2
* stonith-fence_ipmilan-5254002d92e9 (stonith:fence_ipmilan): Started messaging-2
* stonith-fence_ipmilan-525400cc1955 (stonith:fence_ipmilan): Started database-0
* stonith-fence_ipmilan-5254007741f2 (stonith:fence_ipmilan): Started database-1
* stonith-fence_ipmilan-52540039cd82 (stonith:fence_ipmilan): Started messaging-0
* stonith-fence_ipmilan-525400a9fd5d (stonith:fence_ipmilan): Started messaging-1
* stonith-fence_ipmilan-525400528442 (stonith:fence_ipmilan): Started messaging-2
* stonith-fence_ipmilan-52540025606c (stonith:fence_ipmilan): Started database-2
* stonith-fence_ipmilan-52540002bfa5 (stonith:fence_ipmilan): Started database-2
* stonith-fence_ipmilan-5254003883a8 (stonith:fence_ipmilan): Started database-0
* stonith-fence_ipmilan-5254002473d9 (stonith:fence_ipmilan): Started database-1
* Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
* openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-3
* compute-2 (ocf::pacemaker:remote): Started database-0
* compute-3 (ocf::pacemaker:remote): Started database-1

Failed Resource Actions:
* stonith-fence_compute-fence-nova_start_0 on messaging-0 'error' (1): call=232, status='complete', exitreason='', last-rc-change='2021-03-27 05:16:19Z', queued=0ms, exec=6176ms

Failed Fencing Actions:
* unfencing of compute-0 failed: delegate=controller-1, client=pacemaker-controld.109474, origin=messaging-0, last-failed='2021-03-27 05:17:40Z'
* unfencing of compute-1 failed: delegate=controller-0, client=pacemaker-controld.109474, origin=messaging-0, last-failed='2021-03-27 05:17:40Z'

What needs investigation: why is the galera bundle in maintenance mode, and why do those unfence actions need to take place at all?

[root@controller-1 ~]# rpm -q pacemaker
pacemaker-2.0.5-9.el8.x86_64