Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1943772

Summary: [16.2] Controller replacement can fail at step 5 in a composable IHA env
Product: Red Hat OpenStack
Component: puppet-tripleo
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: ---
Keywords: Triaged
Reporter: Michele Baldessari <michele>
Assignee: Michele Baldessari <michele>
QA Contact: David Rosenfeld <drosenfe>
Docs Contact:
CC: dciabrin, jjoyce, jschluet, lmiccini, mburns, slinaber, tvignaud
Whiteboard:
Fixed In Version: puppet-tripleo-11.6.2-2.20210428172107.5c76ddc.el8ost.2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Type: Bug
Last Closed: 2021-09-15 07:13:13 UTC

Description Michele Baldessari 2021-03-27 09:10:26 UTC
Description of problem:
At least once so far I observed the controller replacement process failing at step 5 with a timeout:
2021-03-27 06:55:39.990599 | 52540007-3e3c-1bdb-5657-00000000d11e |     TIMING | Wait for containers to start for step 5 using paunch | controller-1 | 1:13:59.510591 | 1254.49s


The timeout happened while waiting for the cinder-volume bundle to be restarted.

I believe the restart failed because there are pending actions in the cluster:
Transition Summary:
 * Fence (on) compute-1 'required by compute-unfence-trigger:1 start'
 * Fence (on) compute-0 'required by compute-unfence-trigger:0 start'
 * Restart    compute-unfence-trigger:0     ( compute-0 )   due to required stonith
 * Restart    compute-unfence-trigger:1     ( compute-1 )   due to required stonith
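For reference, the summary above is the scheduler's view of the work it still wants to perform; on a live node it can be reproduced with something like:

[root@controller-1 ~]# crm_simulate -Ls

which prints the current cluster status, the allocation scores and the transition summary of actions that have not been executed yet.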

The cluster status is the following:
Node List:
  * GuestNode galera-bundle-0@database-0: maintenance
  * GuestNode galera-bundle-1@database-1: maintenance
  * GuestNode galera-bundle-2@database-2: maintenance
  * Online: [ controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
  * RemoteOnline: [ compute-0 compute-1 compute-2 compute-3 ]
  * GuestOnline: [ ovn-dbs-bundle-0@controller-2 ovn-dbs-bundle-1@controller-3 ovn-dbs-bundle-2@controller-1 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-3 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full List of Resources:
  * compute-0   (ocf::pacemaker:remote):         Started controller-1
  * compute-1   (ocf::pacemaker:remote):         Started controller-2
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest] (unmanaged):
    * galera-bundle-0   (ocf::heartbeat:galera):         Master database-0 (unmanaged)
    * galera-bundle-1   (ocf::heartbeat:galera):         Master database-1 (unmanaged)
    * galera-bundle-2   (ocf::heartbeat:galera):         Master database-2 (unmanaged)
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster):       Started messaging-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster):       Started messaging-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster):       Started messaging-2
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0    (ocf::heartbeat:redis):  Slave controller-3
    * redis-bundle-1    (ocf::heartbeat:redis):  Slave controller-1
    * redis-bundle-2    (ocf::heartbeat:redis):  Master controller-2
  * ip-192.168.24.150   (ocf::heartbeat:IPaddr2):        Started controller-3
  * ip-10.0.0.150       (ocf::heartbeat:IPaddr2):        Started controller-1
  * ip-172.17.1.151     (ocf::heartbeat:IPaddr2):        Started controller-2
  * ip-172.17.1.150     (ocf::heartbeat:IPaddr2):        Started controller-3
  * ip-172.17.3.150     (ocf::heartbeat:IPaddr2):        Started controller-1
  * ip-172.17.4.150     (ocf::heartbeat:IPaddr2):        Started controller-2
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0   (ocf::heartbeat:podman):         Started controller-3
    * haproxy-bundle-podman-1   (ocf::heartbeat:podman):         Started controller-1
    * haproxy-bundle-podman-2   (ocf::heartbeat:podman):         Started controller-2
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0  (ocf::ovn:ovndb-servers):        Master controller-2
    * ovn-dbs-bundle-1  (ocf::ovn:ovndb-servers):        Slave controller-3
    * ovn-dbs-bundle-2  (ocf::ovn:ovndb-servers):        Slave controller-1
  * ip-172.17.1.65      (ocf::heartbeat:IPaddr2):        Started controller-2
  * stonith-fence_compute-fence-nova    (stonith:fence_compute):         Started messaging-1
  * Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]:
    * Started: [ compute-0 compute-1 compute-2 compute-3 ]
    * Stopped: [ controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
  * nova-evacuate       (ocf::openstack:NovaEvacuate):   Started messaging-0
  * stonith-fence_ipmilan-52540034fed2  (stonith:fence_ipmilan):         Started database-2
  * stonith-fence_ipmilan-5254002d92e9  (stonith:fence_ipmilan):         Started messaging-2
  * stonith-fence_ipmilan-525400cc1955  (stonith:fence_ipmilan):         Started database-0
  * stonith-fence_ipmilan-5254007741f2  (stonith:fence_ipmilan):         Started database-1
  * stonith-fence_ipmilan-52540039cd82  (stonith:fence_ipmilan):         Started messaging-0
  * stonith-fence_ipmilan-525400a9fd5d  (stonith:fence_ipmilan):         Started messaging-1
  * stonith-fence_ipmilan-525400528442  (stonith:fence_ipmilan):         Started messaging-2
  * stonith-fence_ipmilan-52540025606c  (stonith:fence_ipmilan):         Started database-2
  * stonith-fence_ipmilan-52540002bfa5  (stonith:fence_ipmilan):         Started database-2
  * stonith-fence_ipmilan-5254003883a8  (stonith:fence_ipmilan):         Started database-0
  * stonith-fence_ipmilan-5254002473d9  (stonith:fence_ipmilan):         Started database-1
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0  (ocf::heartbeat:podman):         Started controller-3
  * compute-2   (ocf::pacemaker:remote):         Started database-0
  * compute-3   (ocf::pacemaker:remote):         Started database-1

Failed Resource Actions:
  * stonith-fence_compute-fence-nova_start_0 on messaging-0 'error' (1): call=232, status='complete', exitreason='', last-rc-change='2021-03-27 05:16:19Z', queued=0ms, exec=6176ms

Failed Fencing Actions:
  * unfencing of compute-0 failed: delegate=controller-1, client=pacemaker-controld.109474, origin=messaging-0, last-failed='2021-03-27 05:17:40Z'
  * unfencing of compute-1 failed: delegate=controller-0, client=pacemaker-controld.109474, origin=messaging-0, last-failed='2021-03-27 05:17:40Z'


What needs investigation is why the galera bundle is in maintenance mode, and why those unfence actions need to take place at all.

[root@controller-1 ~]# rpm -q pacemaker
pacemaker-2.0.5-9.el8.x86_64

Comment 1 Michele Baldessari 2021-03-29 07:45:42 UTC
I can reproduce this one quite often.

Putting the galera bundle into maintenance/unmanaged mode is something done by cloud-config here:
https://github.com/rhos-infra/cloud-config/blob/master/post_tasks/roles/replace-controller/tasks/manual_preparation.yml#L48

So I think we can ignore that for now.
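I have not dug into that task in detail, but given the (unmanaged)/maintenance state above it presumably boils down to something along the lines of:

[root@controller-1 ~]# pcs resource unmanage galera-bundle

i.e. the bundle keeps running but pacemaker stops acting on it, which is consistent with the status output.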

I am not sure yet why the restart fails during the deploy.
In a cluster with the following pending actions:
Transition Summary:
 * Fence (on) compute-1 'required by compute-unfence-trigger:1 start'
 * Fence (on) compute-0 'required by compute-unfence-trigger:0 start'
 * Restart    compute-unfence-trigger:0     ( compute-0 )   due to required stonith
 * Restart    compute-unfence-trigger:1     ( compute-1 )   due to required stonith

The following manual command still seems to work:
[root@controller-1 ~]# date; pcs resource restart --wait=300 openstack-cinder-volume ; date
Mon Mar 29 07:41:24 UTC 2021
openstack-cinder-volume successfully restarted
Mon Mar 29 07:41:58 UTC 2021

The error is:
2021-03-29 07:19:12.935 219900 DEBUG paunch [  ] b'Mon Mar 29 06:59:09 UTC 2021: Restarting openstack-cinder-volume globally. Stopping:\nMon Mar 29 07:09:11 UTC 2021: Restarting openstack-cinder-volume globally. Starting:\n'
2021-03-29 07:19:12.935 219900 DEBUG paunch [  ] b'time="2021-03-29T06:59:08Z" level=error msg="Error loading CNI config list file /etc/cni/net.d/87-podman-bridge.conflist: error parsing configuration list: no name"\nError: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 303: compute-unfence-trigger:1_start_0\ton compute-1\n\tAction 302: compute-unfence-trigger:1_stop_0\ton compute-1\n\tAction 301: compute-unfence-trigger:0_start_0\ton compute-0\n\tAction 300: compute-unfence-trigger:0_stop_0\ton compute-0\n\tAction 87: stonith-compute-0-on\ton compute-0\n\tAction 86: stonith-compute-1-on\ton compute-1\n\tAction 48: compute-unfence-trigger:1_monitor_10000\ton compute-1\n\tAction 47: compute-unfence-trigger:0_monitor_10000\ton compute-0\nError: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 304: compute-unfence-trigger:1_start_0\ton compute-1\n\tAction 303: compute-unfence-trigger:1_stop_0\ton compute-1\n\tAction 302: compute-unfence-trigger:0_start_0\ton compute-0\n\tAction 301: compute-unfence-trigger:0_stop_0\ton compute-0\n\tAction 88: stonith-compute-0-on\ton compute-0\n\tAction 87: stonith-compute-1-on\ton compute-1\n\tAction 48: compute-unfence-trigger:1_monitor_10000\ton compute-1\n\tAction 47: compute-unfence-trigger:0_monitor_10000\ton compute-0\n'
2021-03-29 07:19:12.936 219900 ERROR paunch [  ] Error running ['podman', 'run', '--name', 'cinder_volume_restart_bundle', '--label', 'config_id=tripleo_step5', '--label', 'container_name=cinder_volume_restart_bundle', '--label', 'managed_by=tripleo-ControllerOpenstack', '--label', 'config_data={"command": "/var/lib/container-config-scripts/pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started", "config_volume": "cinder", "detach": false, "environment": {"TRIPLEO_MINOR_UPDATE": "", "TRIPLEO_CONFIG_HASH": "67bb174a47812a3a4e05a13992175087"}, "image": "undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:16.2_20210323.1-hotfixupdate2", "ipc": "host", "net": "host", "start_order": 2, "user": "root", "volumes": ["/etc/hosts:/etc/hosts:ro", "/etc/localtime:/etc/localtime:ro", "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro", "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro", "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro", "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro", "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro", "/dev/log:/dev/log", "/etc/ipa/ca.crt:/etc/ipa/ca.crt:ro", "/var/lib/container-config-scripts:/var/lib/container-config-scripts:ro", "/dev/shm:/dev/shm:rw", "/etc/puppet:/etc/puppet:ro", "/var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro"]}', '--conmon-pidfile=/var/run/cinder_volume_restart_bundle.pid', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/cinder_volume_restart_bundle.log', '--env=TRIPLEO_CONFIG_HASH=67bb174a47812a3a4e05a13992175087', '--env=TRIPLEO_MINOR_UPDATE', '--net=host', '--ipc=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ipa/ca.crt:/etc/ipa/ca.crt:ro', '--volume=/var/lib/container-config-scripts:/var/lib/container-config-scripts:ro', '--volume=/dev/shm:/dev/shm:rw', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro', 'undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:16.2_20210323.1-hotfixupdate2', '/var/lib/container-config-scripts/pacemaker_restart_bundle.sh', 'cinder_volume', 'openstack-cinder-volume', 'openstack-cinder-volume', '_', 'Started']. [1]
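Note the 'Error: waiting timeout ... Pending actions' part of the output above: the restart script waits for the transition to settle, and a pcs/crm_resource --wait style call only returns once the cluster is idle. With the unfence/stonith actions stuck pending, any invocation of that shape will run into its timer, e.g. (illustrative commands, not necessarily exactly what the script runs):

[root@controller-1 ~]# pcs resource disable --wait=600 openstack-cinder-volume
[root@controller-1 ~]# pcs resource enable --wait=600 openstack-cinder-volume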

But this also means that pcs on host has not yet landed in the 16.2 composes, which is also quite concerning.

Comment 2 Michele Baldessari 2021-04-11 08:30:56 UTC
As somewhat expected, this also happens with pcs on host:
2021-04-11 06:04:42.686199 | 5254001f-5135-53b1-9a16-00000000ccb2 |      FATAL | Run pacemaker restart if the config file for the service changed | controller-1 | error={"changed": false, "error": "Failed running command", "msg": "Error running /var/lib/container-config-scripts/pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started. rc: 1, stdout: Sun Apr 11 05:44:39 UTC 2021: Restarting openstack-cinder-volume globally. Stopping:\nSun Apr 11 05:54:41 UTC 2021: Restarting openstack-cinder-volume globally. Starting:\n, stderr: Error: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 305: compute-unfence-trigger:3_monitor_10000\ton compute-1\n\tAction 304: compute-unfence-trigger:3_start_0\ton compute-1\n\tAction 303: compute-unfence-trigger:2_monitor_10000\ton compute-0\n\tAction 302: compute-unfence-trigger:2_start_0\ton compute-0\n\tAction 85: stonith-compute-0-on\ton compute-0\n\tAction 84: stonith-compute-1-on\ton compute-1\nError: waiting timeout\n\ncrm_resource: Error performing operation: Timer expired\nPending actions:\n\tAction 307: compute-unfence-trigger:3_monitor_10000\ton compute-0\n\tAction 306: compute-unfence-trigger:3_start_0\ton compute-0\n\tAction 301: compute-unfence-trigger:0_start_0\ton compute-1\n\tAction 300: compute-unfence-trigger:0_stop_0\ton compute-1\n\tAction 87: stonith-compute-0-on\ton compute-0\n\tAction 86: stonith-compute-1-on\ton compute-1\n\tAction 47: compute-unfence-trigger:0_monitor_10000\ton compute-1\n"}
Overcloud configuration failed.
2021-04-11 06:04:42.686800 | 5254001f-5135-53b1-9a16-00000000ccb2 |     TIMING | tripleo_ha_wrapper : Run pacemaker restart if the config file for the service changed | controller-1 | 1:02:23.064958 | 1204.44s


Running it by hand also fails, likely because there are a number of pending actions:
[root@controller-1 ~]# /var/lib/container-config-scripts/pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started
Sun Apr 11 08:14:17 UTC 2021: Restarting openstack-cinder-volume globally. Stopping:
Error: waiting timeout

crm_resource: Error performing operation: Timer expired
Pending actions:
	Action 306: compute-unfence-trigger:3_monitor_10000	on compute-1
	Action 305: compute-unfence-trigger:3_start_0	on compute-1
	Action 300: compute-unfence-trigger:0_start_0	on compute-0
	Action 299: compute-unfence-trigger:0_stop_0	on compute-0
	Action 86: stonith-compute-0-on	on compute-0
	Action 85: stonith-compute-1-on	on compute-1
	Action 47: compute-unfence-trigger:0_monitor_10000	on compute-0
Sun Apr 11 08:24:20 UTC 2021: Restarting openstack-cinder-volume globally. Starting:

Comment 3 Michele Baldessari 2021-04-11 08:41:05 UTC
Oops, the paste got cut off:
Sun Apr 11 08:24:20 UTC 2021: Restarting openstack-cinder-volume globally. Starting:

Error: waiting timeout

crm_resource: Error performing operation: Timer expired
Pending actions:
	Action 307: compute-unfence-trigger:3_monitor_10000	on compute-0
	Action 306: compute-unfence-trigger:3_start_0	on compute-0
	Action 301: compute-unfence-trigger:0_start_0	on compute-1
	Action 300: compute-unfence-trigger:0_stop_0	on compute-1
	Action 87: stonith-compute-0-on	on compute-0
	Action 86: stonith-compute-1-on	on compute-1
	Action 47: compute-unfence-trigger:0_monitor_10000	on compute-1
[root@controller-1 ~]#


crm_simulate -Ls gives:

Transition Summary:
 * Fence (on) compute-1 'required by compute-unfence-trigger:0 start'
 * Fence (on) compute-0 'required by compute-unfence-trigger:3 start'
 * Restart    compute-unfence-trigger:0     ( compute-1 )   due to required stonith
 * Start      compute-unfence-trigger:3     ( compute-0 )


Even after doing a full "pcs resource cleanup" + "pcs stonith cleanup" we are still left with:
Transition Summary:
 * Fence (on) compute-1 'required by compute-unfence-trigger:3 start'
 * Fence (on) compute-0 'required by compute-unfence-trigger:2 start'
 * Start      compute-unfence-trigger:2     ( compute-0 )
 * Start      compute-unfence-trigger:3     ( compute-1 )
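(The cleanup mentioned above was simply:

[root@controller-1 ~]# pcs resource cleanup
[root@controller-1 ~]# pcs stonith cleanup

which clears the failure and fencing history but, as seen, the scheduler immediately wants to unfence the computes again.)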

Full status being:
[root@controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: messaging-0 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
  * Last updated: Sun Apr 11 08:37:40 2021
  * Last change:  Sun Apr 11 08:35:31 2021 by hacluster via crmd on messaging-1
  * 25 nodes configured
  * 89 resource instances configured

Node List:
  * GuestNode galera-bundle-0@database-0: maintenance
  * GuestNode galera-bundle-1@database-1: maintenance
  * GuestNode galera-bundle-2@database-2: maintenance
  * Online: [ controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
  * RemoteOnline: [ compute-0 compute-1 compute-2 compute-3 ]
  * GuestOnline: [ ovn-dbs-bundle-0@controller-2 ovn-dbs-bundle-1@controller-3 ovn-dbs-bundle-2@controller-1 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-3 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full List of Resources:
  * compute-0	(ocf::pacemaker:remote):	 Started controller-1
  * compute-1	(ocf::pacemaker:remote):	 Started controller-2
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest] (unmanaged):
    * galera-bundle-0	(ocf::heartbeat:galera):	 Master database-0 (unmanaged)
    * galera-bundle-1	(ocf::heartbeat:galera):	 Master database-1 (unmanaged)
    * galera-bundle-2	(ocf::heartbeat:galera):	 Master database-2 (unmanaged)
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	 Started messaging-0
    * rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	 Started messaging-1
    * rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	 Started messaging-2
  * ip-192.168.24.150	(ocf::heartbeat:IPaddr2):	 Started controller-3
  * ip-10.0.0.150	(ocf::heartbeat:IPaddr2):	 Started controller-1
  * ip-172.17.1.151	(ocf::heartbeat:IPaddr2):	 Started controller-2
  * ip-172.17.1.150	(ocf::heartbeat:IPaddr2):	 Started controller-3
  * ip-172.17.3.150	(ocf::heartbeat:IPaddr2):	 Started controller-1
  * ip-172.17.4.150	(ocf::heartbeat:IPaddr2):	 Started controller-2
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0	(ocf::heartbeat:podman):	 Started controller-3
    * haproxy-bundle-podman-1	(ocf::heartbeat:podman):	 Started controller-1
    * haproxy-bundle-podman-2	(ocf::heartbeat:podman):	 Started controller-2
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0	(ocf::heartbeat:redis):	 Slave controller-3
    * redis-bundle-1	(ocf::heartbeat:redis):	 Slave controller-1
    * redis-bundle-2	(ocf::heartbeat:redis):	 Master controller-2
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	 Master controller-2
    * ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	 Slave controller-3
    * ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	 Slave controller-1
  * ip-172.17.1.124	(ocf::heartbeat:IPaddr2):	 Started controller-2
  * stonith-fence_compute-fence-nova	(stonith:fence_compute):	 Started messaging-2
  * Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]:
    * Started: [ compute-2 compute-3 ]
    * Stopped: [ compute-0 compute-1 controller-1 controller-2 controller-3 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
  * nova-evacuate	(ocf::openstack:NovaEvacuate):	 Started messaging-1
  * stonith-fence_ipmilan-525400ae968f	(stonith:fence_ipmilan):	 Started messaging-0
  * stonith-fence_ipmilan-52540004ccd5	(stonith:fence_ipmilan):	 Started database-2
  * stonith-fence_ipmilan-52540050b535	(stonith:fence_ipmilan):	 Started database-0
  * stonith-fence_ipmilan-52540069361a	(stonith:fence_ipmilan):	 Started database-1
  * stonith-fence_ipmilan-525400f1546f	(stonith:fence_ipmilan):	 Started messaging-1
  * stonith-fence_ipmilan-5254005a1b96	(stonith:fence_ipmilan):	 Started messaging-0
  * stonith-fence_ipmilan-525400d0dca3	(stonith:fence_ipmilan):	 Started messaging-2
  * stonith-fence_ipmilan-525400ec2aaa	(stonith:fence_ipmilan):	 Started database-2
  * stonith-fence_ipmilan-5254004b0708	(stonith:fence_ipmilan):	 Started messaging-2
  * stonith-fence_ipmilan-525400d72431	(stonith:fence_ipmilan):	 Started messaging-1
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0	(ocf::heartbeat:podman):	 Started controller-3
  * stonith-fence_ipmilan-5254004dce18	(stonith:fence_ipmilan):	 Started messaging-0
  * compute-2	(ocf::pacemaker:remote):	 Started database-0
  * compute-3	(ocf::pacemaker:remote):	 Started database-1

Failed Fencing Actions:
  * unfencing of compute-1 failed: delegate=messaging-1, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:40:06Z'
  * unfencing of compute-0 failed: delegate=controller-2, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:38:55Z'
  * unfencing of compute-2 failed: delegate=, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:30:06Z'
  * unfencing of compute-3 failed: delegate=, client=pacemaker-controld.102516, origin=messaging-0, last-failed='2021-04-11 04:30:05Z'

Pending Fencing Actions:
  * unfencing of compute-0 pending: client=pacemaker-controld.102516, origin=messaging-0
  * unfencing of compute-1 pending: client=pacemaker-controld.102516, origin=messaging-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

The problem, as I see it, is that we're missing stonith devices for the compute nodes that were added during the scale-up:
[root@controller-1 ~]# pcs stonith
  * stonith-fence_compute-fence-nova	(stonith:fence_compute):	 Started messaging-2
  * stonith-fence_ipmilan-525400ae968f	(stonith:fence_ipmilan):	 Started messaging-0
  * stonith-fence_ipmilan-52540004ccd5	(stonith:fence_ipmilan):	 Started database-2
  * stonith-fence_ipmilan-52540050b535	(stonith:fence_ipmilan):	 Started database-0
  * stonith-fence_ipmilan-52540069361a	(stonith:fence_ipmilan):	 Started database-1
  * stonith-fence_ipmilan-525400f1546f	(stonith:fence_ipmilan):	 Started messaging-1
  * stonith-fence_ipmilan-5254005a1b96	(stonith:fence_ipmilan):	 Started messaging-0
  * stonith-fence_ipmilan-525400d0dca3	(stonith:fence_ipmilan):	 Started messaging-2
  * stonith-fence_ipmilan-525400ec2aaa	(stonith:fence_ipmilan):	 Started database-2
  * stonith-fence_ipmilan-5254004b0708	(stonith:fence_ipmilan):	 Started messaging-2
  * stonith-fence_ipmilan-525400d72431	(stonith:fence_ipmilan):	 Started messaging-1
  * stonith-fence_ipmilan-5254004dce18	(stonith:fence_ipmilan):	 Started messaging-0
 Target: compute-0
   Level 1 - stonith-fence_ipmilan-525400ec2aaa,stonith-fence_compute-fence-nova
 Target: compute-1
   Level 1 - stonith-fence_ipmilan-52540069361a,stonith-fence_compute-fence-nova
 Target: controller-0
   Level 1 - stonith-fence_ipmilan-5254004dce18
 Target: controller-1
   Level 1 - stonith-fence_ipmilan-5254004b0708
 Target: controller-2
   Level 1 - stonith-fence_ipmilan-525400d72431
 Target: database-0
   Level 1 - stonith-fence_ipmilan-525400d0dca3
 Target: database-1
   Level 1 - stonith-fence_ipmilan-5254005a1b96
 Target: database-2
   Level 1 - stonith-fence_ipmilan-52540050b535
 Target: messaging-0
   Level 1 - stonith-fence_ipmilan-525400f1546f
 Target: messaging-1
   Level 1 - stonith-fence_ipmilan-52540004ccd5
 Target: messaging-2
   Level 1 - stonith-fence_ipmilan-525400ae968f


This is probably more an issue with how we code the scale-up, really.

Comment 4 Michele Baldessari 2021-04-11 09:25:23 UTC
OK, the root cause is cloud-config. It creates a ~/newnodes.json instead of expanding the existing instackenv.json, and it also never calls 'openstack overcloud generate fencing', so there is no stonith config for the new compute nodes that were scaled up earlier.

This also affects the controller replacement, because there won't be any stonith resource created for the new controller (controller-3 in our case).
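For the record, the missing piece should be (roughly) regenerating and re-applying the fencing parameters after nodes are added; the invocation below is from memory and may need adjusting for this environment:

(undercloud) $ openstack overcloud generate fencing --ipmi-lanplus --output fencing.yaml instackenv.json

and then passing fencing.yaml to the next overcloud deploy so that stonith resources get created for the new nodes.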

Comment 5 Michele Baldessari 2021-05-05 08:05:33 UTC
There are two aspects to this issue.
One issue is https://review.opendev.org/c/openstack/puppet-tripleo/+/785863, which has merged upstream on the train branch.

The other one is purely Infrared-related and is tracked here: https://projects.engineering.redhat.com/browse/RHOSINFRA-4003

Comment 9 errata-xmlrpc 2021-09-15 07:13:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483