Bug 1894097 - [OSP 16.1.2] Instance HA fails with - "Waiting for fence-down flag to be cleared"
Summary: [OSP 16.1.2] Instance HA fails with - "Waiting for fence-down flag to be cleared"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-pacemaker
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z4
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: dabarzil
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 1879088
 
Reported: 2020-11-03 14:42 UTC by Maxim Babushkin
Modified: 2021-04-25 08:05 UTC
CC List: 17 users

Fixed In Version: puppet-pacemaker-1.0.1-1.20201114034949.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-17 15:33:21 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
  Launchpad bug 1905606 - last updated 2020-11-25 18:12:48 UTC
  OpenStack gerrit change 764227 (MERGED): "Fix reconnect_interval on remotes with pcs 0.10" - last updated 2021-02-19 08:51:02 UTC
  Red Hat Product Errata RHBA-2021:0817 - last updated 2021-03-17 15:35:27 UTC

Description Maxim Babushkin 2020-11-03 14:42:34 UTC
Description of problem:
When using the instance HA feature, the nova_compute container gets stuck with the "Waiting for fence-down flag to be cleared" message.


Version-Release number of selected component (if applicable):
16.1.2


The environment (DPDK based) was deployed using the instance HA feature.
When a hypervisor node crashed, the instance was recreated on the second hypervisor successfully.
The crashed hypervisor node was rebooted, and the nova_compute container got stuck in "/var/lib/nova/instanceha/check-run-nova-compute" with the following message:
"Waiting for fence-down flag to be cleared".

While the VM was recreated on the second hypervisor and can be seen there using the "virsh list" command, on the first (failed) hypervisor the same VM can still be seen in the "power off" state.

If the VM is undefined and the compute node is rebooted, it gets stuck on the same nova_compute container message.
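
For reference, a minimal way to inspect the state from a controller, assuming (an assumption, not confirmed in this report) that the "fence-down" flag checked by check-run-nova-compute is the transient "evacuate" node attribute managed by fence_compute/NovaEvacuate; the node name below is the affected remote from this environment:

# Hedged sketch: query the transient "evacuate" attribute that the
# check-run-nova-compute wrapper is assumed to wait on (run on a controller).
attrd_updater --query --name evacuate --node computeovsdpdksriov-1

# Show the pacemaker remote resource for the affected compute, including
# reconnect_interval, which controls how long pacemaker waits before
# retrying the remote connection after a failure.
pcs resource config computeovsdpdksriov-1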


Output of the pcs status:
########################################################################
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: controller-2 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Nov  3 14:41:46 2020
  * Last change:  Tue Nov  3 11:09:36 2020 by root via cibadmin on controller-0
  * 14 nodes configured
  * 60 resource instances configured

Node List:
  * Online: [ controller-0 controller-1 controller-2 ]
  * RemoteOnline: [ computeovsdpdksriov-0 ]
  * RemoteOFFLINE: [ computeovsdpdksriov-1 ]
  * GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full List of Resources:
  * computeovsdpdksriov-0	(ocf::pacemaker:remote):	Started controller-0
  * computeovsdpdksriov-1	(ocf::pacemaker:remote):	Stopped
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0	(ocf::heartbeat:galera):	Master controller-2
    * galera-bundle-1	(ocf::heartbeat:galera):	Master controller-0
    * galera-bundle-2	(ocf::heartbeat:galera):	Master controller-1
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
    * rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
    * rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0	(ocf::heartbeat:redis):	Master controller-2
    * redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-0
    * redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-1
  * ip-192.0.10.7	(ocf::heartbeat:IPaddr2):	Started controller-2
  * ip-10.35.141.97	(ocf::heartbeat:IPaddr2):	Started controller-0
  * ip-10.10.100.198	(ocf::heartbeat:IPaddr2):	Started controller-1
  * ip-10.10.100.132	(ocf::heartbeat:IPaddr2):	Started controller-2
  * ip-10.10.102.193	(ocf::heartbeat:IPaddr2):	Started controller-0
  * ip-10.10.103.105	(ocf::heartbeat:IPaddr2):	Started controller-1
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0	(ocf::heartbeat:podman):	Started controller-2
    * haproxy-bundle-podman-1	(ocf::heartbeat:podman):	Started controller-0
    * haproxy-bundle-podman-2	(ocf::heartbeat:podman):	Started controller-1
  * stonith-fence_compute-fence-nova	(stonith:fence_compute):	Started controller-1
  * Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]:
    * Started: [ computeovsdpdksriov-0 ]
    * Stopped: [ computeovsdpdksriov-1 controller-0 controller-1 controller-2 ]
  * nova-evacuate	(ocf::openstack:NovaEvacuate):	Started controller-2
  * stonith-fence_ipmilan-5254007e7721	(stonith:fence_ipmilan):	Started controller-2
  * stonith-fence_ipmilan-5254004cd6b7	(stonith:fence_ipmilan):	Started controller-1
  * stonith-fence_ipmilan-801844f28bdd	(stonith:fence_ipmilan):	Started controller-0
  * stonith-fence_ipmilan-801844f288d5	(stonith:fence_ipmilan):	Started controller-1
  * stonith-fence_ipmilan-525400ec2a0e	(stonith:fence_ipmilan):	Started controller-2
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0	(ocf::heartbeat:podman):	Started controller-0

Failed Resource Actions:
  * computeovsdpdksriov-1_start_0 on controller-1 'error' (1): call=22, status='Timed Out', exitreason='', last-rc-change='2020-11-03 11:25:06Z', queued=0ms, exec=0ms
  * computeovsdpdksriov-1_start_0 on controller-2 'error' (1): call=18, status='Timed Out', exitreason='', last-rc-change='2020-11-03 11:26:05Z', queued=0ms, exec=0ms
  * computeovsdpdksriov-1_start_0 on controller-0 'error' (1): call=21, status='Timed Out', exitreason='', last-rc-change='2020-11-03 11:27:03Z', queued=0ms, exec=0ms

Failed Fencing Actions:
  * unfencing of controller-0 failed: delegate=, client=pacemaker-controld.25646, origin=controller-2, last-failed='2020-11-03 11:08:16Z'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
########################################################################


The bug is very similar to the following BZ in OSP 13:
https://bugzilla.redhat.com/show_bug.cgi?id=1703946

Comment 1 Maxim Babushkin 2020-11-03 14:43:59 UTC
Sosreports available in the following link:
http://file.mad.redhat.com/~mbabushk/sosreports/bz1894097/

Comment 2 Luca Miccini 2020-11-03 15:18:49 UTC
(In reply to Maxim Babushkin from comment #1)
> Sosreports available in the following link:
> http://file.mad.redhat.com/~mbabushk/sosreports/bz1894097/

Hi Maxim, can you please double check the permissions? I can't download those sosreports (getting 403).
thanks
Luca

Comment 3 Maxim Babushkin 2020-11-03 15:24:03 UTC
Hi Luca,

Fixed.
Thanks.

Comment 7 Maxim Babushkin 2020-11-24 18:36:21 UTC
Hi Luca,

I applied the patch details from the LP bug.
The reconnect interval is set to 300.

The hypervisor is still stuck after the crash with the same error - "Waiting for fence-down flag to be cleared".
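
For anyone reproducing this, a hedged sketch of how the applied value could be double checked on the remote resources themselves (reconnect_interval is a parameter of the ocf:pacemaker:remote agent; the resource name is taken from the pcs status above and 300 is the value quoted in this comment):

# Check whether reconnect_interval actually landed on the remote resource.
pcs resource config computeovsdpdksriov-1

# If it is missing, set it by hand (value in seconds) and re-check.
pcs resource update computeovsdpdksriov-1 reconnect_interval=300
pcs resource config computeovsdpdksriov-1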

Comment 8 Michele Baldessari 2020-11-24 19:01:18 UTC
(In reply to Maxim Babushkin from comment #7)
> Hi Luca,
> 
> I applied the patch details from the lp bug.
> The reconnect interval set to 300.
> 
> The hypervisor still stuck after the crash is the same error - "Waiting for
> fence-down flag to be cleared".
Hi Maxim,

Could we have access to the env to analyze it?

Thanks,
Michele

Comment 23 Maxim Babushkin 2021-02-01 09:48:23 UTC
Daniel,

Our team is currently focused on OVN DPDK + SRIOV testing.
These are urgent tasks.

Once finished, I will verify this BZ.

Comment 34 errata-xmlrpc 2021-03-17 15:33:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0817

Comment 36 Luca Miccini 2021-04-23 09:19:02 UTC
The solution for the issue reported in this BZ is *not* just to update to the puppet-pacemaker referenced in the "Fixed In Version" field; you must also set a proper value for pacemaker_remote_reconnect_interval, like the following:

  ExtraConfig:
    pacemaker_remote_reconnect_interval: XXX

Prior versions of puppet-pacemaker did not allow this parameter to be modified.

Any server that takes more than the default amount of time to come back online (from a pacemaker perspective) will not be unfenced properly and will therefore show the "Waiting for fence-down flag to be cleared" message.

Operators should test and set a proper value for this parameter to give their servers enough time to boot and pacemaker time to re-enable nova_compute.
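
A hedged sketch of how this could look in practice with TripleO (the environment file name is made up for illustration, and 300 is just the value used earlier in this report; size it to your own servers' boot time as noted above):

# Put the setting into a custom environment file (hypothetical name/path).
cat > /home/stack/reconnect-interval.yaml <<'EOF'
parameter_defaults:
  ExtraConfig:
    pacemaker_remote_reconnect_interval: 300
EOF

# Pass it on the next overcloud deploy/update together with the environment
# files already used for this cloud (placeholder below, replace as needed).
openstack overcloud deploy --templates \
  -e <existing environment files> \
  -e /home/stack/reconnect-interval.yaml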

