Description of problem:
When using the instance HA feature, the nova_compute container gets stuck with the "Waiting for fence-down flag to be cleared" message.

Version-Release number of selected component (if applicable): 16.1.2

The environment (DPDK based) was deployed using the instance HA feature.
When a hypervisor node crashed, the instance was recreated on the second hypervisor successfully.
The crashed hypervisor node rebooted, and the nova_compute container got stuck on "/var/lib/nova/instanceha/check-run-nova-compute" with the following message: "Waiting for fence-down flag to be cleared".

While the VM was recreated on the second hypervisor and can be seen using the "virsh list" command, on the first "failed" hypervisor the same VM can be seen in the "power off" state.
If the VM is undefined and the compute node is rebooted, it gets stuck on the same nova_compute container message.

Output of pcs status:
########################################################################
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: controller-2 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Nov 3 14:41:46 2020
  * Last change: Tue Nov 3 11:09:36 2020 by root via cibadmin on controller-0
  * 14 nodes configured
  * 60 resource instances configured

Node List:
  * Online: [ controller-0 controller-1 controller-2 ]
  * RemoteOnline: [ computeovsdpdksriov-0 ]
  * RemoteOFFLINE: [ computeovsdpdksriov-1 ]
  * GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full List of Resources:
  * computeovsdpdksriov-0 (ocf::pacemaker:remote): Started controller-0
  * computeovsdpdksriov-1 (ocf::pacemaker:remote): Stopped
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master controller-2
    * galera-bundle-1 (ocf::heartbeat:galera): Master controller-0
    * galera-bundle-2 (ocf::heartbeat:galera): Master controller-1
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Master controller-2
    * redis-bundle-1 (ocf::heartbeat:redis): Slave controller-0
    * redis-bundle-2 (ocf::heartbeat:redis): Slave controller-1
  * ip-192.0.10.7 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-10.35.141.97 (ocf::heartbeat:IPaddr2): Started controller-0
  * ip-10.10.100.198 (ocf::heartbeat:IPaddr2): Started controller-1
  * ip-10.10.100.132 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-10.10.102.193 (ocf::heartbeat:IPaddr2): Started controller-0
  * ip-10.10.103.105 (ocf::heartbeat:IPaddr2): Started controller-1
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-2
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-0
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-1
  * stonith-fence_compute-fence-nova (stonith:fence_compute): Started controller-1
  * Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]:
    * Started: [ computeovsdpdksriov-0 ]
    * Stopped: [ computeovsdpdksriov-1 controller-0 controller-1 controller-2 ]
  * nova-evacuate (ocf::openstack:NovaEvacuate): Started controller-2
  * stonith-fence_ipmilan-5254007e7721 (stonith:fence_ipmilan): Started controller-2
  * stonith-fence_ipmilan-5254004cd6b7 (stonith:fence_ipmilan): Started controller-1
  * stonith-fence_ipmilan-801844f28bdd (stonith:fence_ipmilan): Started controller-0
  * stonith-fence_ipmilan-801844f288d5 (stonith:fence_ipmilan): Started controller-1
  * stonith-fence_ipmilan-525400ec2a0e (stonith:fence_ipmilan): Started controller-2
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-0

Failed Resource Actions:
  * computeovsdpdksriov-1_start_0 on controller-1 'error' (1): call=22, status='Timed Out', exitreason='', last-rc-change='2020-11-03 11:25:06Z', queued=0ms, exec=0ms
  * computeovsdpdksriov-1_start_0 on controller-2 'error' (1): call=18, status='Timed Out', exitreason='', last-rc-change='2020-11-03 11:26:05Z', queued=0ms, exec=0ms
  * computeovsdpdksriov-1_start_0 on controller-0 'error' (1): call=21, status='Timed Out', exitreason='', last-rc-change='2020-11-03 11:27:03Z', queued=0ms, exec=0ms

Failed Fencing Actions:
  * unfencing of controller-0 failed: delegate=, client=pacemaker-controld.25646, origin=controller-2, last-failed='2020-11-03 11:08:16Z'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
########################################################################

The bug is very similar to the following BZ in OSP 13:
https://bugzilla.redhat.com/show_bug.cgi?id=1703946
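For context on what "stuck" means here: the guard script blocks nova_compute from starting until pacemaker reports the node's fence-down flag as cleared. The sketch below is a hypothetical simplification, not the actual check-run-nova-compute script; the attribute name "evacuate", the attrd_updater query, and the value="no" convention are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical simplification of the wait loop in
# /var/lib/nova/instanceha/check-run-nova-compute. The helper decides,
# from an attrd_updater-style query output line, whether the fence-down
# flag is still set for this node (output format is an assumption).
fence_flag_cleared() {
    # A query is assumed to print something like:
    #   name="evacuate" host="computeovsdpdksriov-0" value="no"
    case "$1" in
        *'value="no"'*) return 0 ;;  # evacuation finished, flag cleared
        *) return 1 ;;               # still flagged (or no answer yet)
    esac
}

# The guard loop would then look like (requires a running pacemaker
# remote node, so it is left commented out here):
# while ! fence_flag_cleared "$(attrd_updater -Q -n evacuate -N "$(hostname -s)")"; do
#     echo "Waiting for fence-down flag to be cleared"
#     sleep 10
# done
```

The symptom in this BZ corresponds to the (commented) loop never terminating, because the rebooted node is never unfenced and the flag is never cleared.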
Sosreports available in the following link: http://file.mad.redhat.com/~mbabushk/sosreports/bz1894097/
(In reply to Maxim Babushkin from comment #1)
> Sosreports available in the following link:
> http://file.mad.redhat.com/~mbabushk/sosreports/bz1894097/

Hi Maxim,

can you please double check the permissions? I can't download those sosreports (getting 403).

thanks,
Luca
Hi Luca, Fixed. Thanks.
Hi Luca,

I applied the patch details from the LP bug.
The reconnect interval was set to 300.

The hypervisor is still stuck after the crash with the same error - "Waiting for fence-down flag to be cleared".
(In reply to Maxim Babushkin from comment #7)
> Hi Luca,
>
> I applied the patch details from the lp bug.
> The reconnect interval set to 300.
>
> The hypervisor still stuck after the crash is the same error - "Waiting for
> fence-down flag to be cleared".

Hi Maxim,

Could we have access to the env to analyze it?

Thanks,
Michele
Daniel,

Our team is currently focused on the OVN DPDK + SR-IOV testing. These are urgent tasks.
Once finished, I will verify this BZ.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0817
The solution for the issue reported in this BZ is *not* only to update to the puppet-pacemaker referenced in the "Fixed In Version" field, but to also set a proper value for pacemaker_remote_reconnect_interval, like the following:

ExtraConfig:
  pacemaker_remote_reconnect_interval: XXX

Prior versions of puppet-pacemaker did not allow this parameter to be modified. Any server that takes more than the default amount of time to come back online (from a pacemaker perspective) will not be unfenced properly and will thus show the "Waiting for fence-down flag to be cleared" message. Operators should test and set a proper value for this parameter, to give their servers enough time to boot and pacemaker to re-enable nova_compute.
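As an illustration only (the XXX placeholder above is deliberately left unfilled; the 300-second value and the file name below are hypothetical and must be tuned per environment), such a setting would typically be carried in a TripleO environment file under parameter_defaults, e.g.:

```yaml
# reconnect-interval.yaml -- hypothetical TripleO environment file.
# 300 is an illustrative value only: measure how long your compute nodes
# actually take to boot back into pacemaker and choose a value
# comfortably larger than that, as comment #7 shows that too small a
# value still leaves the node stuck.
parameter_defaults:
  ExtraConfig:
    pacemaker_remote_reconnect_interval: 300
```

The file would then be passed to the overcloud deploy/update command with -e, alongside the environment files already in use.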