Bug 2127285

Summary: Instance HA compute nodes not coming up after outage
Product: Red Hat OpenStack
Reporter: nalmond
Component: openstack-tripleo-heat-templates
Assignee: Luca Miccini <lmiccini>
Status: MODIFIED
QA Contact: Joe H. Rahme <jhakimra>
Severity: high
Priority: high
Version: 17.1 (Wallaby)
CC: dasmith, dhill, eglynn, jhakimra, kchamart, lmiccini, mburns, ratailor, sbauza, sgordon, smooney, vromanso
Target Milestone: z2
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-17.1.20230806061110.e63d633.el9osttrunk
Type: Bug

Description nalmond 2022-09-15 20:21:27 UTC
Description of problem:
Following a controlplane outage caused by a traffic flood, multiple compute nodes were reported as down in 'openstack compute service list'. The controlplane had been restored, but these compute nodes remained down. In the output of 'podman logs nova_compute' on the affected compute nodes, we saw this repeatedly:

Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Checking 169 migrations
Waiting for evacuations to complete or fail
Checking 169 migrations
Waiting for evacuations to complete or fail
...truncated...

Logging to /var/log/containers/nova/nova-compute.log had stopped.

There were only about 30 instances on this node, nowhere near 169. The evacuations did not appear to be in progress: only 1 instance was in "MIGRATING" state, and it was on a different compute node from the one where the above messages were captured (though that node was also affected). It does not look like the nodes were successfully fenced.
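To illustrate how a wait loop of this kind can report far more migrations than there are instances, and can spin indefinitely, here is a minimal Python sketch. It is not the actual check-run-nova-compute script. The function names, the `fetch` callable, the set of "in progress" statuses, and the `max_polls` bound are all assumptions made for illustration. The sketch assumes the printed count covers every migration record returned for the host (including historical ones), while the loop exits only when no evacuation record for that host is still in progress.

```python
import time
from typing import Callable, Iterable, List, Mapping, Optional

# Assumption: statuses treated as "evacuation still in progress".
IN_PROGRESS = {"accepted", "pre-migrating"}


def pending_evacuations(migrations: Iterable[Mapping], host: str) -> List[Mapping]:
    """Return evacuation records sourced from `host` that are still in progress."""
    return [
        m for m in migrations
        if m.get("migration_type") == "evacuation"
        and m.get("source_compute") == host
        and m.get("status") in IN_PROGRESS
    ]


def wait_for_evacuations(
    fetch: Callable[[], Iterable[Mapping]],  # hypothetical: returns migration records
    host: str,
    interval: float = 10.0,
    max_polls: Optional[int] = None,  # hypothetical safety bound, not in the log above
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no evacuation is pending; return False if max_polls is exhausted."""
    polls = 0
    while True:
        records = list(fetch())
        # The count includes every record, finished or not, which is why it
        # can be much larger than the number of instances on the node.
        print("Checking %d migrations" % len(records))
        if not pending_evacuations(records, host):
            return True
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        print("Waiting for evacuations to complete or fail")
        sleep(interval)
```

Under these assumptions, a single stale record left in an in-progress status after the outage would be enough to keep the loop running forever when no bound is set, even though no evacuation is actually taking place.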

We were able to recover from this with a manual code change to the script in /var/lib/nova/instanceha/check-run-nova-compute, but this should not be necessary. Is there a better way to recover from this or, even better, to prevent this state in the first place?

Version-Release number of selected component (if applicable):
16.1.8

How reproducible:
In one environment following a network outage.

Steps to Reproduce:
1. Experience controlplane network outage
2. Stop outage
3. Attempt to restore overcloud services
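
After step 3, the symptom can be confirmed by listing the nova-compute services that are still down. The following is a small Python sketch, assuming the service list has been saved as JSON (for example with `openstack compute service list -f json`) and that the rows carry `Binary`, `Host`, and `State` keys. The helper name `down_compute_hosts` is hypothetical.

```python
import json
from typing import List


def down_compute_hosts(service_list_json: str) -> List[str]:
    """Return the sorted hosts whose nova-compute service is reported as down."""
    rows = json.loads(service_list_json)
    return sorted(
        row["Host"]
        for row in rows
        # Assumed column names; adjust if the JSON output uses different keys.
        if row.get("Binary") == "nova-compute" and row.get("State") == "down"
    )
```

Example use, with the JSON saved to a file named services.json (an arbitrary name):

```python
with open("services.json") as handle:
    print(down_compute_hosts(handle.read()))
```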

Comment 15 Luca Miccini 2023-08-10 11:03:46 UTC
let's use this bz to track the backport of https://review.opendev.org/c/openstack/tripleo-heat-templates/+/564024

Comment 17 Luca Miccini 2023-08-10 11:06:24 UTC
(In reply to Luca Miccini from comment #15)
> let's use this bz to track the backport of
> https://review.opendev.org/c/openstack/tripleo-heat-templates/+/564024

obviously meant revert instead of backport: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/889698