Bug 2127285 - Instance HA compute nodes not coming up after outage
Summary: Instance HA compute nodes not coming up after outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.1 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 17.1
Assignee: Luca Miccini
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-09-15 20:21 UTC by nalmond
Modified: 2024-01-16 14:31 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-17.1.20231103010821.e7c7ce3.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-01-16 14:31:55 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1764883 | 0 | None | None | None | 2022-09-16 14:42:35 UTC
OpenStack gerrit | 562284 | 0 | None | MERGED | compute: Ensure pre-migrating instances are destroyed during init_host | 2023-03-20 14:32:33 UTC
OpenStack gerrit | 564024 | 0 | None | MERGED | Instance HA: prevent compute to start on a host being evacuated | 2023-03-20 14:32:33 UTC
OpenStack gerrit | 889698 | 0 | None | MERGED | Partially revert "Instance HA: prevent compute to start on a host being evacuated" | 2023-08-10 11:05:02 UTC
Red Hat Bugzilla | 1567606 | 0 | medium | CLOSED | Nova reports overcloud instance in error state after failed double compute failover instance-ha evacuation | 2023-03-21 18:47:29 UTC
Red Hat Issue Tracker | OSP-18738 | 0 | None | None | None | 2022-09-15 20:35:17 UTC
Red Hat Product Errata | RHBA-2024:0209 | 0 | None | None | None | 2024-01-16 14:31:58 UTC

Description nalmond 2022-09-15 20:21:27 UTC
Description of problem:
Following a controlplane outage caused by a traffic flood, multiple compute nodes were reported as down in 'openstack compute service list'. The controlplane had been restored, but these compute nodes remained down. In the output of 'podman logs nova_compute' on the affected compute nodes, we saw this repeatedly:

Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Checking 169 migrations
Waiting for evacuations to complete or fail
Checking 169 migrations
Waiting for evacuations to complete or fail
...truncated...

Logging in /var/log/containers/nova/nova-compute.log had stopped.

There were only about 30 instances on this node, nowhere near 169. The evacuations did not appear to be in progress: only 1 instance was in "MIGRATING" state, and it was on a different compute node from the one where the above messages were captured (though that node was also affected). It does not look like the nodes were successfully fenced.
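
To make the mismatch between 169 migrations and roughly 30 instances easier to investigate, here is a minimal Python sketch that breaks a list of migration records down by status. It is not the code of check-run-nova-compute. It assumes, without confirmation from this report, that the number logged counts every migration record for the host rather than only evacuations still in progress. The record fields (source_compute, migration_type, status), the set of "in progress" statuses, and the function name summarize_migrations are illustrative assumptions.

```python
from collections import Counter

# Assumption for illustration: statuses treated as "still in progress".
IN_PROGRESS = {"accepted", "pre-migrating"}


def summarize_migrations(records, host):
    """Return (total records for host, counts per status, in-progress evacuations).

    `records` is a list of plain dicts; the keys used here are hypothetical
    stand-ins for whatever the migration listing actually returns.
    """
    on_host = [r for r in records if r["source_compute"] == host]
    by_status = Counter(r["status"] for r in on_host)
    pending = [
        r for r in on_host
        if r["migration_type"] == "evacuation" and r["status"] in IN_PROGRESS
    ]
    return len(on_host), by_status, pending
```

If the total is large but the in-progress list is short, the number in the log message mostly reflects historical records rather than active evacuations.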

We were able to recover from this with a manual code change to the script /var/lib/nova/instanceha/check-run-nova-compute, but this should not be necessary. Is there a better way to recover from this or, even better, to prevent this state in the first place?
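
For discussion, one way a startup gate of this kind could avoid blocking forever is to bound the wait. The Python sketch below is an illustration under stated assumptions only: it is not the manual change made in this environment (the change is not described here) and it is not the eventual fix. The function name wait_for_evacuations, the count_pending callback, and the timeout and interval defaults are all hypothetical.

```python
import time


def wait_for_evacuations(count_pending, timeout=300.0, interval=10.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until no evacuations are pending or the timeout expires.

    `count_pending` is a hypothetical callable returning the number of
    evacuations still in progress for this host. Returns True if the count
    reached zero, False if the deadline passed first, so the caller can
    decide whether to start nova-compute anyway or fail loudly.
    """
    deadline = clock() + timeout
    while True:
        if count_pending() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Returning a boolean instead of looping indefinitely leaves the policy decision (start anyway, or exit with an error an operator can see) to the caller.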

Version-Release number of selected component (if applicable):
16.1.8

How reproducible:
In one environment following a network outage.

Steps to Reproduce:
1. Experience controlplane network outage
2. Stop outage
3. Attempt to restore overcloud services

Comment 15 Luca Miccini 2023-08-10 11:03:46 UTC
let's use this bz to track the backport of https://review.opendev.org/c/openstack/tripleo-heat-templates/+/564024

Comment 17 Luca Miccini 2023-08-10 11:06:24 UTC
(In reply to Luca Miccini from comment #15)
> let's use this bz to track the backport of
> https://review.opendev.org/c/openstack/tripleo-heat-templates/+/564024

Correction: I meant revert, not backport: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/889698

Comment 22 Luca Miccini 2023-11-14 06:57:15 UTC
[stack@undercloud-0 ~]$ cat core_puddle_version 
RHOS-17.1-RHEL-9-20231110.n.1

[root@compute-0 ~]# podman exec -it -u root nova_compute bash
[root@compute-0 /]# cat /run_command 
/usr/bin/nova-compute

[root@compute-0 ~]# ls /var/lib/nova/instanceha/check-run-nova-compute
ls: cannot access '/var/lib/nova/instanceha/check-run-nova-compute': No such file or directory

Comment 31 errata-xmlrpc 2024-01-16 14:31:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209

