Bug 1977667 - Compute service DOWN after FFU from RHOSP13 to 16.1 because service version is still 30.
Summary: Compute service DOWN after FFU from RHOSP13 to 16.1 because service version is still 30.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z8
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Artom Lifshitz
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 2092802
Depends On:
Blocks: 1987225
 
Reported: 2021-06-30 08:57 UTC by Rohini Diwakar
Modified: 2023-03-21 19:45 UTC (History)
CC: 17 users

Fixed In Version: openstack-nova-20.4.1-1.20210809183306.1ee93b9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1986406 1987225
Environment:
Last Closed: 2022-03-24 11:04:43 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 801285 0 None NEW WIP: allow service delete even if compute node record is missing 2021-08-12 09:27:38 UTC
Red Hat Issue Tracker OSP-5650 0 None None None 2021-11-15 13:08:34 UTC
Red Hat Issue Tracker UPG-3185 0 None None None 2021-10-12 13:38:55 UTC
Red Hat Product Errata RHSA-2022:0983 0 None None None 2022-03-24 11:04:58 UTC

Description Rohini Diwakar 2021-06-30 08:57:17 UTC
Description of problem:
After the upgrade from OSP13 to OSP16.1, the nova-compute and nova-consoleauth services are in DOWN state on the controllers. These look like stale entries, and deleting the services results in the following error:
+++
Failed to delete compute service with ID 'dfde0572-447e-4efe-8200-daf76f39f098': Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.ComputeHostNotFound'> (HTTP 500) (Request-ID: req-a4313d40-a9e3-47ca-99c8-e11183235092)
+++
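For reference, a hypothetical reproduction of the failing cleanup using standard openstackclient commands (run with admin credentials sourced; the UUID is the one from the error message above):

    # list the stale nova-compute service records reported as down
    $ openstack compute service list --service nova-compute
    # attempting to remove the stale record returns the HTTP 500 quoted above
    $ openstack compute service delete dfde0572-447e-4efe-8200-daf76f39f098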

This is impacting migration of VMs.
+++
Instance has an associated NUMA topology, cell contains compute nodes older than train, and the enable_numa_live_migration workaround is disabled. Refusing to perform the live migration, as the instance NUMA topology, including related attributes such as CPU pinning, huge page and emulator thread pinning information, cannot be recalculated. See bug #1289064 for more information.
+++
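The check above refers to the existing [workarounds]/enable_numa_live_migration option in nova.conf. As an illustration only (in RHOSP this would normally be applied through the overcloud deployment templates rather than by editing files by hand), enabling it on the source computes bypasses the check, with the caveat from bug #1289064 that the instance NUMA topology, CPU pinning, huge page and emulator thread pinning information is not recalculated on the destination:

    [workarounds]
    # bypasses the "compute nodes older than Train" check for instances
    # with a NUMA topology; see bug #1289064 for the risks
    enable_numa_live_migration = true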

Version-Release number of selected component (if applicable):
RHOSP16.1


Actual results:
nova-consoleauth & nova-compute services are in DOWN state after the upgrade.

Expected results:
The nova-consoleauth & nova-compute services should have been cleaned up during the upgrade.

Additional info:

Comment 2 Artom Lifshitz 2021-06-30 13:56:23 UTC
(In reply to Rohini Diwakar from comment #0)
> Description of problem:
> After upgrade from osp13 to osp16.1 nova-compute and nova-consoleauth
> services are in DOWN state on controllers. Looks like stale entries and
> deleting the services results into the error: 
> +++
> Failed to delete compute service with ID
> 'dfde0572-447e-4efe-8200-daf76f39f098': Unexpected API Error. Please report
> this at http://bugs.launchpad.net/nova/ and attach the Nova API log if
> possible.
> <class 'nova.exception.ComputeHostNotFound'> (HTTP 500) (Request-ID:
> req-a4313d40-a9e3-47ca-99c8-e11183235092)

I wasn't able to find this request ID in any of the logs in the sosreports on supportshell:

[alifshit@supportshell 02975383]$ grep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 -R `!!`
grep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 -R `find . -name '*nova*log*'`
[supportshell.prod.useraccess-us-west-2.redhat.com] [13:50:51+0000]
[alifshit@supportshell 02975383]$

[alifshit@supportshell 02975383]$ zgrep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 -R `find . -name '*nova*log*'`
gzip: -R.gz: No such file or directory
[supportshell.prod.useraccess-us-west-2.redhat.com] [13:52:14+0000]
[alifshit@supportshell 02975383]$
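For what it's worth, the -R here gets treated as a file name by zgrep (hence the "gzip: -R.gz" error); dropping it and passing the find results directly would also search any rotated, compressed logs:

    $ zgrep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 `find . -name '*nova*log*'`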

I think that should be the place where we start looking - understanding the full error for why Nova wasn't able to delete the service. Any chance we could get at least the full stacktrace, or another set of sosreports uploaded that include that request ID? Thanks!

> +++
> 
> This is impacting migration of VMs.
> +++
> Instance has an associated NUMA topology, cell contains compute nodes older
> than train, and the enable_numa_live_migration workaround is disabled.
> Refusing to perform the live migration, as the instance NUMA topology,
> including related attributes such as CPU pinning, huge page and emulator
> thread pinning information, cannot be recalculated. See bug #1289064 for
> more information.
> +++
> 
> Version-Release number of selected component (if applicable):
> RHOSP16.1
> 
> 
> Actual results:
> nova-consoleauth & nova-compute service in DOWN state after upgrade.  
> 
> Expected results:
> nova-consoleauth & nova-compute should have been cleaned up during upgrade.
> 
> Additional info:

Comment 3 Artom Lifshitz 2021-07-07 11:19:21 UTC
Hi again - another question. Manually deleting nova-compute services is not part of the 13 to 16.1 FFU process. What initiated the delete request? The upgrade tooling, or a human being attempting to clean up leftover services? Thanks!

Comment 7 Artom Lifshitz 2021-07-19 12:04:43 UTC
@Luigi: Your issue isn't actually related to FFU, and I suspect it can be reproduced with just a controller replacement. I've started a WIP patch [1] to fix it.

@Rohini: The stack trace in your case is exactly the same as Luigi's, but there's no mention of a controller replacement. As far as I can tell, a simple FFU should not cause the behaviour that you're seeing. Are you sure there isn't a controller replacement, or a similar operation, going on here?

[1] https://review.opendev.org/c/openstack/nova/+/801285

Comment 8 smooney 2021-07-19 14:23:49 UTC
As far as we can tell, there seems to be a gap in the FFU procedure for ironic.
Nova is working correctly in preventing NUMA live migration since the ironic compute services have not been upgraded, so that is not a bug.
In an ironic deployment, the ironic virt driver runs in compute services deployed on the controllers.
This seems not to have been taken into account, so the hard_prov DFG and upgrades DFG need to come to an agreement on how to address this.

The current workflow for upgrades is that all compute services must be fully upgraded or in the hybrid state before any migration may happen.
As such, from a compute DFG perspective, we would expect all 3 controllers to be fully upgraded, including the ironic compute service instances, before putting the other compute services into the hybrid state (new containers, old RHEL), and only then should any migration happen as compute hosts are leapp-upgraded from RHEL 7 to RHEL 8.

Moving this over to the hard_prov DFG to triage and determine what workflow is best in this case.
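A hypothetical way to see which services are still recording the pre-upgrade service version from the bug title is to query the nova cell database on a controller (the galera-bundle-podman-0 container name and the plain mysql invocation are assumptions; adjust to the deployment at hand):

    $ sudo podman exec galera-bundle-podman-0 mysql nova \
        -e 'SELECT host, `binary`, version, deleted FROM services;'

Records still at version 30 correspond to the not-yet-upgraded services; fully upgraded Train services report a higher value.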

Comment 23 Artom Lifshitz 2021-08-04 17:04:55 UTC
Because this is getting a bit confusing, here's a summary that I'll reproduce in the other affected BZs.

BZ 1986406: Documentation for controller replacement

BZ 1990034: Documentation for FFU with Ironic as the virt driver

BZ 1977667 (this BZ): Nova fix to allow service deletion when a service has no associated compute nodes.

Comment 24 Artom Lifshitz 2021-08-09 20:06:26 UTC
The fix has been merged in our OSP 16.1 branch, and can be obtained from https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1688747 for hotfix delivery.
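To confirm a node is running the fixed build after applying the hotfix, a hypothetical check is to query the packages inside the nova containers (the nova_api container name is the usual one on OSP 16.1; verify against the actual deployment) and compare against the Fixed In Version above:

    $ sudo podman exec nova_api rpm -q openstack-nova-api openstack-nova-common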

Comment 48 Artom Lifshitz 2021-10-12 13:43:27 UTC
@DFG:Upgrades - just looking for an answer to the question in comment #47, nothing else.

Comment 65 errata-xmlrpc 2022-03-24 11:04:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack Platform 16.1 (openstack-nova) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0983

Comment 66 Artom Lifshitz 2022-06-06 16:42:57 UTC
*** Bug 2092802 has been marked as a duplicate of this bug. ***

