Description of problem:
After upgrading from OSP 13 to OSP 16.1, the nova-compute and nova-consoleauth services are in the DOWN state on the controllers. They look like stale entries, and deleting the services results in the following error:
+++
Failed to delete compute service with ID 'dfde0572-447e-4efe-8200-daf76f39f098': Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.ComputeHostNotFound'> (HTTP 500) (Request-ID: req-a4313d40-a9e3-47ca-99c8-e11183235092)
+++

This is impacting migration of VMs:
+++
Instance has an associated NUMA topology, cell contains compute nodes older than train, and the enable_numa_live_migration workaround is disabled. Refusing to perform the live migration, as the instance NUMA topology, including related attributes such as CPU pinning, huge page and emulator thread pinning information, cannot be recalculated. See bug #1289064 for more information.
+++

Version-Release number of selected component (if applicable):
RHOSP 16.1

Actual results:
nova-consoleauth & nova-compute services are in the DOWN state after the upgrade.

Expected results:
The nova-consoleauth & nova-compute services should have been cleaned up during the upgrade.

Additional info:
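For reference, the failing delete would typically be triggered like this (a sketch only; the exact commands used are not recorded in this report, and the service ID is the one from the error message above):

$ openstack compute service list
(the stale nova-compute / nova-consoleauth entries show State = down)
$ openstack compute service delete dfde0572-447e-4efe-8200-daf76f39f098
Failed to delete compute service with ID 'dfde0572-447e-4efe-8200-daf76f39f098': Unexpected API Error. ... (HTTP 500)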
(In reply to Rohini Diwakar from comment #0)
> Description of problem:
> After upgrade from osp13 to osp16.1 nova-compute and nova-consoleauth
> services are in DOWN state on controllers. Looks like stale entries and
> deleting the services results into the error:
> +++
> Failed to delete compute service with ID
> 'dfde0572-447e-4efe-8200-daf76f39f098': Unexpected API Error. Please report
> this at http://bugs.launchpad.net/nova/ and attach the Nova API log if
> possible.
> <class 'nova.exception.ComputeHostNotFound'> (HTTP 500) (Request-ID:
> req-a4313d40-a9e3-47ca-99c8-e11183235092)

I wasn't able to find this request ID in any of the logs in the sosreports on supportshell:

[alifshit@supportshell 02975383]$ grep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 -R `!!`
grep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 -R `find . -name '*nova*log*'`
[supportshell.prod.useraccess-us-west-2.redhat.com] [13:50:51+0000]
[alifshit@supportshell 02975383]$
[alifshit@supportshell 02975383]$ zgrep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 -R `find . -name '*nova*log*'`
gzip: -R.gz: No such file or directory
[supportshell.prod.useraccess-us-west-2.redhat.com] [13:52:14+0000]
[alifshit@supportshell 02975383]$

I think that should be the place where we start looking - understanding the full error for why Nova wasn't able to delete the service. Any chance we could get at least the full stacktrace, or another set of sosreports uploaded that include that request ID? Thanks!

> +++
>
> This is impacting migration of VMs.
> +++
> Instance has an associated NUMA topology, cell contains compute nodes older
> than train, and the enable_numa_live_migration workaround is disabled.
> Refusing to perform the live migration, as the instance NUMA topology,
> including related attributes such as CPU pinning, huge page and emulator
> thread pinning information, cannot be recalculated. See bug #1289064 for
> more information.
> +++
>
> Version-Release number of selected component (if applicable):
> RHOSP16.1
>
>
> Actual results:
> nova-consoleauth & nova-compute service in DOWN state after upgrade.
>
> Expected results:
> nova-consoleauth & nova-compute should have been cleaned up during upgrade.
>
> Additional info:
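(For what it's worth, the zgrep failure above appears to be zgrep treating -R as a file to search, hence the "-R.gz" error. A search that covers both the plain and the rotated/compressed nova logs could look like this - a sketch, assuming the sosreports are unpacked under the current directory:

$ find . -name '*nova*log*' ! -name '*.gz' -exec grep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 {} +
$ find . -name '*nova*log*.gz' -exec zgrep -l req-a4313d40-a9e3-47ca-99c8-e11183235092 {} +
)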
Hi again - another question. Manually deleting nova-compute services is not part of the 13 to 16.1 FFU process. What initiated the delete request? The upgrade tooling, or a human being attempting to clean up leftover services? Thanks!
@Luigi: Your issue isn't actually related to FFU, and I suspect it can be reproduced with just a controller replacement. I've started a WIP patch [1] to fix it.

@Rohini: The stack trace in your case is exactly the same as Luigi's, but in your case there's no mention of a controller replacement. As far as I can tell, a simple FFU should not cause the behaviour that you're seeing. Are you sure there isn't a controller replacement, or a similar operation, going on here?

[1] https://review.opendev.org/c/openstack/nova/+/801285
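For anyone hitting this before the fix is available, the failure mode can be confirmed from the CLI (a sketch; which records end up missing depends on how the replacement or upgrade left the database): a nova-compute service that is still listed but has no corresponding compute node (hypervisor) record is the case that currently fails to delete with ComputeHostNotFound.

$ openstack compute service list --service nova-compute
(stale entries show State = down)
$ openstack hypervisor list
(no matching hypervisor entry exists for the stale service)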
As far as we can tell, there seems to be a gap in the FFU procedure for Ironic. Nova is working correctly in preventing the NUMA live migration, since the Ironic compute services have not been upgraded, so that is not a bug.

In an Ironic deployment, the Ironic virt driver runs in the nova-compute services deployed on the controllers. This seems not to have been taken into account, so the hard_prov DFG and the upgrades DFG need to come to an agreement on how to address this.

The current workflow for upgrades is that all compute services must be fully upgraded, or in the hybrid state, before any migration may happen. As such, from a Compute DFG perspective, we would expect all 3 controllers to be fully upgraded, including the Ironic compute service instances, before putting the other compute services into the hybrid state (new containers, old RHEL), and only then should any migration happen as the compute hosts are leapp-upgraded from RHEL 7 to RHEL 8.

Moving this over to the hard_prov DFG to triage and determine what workflow is best in this case.
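For context, the workaround mentioned in the live-migration error is a nova.conf option in the [workarounds] section; it is off by default, and the point above is that the supported path is to fully upgrade the Ironic compute services on the controllers rather than to force migrations past the service-version check. A sketch of the option for reference only, not a recommendation to enable it:

[workarounds]
enable_numa_live_migration = True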
Because this is getting a bit confusing, here's a summary that I'll reproduce in the other affected BZs:

BZ 1986406: Documentation for controller replacement
BZ 1990034: Documentation for FFU with Ironic as the virt driver
BZ 1977667 (this BZ): Nova fix to allow service deletion when a service has no associated compute nodes
Fix has been merged in our OSP 16.1 branch, and can be obtained from https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1688747 for hotfix delivery.
@DFG:Upgrades - just looking for an answer to the question in comment #47, nothing else.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack Platform 16.1 (openstack-nova) security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0983
*** Bug 2092802 has been marked as a duplicate of this bug. ***