Bug 2207968

Summary: [OSP 16.2][nova] live-migrations failing after updating to 16.2.5
Product: Red Hat OpenStack
Reporter: Flavio Piccioni <fpiccion>
Component: openstack-nova
Assignee: Artom Lifshitz <alifshit>
Status: ASSIGNED
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium
Priority: medium
Version: 16.2 (Train)
CC: abhijadh, alifshit, dasmith, eglynn, jhakimra, jhardee, kchamart, sbauza, sgordon, smooney, sukar, vromanso
Target Milestone: z2
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Type: Bug

Description Flavio Piccioni 2023-05-17 13:14:58 UTC
Description of problem:
After updating to 16.2.5, the customer is seeing the same issue on different sites/platforms: live migrations are failing.

The problem seems related to the scheduler, which is not able to find an available host (even on an almost empty platform).

Version-Release number of selected component (if applicable):
RHOSP 16.2.5

How reproducible:
Try to live-migrate a VM.


Actual results:
Filter AvailabilityZoneFilter returned 0 hosts


Expected results:
Live migration to succeed


Additional info:
(more info/data will follow in comments)

The customer observed that:
Platforms updated from 16.2.4 --> 16.2.5 work fine.

Platforms updated from a minor release prior to 16.2.4 are hitting the issue:
16.2.3 --> 16.2.5 failing
16.2.1 --> 16.2.5 failing

Comment 17 smooney 2023-05-23 12:24:37 UTC
I don't think that engineering should provide a detailed RCA in this case.

Live migrating via Horizon should not be able to break Nova.
From Artom's initial triage of this issue, it appears that the AZs were created after the instance they looked at was created.
Nova has code to prevent you from changing the AZ of a host that has an instance on it.
That code has been in place since OSP 15, so it should not be possible for them to move any host that had an instance from the nova AZ to Prod_AZ1.

Reviewing the change that introduced the instance check:

https://github.com/openstack/nova/commit/8e19ef4173906da0b7c761da4de0728a2fd71e24

It missed one edge case: when is_safe_to_update_az is invoked in add_host_to_aggregate, check_no_instances_in_az is not passed:


https://github.com/openstack/nova/blob/master/nova/compute/api.py#L6648-L6649
self.is_safe_to_update_az(context, aggregate.metadata,
                                  hosts=[host_name], aggregate=aggregate)

It defaults to False: https://github.com/openstack/nova/blob/master/nova/compute/api.py#L6549-L6553
    def is_safe_to_update_az(self, context, metadata, aggregate,
                             hosts=None,
                             action_name=AGGREGATE_ACTION_ADD,
                             check_no_instances_in_az=False):


So currently it is technically possible, but not supported, to add a host to an AZ, provided it is not currently part of an existing AZ, and bypass the instance check. Moving a host with an instance between AZs has never been supported, but because check_no_instances_in_az=True is not passed, it can happen via the API.
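
To illustrate the bypass described above, here is a minimal sketch of how a keyword argument that defaults to False lets a caller skip a guard simply by omitting it. This is a simplified model, not Nova's actual implementation: the function bodies, the InstancesPresentError exception, the check_instances parameter, and the host names are all hypothetical.

```python
# Simplified model of the missing-flag edge case. NOT Nova's real code:
# the exception, the host data, and the function bodies are hypothetical.


class InstancesPresentError(Exception):
    """Raised when a host that still runs instances would change AZ."""


# Hypothetical state: host name -> number of instances running on it.
INSTANCES_ON_HOST = {"compute-0": 2, "compute-1": 0}


def is_safe_to_update_az(hosts, check_no_instances_in_az=False):
    """Run the instance check only when the caller asks for it."""
    if check_no_instances_in_az:
        busy = [h for h in hosts if INSTANCES_ON_HOST.get(h, 0) > 0]
        if busy:
            raise InstancesPresentError(f"hosts with instances: {busy}")
    return True


def add_host_to_aggregate(host, check_instances=False):
    """Add a host; the guard is bypassed unless the flag is passed."""
    if check_instances:
        is_safe_to_update_az([host], check_no_instances_in_az=True)
    else:
        # Same call shape as quoted above: the flag is omitted,
        # so it falls back to False and the instance check is skipped.
        is_safe_to_update_az([host])
    return f"{host} added"
```

In this sketch, add_host_to_aggregate("compute-0") succeeds even though compute-0 has instances, while passing check_instances=True raises InstancesPresentError.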

My guess is that this is what the customer did: they moved the hosts into the AZ with instances on them, even though that is not a supported operation, and they were not blocked from doing so due to the bug.

So the root cause would be user error that was not blocked by the API, due to a missing check to reject the invalid request, resulting in a host with instances being moved to a different AZ. As a result, any instance that specifically requested the nova AZ can no longer be moved.
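
A rough sketch of why the scheduler then reports "Filter AvailabilityZoneFilter returned 0 hosts": an instance that requested the nova AZ only matches hosts whose AZ is nova, and after the hosts were moved none remain. This is a simplified model of AZ filtering under stated assumptions, not the real filter code. The helper names and host names are hypothetical, and the fallback to a default AZ for hosts without AZ metadata is assumed here.

```python
# Simplified model of availability-zone filtering. NOT the real Nova
# filter: helper names and host names are hypothetical.

DEFAULT_AZ = "nova"  # assumed default AZ for hosts without AZ metadata


def host_passes(host_azs, requested_az):
    """Return True if a host with the given AZ set satisfies the request."""
    if requested_az is None:
        # No AZ requested: any host is acceptable.
        return True
    effective_azs = host_azs or {DEFAULT_AZ}
    return requested_az in effective_azs


def filter_hosts(hosts, requested_az):
    """Return the names of hosts that pass the AZ check, in input order."""
    return [name for name, azs in hosts.items()
            if host_passes(azs, requested_az)]


# After the aggregate change, every host belongs to Prod_AZ1.
HOSTS = {"compute-0": {"Prod_AZ1"}, "compute-1": {"Prod_AZ1"}}

# An instance created with the nova AZ no longer matches any host.
remaining = filter_hosts(HOSTS, "nova")
```

Here remaining is an empty list, which corresponds to the filter returning 0 hosts, whereas a request for Prod_AZ1 or a request with no AZ would still match both hosts.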

Comment 28 Artom Lifshitz 2023-06-05 17:45:39 UTC
If this is using the 'Update host aggregate' API in Nova, which it looks like it does, then this is specifically doing something that Nova strongly discourages, as you can see in the red warning box in the api-ref link at [1].

Let me add DFG:UI to this BZ for the Horizon angle.

@DFG:UI - tl;dr The customer is doing Admin tab --> Compute --> Host Aggregates --> AZ1 --> Edit Host Aggregates --> Selected 4 nodes into AZ1 --> Save. It looks to me like this would hit Nova's 'Update host aggregate' API, and we specifically have a note to NOT [1] do that if there are instances on the affected hosts. Is my understanding correct, and can something be done in Horizon to at least show the same warning as Nova's api-ref?

[1] https://docs.openstack.org/api-ref/compute/?expanded=update-aggregate-detail#update-aggregate

Comment 29 jhardee 2023-06-05 18:38:29 UTC
I would also like to know whether doing Admin tab --> Compute --> Host Aggregates --> AZ1 --> Edit Host Aggregates --> Selected 4 nodes into AZ1 --> Save is the same as running the command "openstack server unshelve <$UUID> --availability-zone PROD_AZ1".

Comment 30 Artom Lifshitz 2023-06-05 18:45:53 UTC
Those are completely different things.

What the customer did in Horizon updates the aggregate metadata and effectively moves hosts to a different AZ.

The openstack unshelve command operates on _instances_ (not on _hosts_). You can read more about shelving here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/creating_and_managing_instances/assembly_managing-an-instance_instances#proc_shelving-an-instance_instances