Description of problem:
After updating to 16.2.5, the customer is facing the same issue on different sites/platforms: live migrations are failing. The problem seems related to the scheduler, which is not able to find an available host (even on an almost empty platform).

Version-Release number of selected component (if applicable):
RHOSP 16.2.5

How reproducible:
Try to live-migrate a VM.

Actual results:
Filter AvailabilityZoneFilter returned 0 hosts

Expected results:
Live migration succeeds.

Additional info:
(more info/data will follow in comments)
The customer observed that:
Platforms updated from 16.2.4 --> 16.2.5 work fine.
Platforms updated from a minor release prior to 16.2.4 are hitting the issue:
16.2.3 --> 16.2.5 failing
16.2.1 --> 16.2.5 failing
I don't think engineering should provide a detailed RCA in this case. Live migrating via Horizon should not be able to break Nova. From artom's initial triage of this issue, it appears that the AZs were created after the instance they looked at was created. Nova has code to prevent you from changing the AZ of a host that has an instance on it. That code has been in place since OSP 15, so it should not have been possible for them to move any host that had an instance from the nova AZ to Prod_AZ1.

Reviewing the change that introduced the instance check, https://github.com/openstack/nova/commit/8e19ef4173906da0b7c761da4de0728a2fd71e24, it missed one edge case: when is_safe_to_update_az is invoked in add_host_to_aggregate, check_no_instances_in_az is not passed:

https://github.com/openstack/nova/blob/master/nova/compute/api.py#L6648-L6649
self.is_safe_to_update_az(context, aggregate.metadata, hosts=[host_name], aggregate=aggregate)

It defaults to False:

https://github.com/openstack/nova/blob/master/nova/compute/api.py#L6549-L6553
def is_safe_to_update_az(self, context, metadata, aggregate, hosts=None, action_name=AGGREGATE_ACTION_ADD, check_no_instances_in_az=False):

So currently it is technically possible, but not supported, to add a host to an AZ, provided it is not currently part of an existing AZ, and bypass the instance check. It has never been supported to move a host with an instance between AZs, but because check_no_instances_in_az=True is not passed, this can happen via the API.

My guess is that this is what the customer did: they moved the hosts into the AZ with instances on them, even though that is not a supported operation, and they were not blocked from doing so due to the bug. So the root cause would be user error that was not blocked by the API, due to a missing check to reject the invalid request, resulting in hosts with instances being moved to a different AZ. As a result, any instance that requested the nova AZ specifically can no longer be moved.
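To make the gap easier to see, here is a minimal, self-contained sketch of the pattern described in this comment. It is not Nova's code: the Aggregate class, the INSTANCES_ON_HOST inventory, the InvalidAggregateAction exception and the enforce_instance_check parameter are all hypothetical, and only the idea of a safety check that is skipped because a keyword argument defaults to False is taken from the snippets quoted above.

```python
from dataclasses import dataclass, field


class InvalidAggregateAction(Exception):
    """Stand-in for the error a rejected aggregate update would raise."""


@dataclass
class Aggregate:
    name: str
    metadata: dict = field(default_factory=dict)
    hosts: list = field(default_factory=list)


# Hypothetical inventory: host name -> number of instances running on it.
INSTANCES_ON_HOST = {"compute-0": 3, "compute-1": 0}


def is_safe_to_update_az(metadata, aggregate, hosts=None,
                         check_no_instances_in_az=False):
    """Toy version of the safety check; only the instance check is modelled."""
    if "availability_zone" not in metadata:
        return  # the aggregate does not define an AZ, nothing to protect
    if not check_no_instances_in_az:
        return  # the default: hosts are never inspected for instances
    candidates = hosts if hosts is not None else aggregate.hosts
    busy = [h for h in candidates if INSTANCES_ON_HOST.get(h, 0) > 0]
    if busy:
        raise InvalidAggregateAction(
            f"hosts {busy} have instances; refusing to change their AZ")


def add_host_to_aggregate(aggregate, host_name, enforce_instance_check=False):
    """Add a host; enforce_instance_check=False mirrors the call quoted above."""
    if enforce_instance_check:
        is_safe_to_update_az(aggregate.metadata, aggregate, hosts=[host_name],
                             check_no_instances_in_az=True)
    else:
        # The flag is omitted, so the default of False applies.
        is_safe_to_update_az(aggregate.metadata, aggregate, hosts=[host_name])
    aggregate.hosts.append(host_name)
    return aggregate


prod = Aggregate("AZ1", metadata={"availability_zone": "Prod_AZ1"})

# Flag omitted: a host with three instances is accepted into the AZ.
add_host_to_aggregate(prod, "compute-0")

# Flag passed: the same kind of request is rejected.
strict = Aggregate("AZ1-strict", metadata={"availability_zone": "Prod_AZ1"})
try:
    add_host_to_aggregate(strict, "compute-0", enforce_instance_check=True)
    rejected = False
except InvalidAggregateAction:
    rejected = True
```

In this toy model the only difference between the accepted and the rejected request is whether the flag is passed, which is the behaviour the comment attributes to the add-host path.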
If this is using the 'Update host aggregate' API in Nova, which it looks like it is, then this is specifically doing something that Nova strongly discourages, as you can see in the red warning box in the api-ref link at [1]. Let me add DFG:UI to this BZ for the Horizon angle. @DFG:UI - tl;dr: the customer is doing Admin tab --> Compute --> Host Aggregates --> AZ1 --> Edit Host Aggregates --> Selected 4 nodes into AZ1 --> Save. It looks to me like this would hit Nova's 'Update host aggregate' API, and we specifically have a note [1] to NOT do that if there are instances on the affected hosts. Is my understanding correct, and can something be done in Horizon to at least show the same warning as Nova's api-ref? [1] https://docs.openstack.org/api-ref/compute/?expanded=update-aggregate-detail#update-aggregate
I would also like to know whether the customer doing Admin tab --> Compute --> Host Aggregates --> AZ1 --> Edit Host Aggregates --> Selected 4 nodes into AZ1 --> Save is the same as running the command "openstack server unshelve <$UUID> --availability-zone PROD_AZ1". Do the two do the same thing?
Those are completely different things. What the customer did in Horizon updates the aggregate metadata and effectively moves hosts to a different AZ. The openstack server unshelve command operates on _instances_ (not on _hosts_). You can read more about shelving here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/creating_and_managing_instances/assembly_managing-an-instance_instances#proc_shelving-an-instance_instances
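As a rough illustration of why moving hosts to a different AZ can lead to the reported "Filter AvailabilityZoneFilter returned 0 hosts", here is a simplified model of AZ-based host filtering. It is a sketch under stated assumptions, not Nova's filter: the function names, the dictionary layout and the host names are hypothetical, and it assumes that a host outside any AZ-tagged aggregate falls back to a default zone named "nova".

```python
DEFAULT_AZ = "nova"  # assumed default zone for hosts outside any AZ aggregate


def zones_of(host, aggregates):
    """Return the AZ names a host belongs to, falling back to the default."""
    zones = {
        agg["metadata"]["availability_zone"]
        for agg in aggregates
        if host in agg["hosts"] and "availability_zone" in agg["metadata"]
    }
    return zones or {DEFAULT_AZ}


def availability_zone_filter(hosts, aggregates, requested_az):
    """Keep only hosts whose AZ matches the one the instance is pinned to."""
    if requested_az is None:
        return list(hosts)  # no AZ requested: every host passes
    return [h for h in hosts if requested_az in zones_of(h, aggregates)]


hosts = ["compute-0", "compute-1", "compute-2", "compute-3"]

# Before the change: no AZ aggregate exists, so every host is in "nova".
before = availability_zone_filter(hosts, [], "nova")

# After the change: all four hosts were added to an aggregate tagged Prod_AZ1.
aggregates = [{"hosts": list(hosts),
               "metadata": {"availability_zone": "Prod_AZ1"}}]
after = availability_zone_filter(hosts, aggregates, "nova")
unpinned = availability_zone_filter(hosts, aggregates, None)
```

In this model an instance pinned to the "nova" zone matches every host before the aggregate change and none afterwards, while an instance with no requested AZ is unaffected. The hosts changed zone; the instance's request did not.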