Description of problem:
When an affinity group is created and VMs are associated with it, those VMs fail to evacuate during nova-evacuate. The ServerGroupAffinityFilter returns no available hosts.

Version-Release number of selected component (if applicable):
RHOSP-10

How reproducible:
Always

Steps to Reproduce:
1. 3 compute nodes, each with 2 VM instance pairs (each pair in its own affinity group)
2. Scheduler has placed a VM pair on each of the 3 computes.
3. Stop nova-compute on one compute
4. Initiate nova-evacuate for the compute on which nova-compute is stopped.

Actual results:
The scheduler fails to place the VM pair on any host; ServerGroupAffinityFilter rejects every candidate.

Expected results:
The instance pair should be migrated to one of the other available computes, with both instances on the same compute.

Additional info:
Log entry showing ServerGroupAffinityFilter not passing any hosts:
2018-04-26 13:22:43.037 601373 DEBUG nova.filters [req-cd3303ff-7b1a-4446-8ee4-2b91ae0e90a2 f45ce2faaa024212a13b84b1ecf088a4 5070892a5ef644d99e015e3c626a0084 - - -] Filtering removed all hosts for the request with instance ID 'e872b26c-5702-4dd5-9b24-da5be5835ce2'. 
Filter results: [('RetryFilter', [(u'overcloud-controller-0.localdomain', u'overcloud-controller-0.localdomain'), (u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('AvailabilityZoneFilter', [(u'overcloud-controller-0.localdomain', u'overcloud-controller-0.localdomain'), (u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('RamFilter', [(u'overcloud-controller-0.localdomain', u'overcloud-controller-0.localdomain'), (u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('DiskFilter', [(u'overcloud-controller-0.localdomain', u'overcloud-controller-0.localdomain'), (u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('ComputeFilter', [(u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('ComputeCapabilitiesFilter', [(u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('ImagePropertiesFilter', [(u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('ServerGroupAntiAffinityFilter', [(u'overcloud-compute-1.localdomain', u'overcloud-compute-1.localdomain'), (u'overcloud-compute-2.localdomain', u'overcloud-compute-2.localdomain')]), ('ServerGroupAffinityFilter', None)] get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:129 2018-04-26 13:22:43.037 601373 INFO nova.filters [req-cd3303ff-7b1a-4446-8ee4-2b91ae0e90a2 f45ce2faaa024212a13b84b1ecf088a4 5070892a5ef644d99e015e3c626a0084 - - -] 
Filtering removed all hosts for the request with instance ID 'e872b26c-5702-4dd5-9b24-da5be5835ce2'. Filter results: ['RetryFilter: (start: 3, end: 3)', 'AvailabilityZoneFilter: (start: 3, end: 3)', 'RamFilter: (start: 3, end: 3)', 'DiskFilter: (start: 3, end: 3)', 'ComputeFilter: (start: 3, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)', 'ImagePropertiesFilter: (start: 2, end: 2)', 'ServerGroupAntiAffinityFilter: (start: 2, end: 2)', 'ServerGroupAffinityFilter: (start: 2, end: 0)']
Nova does not (and cannot) move two instances at the same time. It also will not move an instance in response to any action other than an instance move request (for example, a change in the server group). Thus, there's really no way for nova to do the thing you expect (moving both instances) today, and it is unlikely that this will change.

The only realistic potential for improvement here is to allow adding and removing instances from server groups, which would let you break the affinity bond and move one of the instances alone. However, you'd have to have some way to move both of those to the same host before you could put them back in an affinity group. CRUD operations on server groups have been a point of contention upstream in the past and are also not likely to be implemented soon.

So, I'm going to close this as WONTFIX for the above reasons. If you want to actually request implementation of CRUD operations as a workaround, that should be a new RFE bug.
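To illustrate why the filter count in the log drops from 2 to 0, here is a simplified model of an affinity check, not nova's actual code. It assumes the filter only passes hosts that already run a member of the group; the function name `affinity_filter` and the host name `failed-compute.localdomain` are hypothetical. Because only one instance is moved at a time, the other member of the pair is still recorded on the failed compute, so no surviving host can match.

```python
def affinity_filter(candidate_hosts, group_hosts):
    """Simplified affinity check (illustrative only).

    If the group already has members placed somewhere, only hosts that
    currently run a group member are allowed. An empty group allows
    every candidate.
    """
    if not group_hosts:
        return list(candidate_hosts)
    return [host for host in candidate_hosts if host in group_hosts]


# Hosts that survived the earlier filters, as shown in the log above.
candidates = [
    "overcloud-compute-1.localdomain",
    "overcloud-compute-2.localdomain",
]

# Hypothetical name for the compute whose nova-compute service was stopped.
# The other instance of the pair is still recorded there.
group_hosts = {"failed-compute.localdomain"}

remaining = affinity_filter(candidates, group_hosts)
print(remaining)
```

Under these assumptions the result is an empty list, which matches the "start: 2, end: 0" entry for ServerGroupAffinityFilter.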
As a workaround, as of microversion 2.29 (included in Newton/OSP10), it is possible for an admin to entirely avoid the scheduler during an evacuate operation by passing a host and the 'force' parameter [1]. Using this, an admin can manually choose a new host and evacuate all instances in the same affinity group to that host. This allows the admin to temporarily "break" affinity to evacuate instances.

[1] https://developer.openstack.org/api-ref/compute/#evacuate-server-evacuate-action
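A minimal sketch of what such a forced evacuate request could look like, assuming the standard server action endpoint and the microversion header. The helper name `build_forced_evacuate`, the second server ID, and the choice of target host are illustrative assumptions, not part of this report. The sketch only builds the request; sending it (with a valid admin token in `X-Auth-Token`) to the compute API endpoint is left out.

```python
import json


def build_forced_evacuate(server_id, target_host):
    """Build path, headers, and JSON body for a forced evacuate call.

    Hypothetical helper: it does not send anything. The caller still has
    to add an admin auth token and POST to the compute endpoint.
    """
    path = "/servers/{}/action".format(server_id)
    headers = {
        "Content-Type": "application/json",
        # 'force' requires compute API microversion 2.29 or later.
        "X-OpenStack-Nova-API-Version": "2.29",
    }
    body = json.dumps({"evacuate": {"host": target_host, "force": True}})
    return path, headers, body


# Evacuate every member of the affinity group to the same target host.
# The first ID is the instance from the log; the second is a placeholder.
group_members = [
    "e872b26c-5702-4dd5-9b24-da5be5835ce2",
    "00000000-0000-0000-0000-000000000000",
]
target = "overcloud-compute-1.localdomain"  # assumed choice of destination

requests_to_send = [build_forced_evacuate(sid, target) for sid in group_members]
for path, headers, body in requests_to_send:
    print(path, body)
```

Since the scheduler is skipped when 'force' is set, it is up to the admin to confirm that the chosen host has enough capacity for every instance in the group.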