Bug 2182132

Summary: [OSP 16.1] Live migration only succeeds if destination hypervisor is specified
Product: Red Hat OpenStack Reporter: Matsvei Hauryliuk <mhauryli>
Component: openstack-nova Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED NOTABUG QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: low Docs Contact:
Priority: unspecified    
Version: 16.1 (Train) CC: alifshit, astupnik, dasmith, dhill, eglynn, enothen, jhakimra, jveiraca, kchamart, sbauza, sgordon, vromanso
Target Milestone: --- Keywords: Reopened
Target Release: --- Flags: mhauryli: needinfo+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-31 08:39:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Matsvei Hauryliuk 2023-03-27 16:08:55 UTC
Description of problem:
The client has a custom non-director deployment with hundreds of compute nodes split across different host aggregates. When attempting a live migration of an instance without specifying a destination host:

# openstack server migrate <instance01> --live-migration

an error message is triggered. However, when the destination hypervisor is specified explicitly:

# openstack server migrate <instance01> --live-migration --host <host01> --os-compute-api-version 2.30

the operation is successful.

Inside the nova-conductor.log:
2023-02-20 14:36:01.585 32 WARNING nova.scheduler.utils |..| Setting instance to ACTIVE state.: nova.exception_Remote.InvalidSharedStorage_Remote: <host01> is not on shared storage: Shared storage live-migration requires either shared storage or boot-from-volume with no local disks.

Version-Release number of selected component (if applicable):
RHOSP 16.1.6

How reproducible:
Live-migrate an instance.

Steps to Reproduce:
1. # openstack server migrate <instance01> --live-migration

Actual results:
Migration failed.

Expected results:
Successful migration to an available host per actual workload on respective hypervisors.

Additional info:
We've collected the outputs from:
# openstack server show <instance01>
# openstack compute service list

As well as Sosreports from controller nodes and computes.

Comment 2 Artom Lifshitz 2023-03-27 18:05:45 UTC
I suspect this is a side effect of using the newer microversion rather than of specifying the destination. In microversions up to 2.24, the block_migration parameter is a boolean, either True or False. If it is set to False but the instance is not on shared storage, live migration fails, because block migration is needed to migrate the instance's disk. Starting with 2.25, block_migration can also take the string value 'auto', which makes Nova auto-detect the storage situation. See [1] for full details.

If you retry the migration as follows:

# openstack server migrate <instance01> --live-migration --os-compute-api-version 2.30

I expect it to work.

[1] https://docs.openstack.org/api-ref/compute/?expanded=live-migrate-server-os-migratelive-action-detail#live-migrate-server-os-migratelive-action
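
The microversion difference described in this comment can be sketched as the request body sent to the os-migrateLive server action. This is only an illustration: the helper names parse_microversion and build_live_migrate_body are hypothetical, and defaulting block_migration to False under older microversions is an assumption chosen to mirror the failing case in this report, not a statement about what any particular client sends.

```python
from typing import Optional, Tuple, Union


def parse_microversion(version: str) -> Tuple[int, int]:
    """Turn a string such as '2.30' into (2, 30) so versions compare numerically."""
    major, minor = version.split(".")
    return int(major), int(minor)


def build_live_migrate_body(
    microversion: str,
    host: Optional[str] = None,
    block_migration: Union[bool, str, None] = None,
) -> dict:
    """Build an illustrative os-migrateLive body (hypothetical helper).

    host=None leaves the choice of destination to the scheduler.
    """
    version = parse_microversion(microversion)
    params = {"host": host}
    if version >= (2, 25):
        # From 2.25 on, block_migration may be 'auto' and
        # disk_over_commit is no longer part of the request.
        params["block_migration"] = (
            "auto" if block_migration is None else block_migration
        )
    else:
        # Up to 2.24 the value must be a boolean; 'auto' is not accepted.
        if block_migration == "auto":
            raise ValueError("'auto' requires microversion 2.25 or later")
        # Assumption for this sketch: an unspecified value becomes False.
        params["block_migration"] = bool(block_migration)
        params["disk_over_commit"] = False
    return {"os-migrateLive": params}


if __name__ == "__main__":
    # Old microversion, no destination: block_migration is a plain False.
    print(build_live_migrate_body("2.1"))
    # Microversion 2.30, no destination: Nova is asked to auto-detect storage.
    print(build_live_migrate_body("2.30"))
```

Parsing the microversion into a tuple avoids comparing version strings lexically, which would order "2.3" after "2.25".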

Comment 17 Artom Lifshitz 2023-06-12 22:35:53 UTC
I have closed this bug as it has been waiting for more info for at least 4 weeks (see my comment #14). We only do this to ensure that we don't accumulate stale bugs which can't be addressed. If you are able to provide the requested information, please feel free to re-open this bug.

Comment 30 Sylvain Bauza 2023-07-31 08:39:42 UTC
So, I created a functional test to check whether force_hosts/nodes and the scheduler hint '_nova_check_type' are persisted after rebuilding an instance.

As you can see in https://review.opendev.org/c/openstack/nova/+/889843, no, we *don't persist* these RequestSpec fields' values so any other move operation after rebuild shouldn't have any problem.
So, once again, all I can say is that this is not the root cause of this BZ.

Given that we're not able to reproduce the problem, that we can't really determine the root cause of the live migration issue, and that the customer now appears to be happy, I prefer to close this bug report as NOTABUG.