Bug 2093905

Summary: [OSP17][Live Migration] Unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported
Product: Red Hat OpenStack
Reporter: Marian Krcmarik <mkrcmari>
Component: openstack-tripleo-heat-templates
Assignee: Bogdan Dobrelya <bdobreli>
Status: CLOSED DUPLICATE
QA Contact: Joe H. Rahme <jhakimra>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 17.0 (Wallaby)
CC: alifshit, bdobreli, dasmith, eglynn, jhakimra, kchamart, mburns, oblaut, sbauza, sgordon, smooney, stchen, vromanso
Target Milestone: beta
Keywords: AutomationBlocker, Regression, Triaged
Target Release: 17.0
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-07-28 12:50:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
nova-compute log none

Description Marian Krcmarik 2022-06-06 10:34:31 UTC
Created attachment 1887056 [details]
nova-compute log

Description of problem:
Bug #2077964 covers failing live migration on OSP17 and appears to fix the rollback of a failed live migration with the following patch:
https://review.opendev.org/c/openstack/nova/+/839227/
But it seems the original reason why the migration fails was not looked at. I applied the job and the tempest live migration tests still fail with the following error:
ERROR nova.virt.libvirt.driver [-] [instance: c3dd26db-d700-469a-8640-4aefac17883f] Live Migration failure: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported: libvirt.libvirtError: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported
But the rollback is successful and there is no NotImplemented error thrown anymore.
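
For anyone triaging a similar nova-compute log, the error above can be picked out with a short script. This is only a minimal sketch: the regular expression is modelled on the single log line quoted in this report, and the function names are hypothetical helpers, not part of nova.

```python
import re

# Pattern modelled on the failure line quoted above (an assumption based on
# that one line, not on nova's log format in general).
FAILURE_RE = re.compile(
    r"\[instance: (?P<uuid>[0-9a-f-]{36})\] Live Migration failure: (?P<reason>.*)"
)


def find_live_migration_failures(lines):
    """Return (instance_uuid, reason) tuples for each failure line found."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            failures.append((match.group("uuid"), match.group("reason").strip()))
    return failures


def is_postcopy_unsupported(reason):
    """True if the reason text carries the QEMU postcopy message from this bug."""
    return "Postcopy is not supported" in reason
```

To use it, pass any iterable of lines, for example an open file handle on the attached log, and filter the results with is_postcopy_unsupported() to separate this failure from other live migration errors.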

The following tempest tests are failing:
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_migration_with_trunk
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_block_migration
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_migration_with_trunk

It may not be a problem in nova itself, but I am filing it here for triage. It may already have been debugged; in that case, please close this as a duplicate or attach the upstream patch/bz.

Version-Release number of selected component (if applicable):
openstack-nova-common-23.2.1-0.20220508110405.0190d58.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220508110405.0190d58.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220508110405.0190d58.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Live-migrate an instance (probably block live-migration without shared storage)

Actual results:
Migration fails and rollback is triggered.

Expected results:
Successfully migrated instance

Additional info:

Comment 1 Artom Lifshitz 2022-06-06 17:59:58 UTC
I initially thought we might have regressed on https://bugzilla.redhat.com/show_bug.cgi?id=1986567, but the fix for that is in RHOS-17.0-RHEL-8-20220502.n.2, which AFAICT is earlier than the affected versions in the description. And in any case, that BZ was for vhost-user network interfaces, and based on the XML for instance c3dd26db-d700-469a-8640-4aefac17883f that I see in the attached nova-compute log, it's using kernel vhost.

Comment 2 smooney 2022-06-15 10:16:14 UTC
Can you provide the nova.conf and/or the full sos reports for the failing node?

Without that we will likely have to close this as not enough info.

There are a number of config options that can break postcopy if enabled, I believe.
Otherwise this might in fact be a regression in rhel/qemu.

I'll take a look at the attached compute log shortly, but skimming it, it does not have the config options from agent start-up present, so I
likely can't deduce the failure from that log alone.
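
As a rough illustration of the kind of check the requested nova.conf would allow, the sketch below reads the two live migration options in the [libvirt] section that seem most relevant here, live_migration_permit_post_copy and live_migration_permit_auto_converge. Picking these two options is an assumption about which settings matter for this bug, and postcopy_settings is a hypothetical helper name.

```python
import configparser

# Options assumed to be relevant to this bug; both live in nova's [libvirt] section.
POSTCOPY_KEYS = (
    "live_migration_permit_post_copy",
    "live_migration_permit_auto_converge",
)


def postcopy_settings(nova_conf_text):
    """Return the raw string values of the postcopy-related options, or None if unset."""
    # interpolation=None keeps '%' characters literal; strict=False tolerates
    # duplicated sections or keys, which hand-edited config files may contain.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    return {key: parser.get("libvirt", key, fallback=None) for key in POSTCOPY_KEYS}
```

A value of None only means the option is not set in the file that was read, so nova's built-in default applies; the effective value also depends on any additional config files the service loads.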

Comment 6 Artom Lifshitz 2022-07-11 16:04:30 UTC
Tech call notes: this BZ will track reverting postcopy by default from OSP 17.0 (we've never done a release with postcopy turned on by default), will clone this to 17.1 to do the actual root cause analysis within qemu/libvirt/gremlins?

Comment 12 smooney 2022-07-28 12:50:17 UTC

*** This bug has been marked as a duplicate of bug 2110556 ***

Comment 14 Red Hat Bugzilla 2023-09-15 01:55:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days