Bug 2093905 - [OSP17][Live Migration] Unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported
Keywords:
Status: CLOSED DUPLICATE of bug 2110556
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 17.0
Assignee: Bogdan Dobrelya
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-06 10:34 UTC by Marian Krcmarik
Modified: 2023-09-15 01:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-28 12:50:17 UTC
Target Upstream Version:
Embargoed:


Attachments
nova-compute log (1.70 MB, text/plain)
2022-06-06 10:34 UTC, Marian Krcmarik


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 850902 0 None ABANDONED [Wallaby-only] Disallow postcopy for live migration 2024-05-22 11:08:38 UTC
Red Hat Issue Tracker OSP-15543 0 None None None 2022-06-06 10:38:45 UTC

Description Marian Krcmarik 2022-06-06 10:34:31 UTC
Created attachment 1887056 [details]
nova-compute log

Description of problem:
bz #2077964 covers failing live migration on OSP17. It appears to address the rollback of a failed live migration with the following patch:
https://review.opendev.org/c/openstack/nova/+/839227/
However, the original reason the migration fails does not seem to have been investigated. With the patch applied, the tempest live migration tests still fail with the following error:
ERROR nova.virt.libvirt.driver [-] [instance: c3dd26db-d700-469a-8640-4aefac17883f] Live Migration failure: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported: libvirt.libvirtError: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported
The rollback now succeeds, and the NotImplemented error is no longer thrown.
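
To check whether other instances hit the same failure, the nova-compute log can be scanned for the error quoted above. The following is only a minimal sketch: the function name and the regular expression are illustrative assumptions, derived from the single log line in this report rather than from any nova tooling.

```python
import re

# Matches the nova libvirt driver error quoted in this report and captures
# the instance UUID plus the failure reason. The pattern is an assumption
# based on that one log line, not a documented nova log format.
FAILURE_RE = re.compile(
    r"\[instance: (?P<uuid>[0-9a-f-]{36})\] "
    r"Live Migration failure: (?P<reason>.*)"
)


def find_postcopy_failures(lines):
    """Return (instance_uuid, reason) for each postcopy-related failure."""
    hits = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match and "Postcopy is not supported" in match.group("reason"):
            hits.append((match.group("uuid"), match.group("reason").strip()))
    return hits
```

Passing an open file object for the attached nova-compute log as `lines` yields one tuple per matching log line.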

The following tempest tests are failing:
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_migration_with_trunk
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_block_migration
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_migration_with_trunk

This may not be a problem in nova itself, but I am filing it here for triage. If it has already been debugged, please close this as a duplicate or attach the upstream patch/bz.

Version-Release number of selected component (if applicable):
openstack-nova-common-23.2.1-0.20220508110405.0190d58.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220508110405.0190d58.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220508110405.0190d58.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Live-migrate an instance (probably block live migration without shared storage)

Actual results:
Migration fails and rollback is triggered.

Expected results:
Successfully migrated instance

Additional info:

Comment 1 Artom Lifshitz 2022-06-06 17:59:58 UTC
I initially thought we might have regressed on https://bugzilla.redhat.com/show_bug.cgi?id=1986567, but the fix for that is in RHOS-17.0-RHEL-8-20220502.n.2, which AFAICT is earlier than the affected versions in the description. And in any case, that BZ was for vhost-user network interfaces, and based on the XML for instance c3dd26db-d700-469a-8640-4aefac17883f that I see in the attached nova-compute log, it's using kernel vhost.

Comment 2 smooney 2022-06-15 10:16:14 UTC
Can you provide the nova.conf and/or the full sos reports for the failing node?

Without that we will likely have to close this as not enough info.

There are a number of config options that can break postcopy if enabled, I believe.
Otherwise this might in fact be a regression in rhel/qemu.

I'll take a look at the attached compute log shortly, but from skimming it, it does not have the config options from agent start-up present, so I
likely cannot deduce the failure from that log alone.
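
Once a nova.conf is available, the live-migration settings can be pulled out quickly. The sketch below is an illustration only. The comment above does not name the config options it has in mind, so the three `[libvirt]` options checked here are an assumption taken from upstream nova's configuration reference, and `libvirt_migration_options` is a hypothetical helper name.

```python
import configparser

# [libvirt] options in nova.conf that influence how a live migration is
# driven. This selection is an assumption; the bug report does not say
# which options are suspected.
OPTIONS = (
    "live_migration_permit_post_copy",
    "live_migration_permit_auto_converge",
    "live_migration_timeout_action",
)


def libvirt_migration_options(conf_text):
    """Return the configured value (or None if unset) for each option."""
    # strict=False tolerates repeated keys; interpolation=None keeps
    # literal '%' characters in values from raising errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    if not parser.has_section("libvirt"):
        return {name: None for name in OPTIONS}
    return {
        name: parser.get("libvirt", name, fallback=None) for name in OPTIONS
    }
```

A value of `None` means the option is not set in the file, in which case nova's built-in default applies.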

Comment 6 Artom Lifshitz 2022-07-11 16:04:30 UTC
Tech call notes: this BZ will track reverting postcopy-by-default in OSP 17.0 (we've never done a release with postcopy turned on by default). We will clone this to 17.1 for the actual root cause analysis (within qemu/libvirt/gremlins?).

Comment 12 smooney 2022-07-28 12:50:17 UTC

*** This bug has been marked as a duplicate of bug 2110556 ***

Comment 14 Red Hat Bugzilla 2023-09-15 01:55:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

