Created attachment 1887056
nova-compute log

Description of problem:
BZ #2077964 covers failing live migration on OSP17. It appears to fix the rollback of a failed live migration with the following patch:
https://review.opendev.org/c/openstack/nova/+/839227/

However, the original reason the migration fails does not seem to have been investigated. I applied the patch, and the tempest live-migration tests still fail with the following error:

ERROR nova.virt.libvirt.driver [-] [instance: c3dd26db-d700-469a-8640-4aefac17883f] Live Migration failure: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported: libvirt.libvirtError: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported

The rollback is now successful, and the NotImplemented error is no longer thrown.

The following tempest tests are failing:
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration
tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_migration_with_trunk
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_block_migration
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_migration_with_trunk

This may not be a problem in nova itself, but I am filing it here for triage. If it has already been debugged, please close this as a duplicate or attach the upstream patch/BZ.

Version-Release number of selected component (if applicable):
openstack-nova-common-23.2.1-0.20220508110405.0190d58.el9ost.noarch
openstack-nova-compute-23.2.1-0.20220508110405.0190d58.el9ost.noarch
openstack-nova-migration-23.2.1-0.20220508110405.0190d58.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Live-migrate an instance (probably block live migration without shared storage)

Actual results:
Migration fails and rollback is triggered.

Expected results:
Successfully migrated instance

Additional info:
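For anyone triaging a similar nova-compute log, the failure line quoted above can be picked out with a small script. This is only a rough sketch: the regular expression is modelled on the single log line in this report, and the helper names (find_live_migration_failures, is_postcopy_unsupported) are made up for illustration and are not part of nova.

```python
import re

# Pattern modelled on the failure line quoted in this report (an assumption,
# not an official nova log format specification).
FAILURE_RE = re.compile(
    r"\[instance: (?P<uuid>[0-9a-f-]{36})\] "
    r"Live Migration failure: (?P<reason>.*)"
)


def find_live_migration_failures(lines):
    """Return (instance_uuid, reason) tuples for each matching log line."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            failures.append((match.group("uuid"), match.group("reason").strip()))
    return failures


def is_postcopy_unsupported(reason):
    """True if the failure reason is the QEMU post-copy error seen here."""
    return "Postcopy is not supported" in reason
```

Feeding it the lines of the attached log (for example via open(path) as an iterable) would list the affected instance UUIDs alongside the QEMU error text.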
I initially thought we might have regressed on https://bugzilla.redhat.com/show_bug.cgi?id=1986567, but the fix for that is in RHOS-17.0-RHEL-8-20220502.n.2, which AFAICT is earlier than the affected versions in the description. And in any case, that BZ was for vhost-user network interfaces, and based on the XML for instance c3dd26db-d700-469a-8640-4aefac17883f that I see in the attached nova-compute log, it's using kernel vhost.
Can you provide the nova.conf and/or the full sos reports for the failing node? Without that we will likely have to close this as not enough info. There are a number of config options that can break postcopy if enabled, I believe. Otherwise this might in fact be a regression in rhel/qemu. I'll take a look at the attached compute log shortly, but skimming it, it does not have the config options from agent startup present, so I likely cannot deduce the failure from that log alone.
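As a rough illustration of what would be checked in the requested nova.conf, the sketch below pulls a few live-migration options out of the [libvirt] section. The option names are real nova [libvirt] options, but the selection is my own guess at what is relevant and is not exhaustive. The function name is hypothetical, and Python's configparser is only an approximation of how oslo.config reads the file.

```python
import configparser

# Options in nova's [libvirt] group that influence how a live migration is
# driven. Which of them matter for this bug is an assumption.
POSTCOPY_RELATED = (
    "live_migration_permit_post_copy",
    "live_migration_permit_auto_converge",
    "live_migration_timeout_action",
)


def libvirt_migration_options(conf_text):
    """Return the explicitly set post-copy related [libvirt] options."""
    # interpolation=None keeps '%' characters literal; strict=False tolerates
    # repeated sections or options, which oslo.config also allows.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    if not parser.has_section("libvirt"):
        return {}
    return {
        name: parser.get("libvirt", name)
        for name in POSTCOPY_RELATED
        if parser.has_option("libvirt", name)
    }
```

An empty result means none of these options is set explicitly, so nova's built-in defaults (or whatever the deployment tooling injects elsewhere) apply.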
Tech call notes: this BZ will track reverting postcopy by default from OSP 17.0 (we've never done a release with postcopy turned on by default). We will clone this to 17.1 to do the actual root cause analysis within qemu/libvirt/gremlins?
*** This bug has been marked as a duplicate of bug 2110556 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days