Description of problem:

Live migration of DPDK VMs from an OSP13 compute node to an OSP16.1 compute node fails. Cold migration works fine.

(overcloud) [stack@undercloud ~]$ openstack server list --long | grep dpdk-inst2
| 1102df92-4228-4ebe-855a-02fe4b0fec96 | dpdk-inst2 | ACTIVE | None | Running | dpdk=192.168.24.250 | rhel7 | ca78bb6a-0630-42ac-9b7c-e52f5d4e81a5 | | | nova | overcloud-computeovsdpdk-0.localdomain |

The instance does end up on the destination compute node (16.1), but the migration fails there because the incoming virtio-net device state carries feature bits the destination QEMU does not support:

~~~
[root@overcloud-computeovsdpdk-1 ~]# tail -n8 /var/log/libvirt/qemu/instance-0000000e.log
2020-11-27T08:11:40.854290Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
char device redirected to /dev/pts/2 (label charserial0)
2020-11-27T08:11:42.158333Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-11-27T08:11:49.771254Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-11-27T08:11:49.771301Z qemu-kvm: Failed to load virtio-net:virtio
2020-11-27T08:11:49.771312Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-11-27T08:11:49.771683Z qemu-kvm: load of migration failed: Operation not permitted
2020-11-27 08:11:49.987+0000: shutting down, reason=failed
~~~

nova-compute debug logs on the source node:

~~~
2020-11-30 06:31:40.245 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Current 50 elapsed 9 steps [(0, 50), (300, 95), (600, 140), (900, 185), (1200, 230), (1500, 275), (1800, 320), (2100, 365), (2400, 410), (2700, 455), (3000, 500)] update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:501
2020-11-30 06:31:40.246 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:513
2020-11-30 06:31:40.256 7 ERROR nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2020-11-30T06:31:31.524866Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
2020-11-30T06:31:32.395046Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-11-30T06:31:39.931517Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-11-30T06:31:39.931562Z qemu-kvm: Failed to load virtio-net:virtio
2020-11-30T06:31:39.931573Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-11-30T06:31:39.931948Z qemu-kvm: load of migration failed: Operation not permitted: libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-11-30T06:31:31.524866Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
2020-11-30 06:31:40.257 7 DEBUG nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9144
2020-11-30 06:31:40.281 7 INFO nova.compute.manager [req-afa6db20-2e13-4fcf-bed6-606de5f0fc36 - - - - -] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] During sync_power_state the instance has a pending task (migrating). Skip.
2020-11-30 06:31:40.749 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] VM running on src, migration failed _log /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:419
2020-11-30 06:31:40.750 7 DEBUG nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Fixed incorrect job type to be 4 _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8958
2020-11-30 06:31:40.750 7 ERROR nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Migration operation has aborted
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. While upgrading an OSP13 compute node, live-migrate the workload to an OSP16.1 compute node.

Actual results:
Workload migration fails, which prevents upgrading the compute node to 16.1.

Expected results:
The workload migrates to the OSP16.1 compute node so the upgrade can proceed.

Additional info:
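The two feature masks in the QEMU error pinpoint the incompatibility: ANDing the guest's features with the complement of the destination's allowed mask yields the bits the destination refuses. A minimal sketch, using the values from the log above; the bit-name table is my own annotation, taken from the virtio-net feature bit assignments in the virtio specification, not from these logs:

```python
# Decode which virtio feature bits the migrated device state carries
# that the destination QEMU does not allow (values from the log above).
guest_features = 0x130afe7a2    # "Features 0x130afe7a2 unsupported."
allowed_features = 0x178bfa7e6  # "Allowed features: 0x178bfa7e6"

# Bits set by the source's device state but refused by the destination.
missing = guest_features & ~allowed_features

# A few virtio-net feature bit names, per the virtio specification.
VIRTIO_NET_FEATURES = {
    10: "VIRTIO_NET_F_GUEST_UFO",
    14: "VIRTIO_NET_F_HOST_UFO",
    22: "VIRTIO_NET_F_MQ",
}

offending = [VIRTIO_NET_FEATURES.get(bit, f"bit {bit}")
             for bit in range(missing.bit_length())
             if missing & (1 << bit)]
print(hex(missing), offending)  # → 0x4000 ['VIRTIO_NET_F_HOST_UFO']
```

In other words, the source guest negotiated UFO (UDP fragmentation offload), an offload that the newer destination host no longer offers, which is consistent with the offload-negotiation issue described in the closing comment.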
Hello,

I've encountered a similar problem during an attempt to live migrate a DPDK instance from an older compute node (on RHOSP13) to an upgraded compute node (RHOSP16.1). Here is the error snippet:

~~~
2020-12-04 10:00:21.134 8 ERROR nova.virt.libvirt.driver [-] [instance: b129442b-e162-490f-b30c-7e1d99dce35b] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2020-12-04T10:00:09.889529Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu25fdafef-e9,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu25fdafef-e9,server
2020-12-04T10:00:12.080859Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-12-04T10:00:20.705320Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-12-04T10:00:20.705355Z qemu-kvm: Failed to load virtio-net:virtio
2020-12-04T10:00:20.705364Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-12-04T10:00:20.705722Z qemu-kvm: load of migration failed: Operation not permitted: libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-12-04T10:00:09.889529Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu25fdafef-e9,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu25fdafef-e9,server
~~~

The instance was rolled back to the source compute node. I've halted the upgrade process for now to complete the migration first. If any specific logs are needed, I can provide them.
Qemu-kvm versions:

On the destination node (RHEL 8.2, within the nova_libvirt container):

qemu-kvm-common-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-block-curl-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-core-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64

On the source node:

qemu-kvm-common-rhev-2.12.0-48.el7_9.1.x86_64
qemu-kvm-rhev-2.12.0-48.el7_9.1.x86_64
(In reply to Ketan Mehta from comment #6)
Please refer to this note:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#overcloud-storage

You have two options. With ceph/ceph/external for LVM, the upgrade will pass without workloads only; this is tested but not a supported use case. For OVS+DPDK we have a BZ with a workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1895887
Sorry, I meant to update this on Friday. To avoid the proliferation of bugs, I am going to close this as a duplicate of the existing bug https://bugzilla.redhat.com/show_bug.cgi?id=1916869.

In our internal call we confirmed that the work required to make this function is excessive and would create an unreasonable amount of technical debt to maintain, given that the VM would eventually have to be hard rebooted anyway. For that reason this will be addressed by a documentation update noting that live migration with OVS-DPDK is not supported during FFU, due to the changes required to ensure VMs only negotiate valid offloads when using OVS-DPDK. All approaches we explored either require a VM reboot, or a regression of the offload negotiation fix followed by an eventual VM reboot, so it is our view that it is better to take that reboot up front in the form of a cold migration.

*** This bug has been marked as a duplicate of bug 1916869 ***
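Since the resolution is to cold-migrate instead, the operator workflow would be roughly the following sketch. The instance UUID is the one from this report; the exact `resize confirm` syntax varies between python-openstackclient releases, so check `openstack server resize --help` on your undercloud:

```shell
# Cold-migrate the DPDK instance off the OSP13 node (instance reboots as
# part of the move, which is exactly the reboot the comment above accepts).
openstack server migrate 1102df92-4228-4ebe-855a-02fe4b0fec96

# Wait for the instance to reach VERIFY_RESIZE status, then confirm:
openstack server status show is illustrative; poll with:
#   openstack server show 1102df92-4228-4ebe-855a-02fe4b0fec96 -f value -c status
openstack server resize confirm 1102df92-4228-4ebe-855a-02fe4b0fec96
```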
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days