Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2244628

Summary: [RHOSP 13 to 16.2 Upgrades][OvS-DPDK] DPDK vms fail to live-migrate between 13->16.2 upgrade
Product: Red Hat OpenStack Reporter: Vadim Khitrin <vkhitrin>
Component: documentationAssignee: RHOS Documentation Team <rhos-docs>
Status: CLOSED NOTABUG QA Contact: RHOS Documentation Team <rhos-docs>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 16.2 (Train)CC: alifshit, cfields, dgilbert, dvd, eshulman, fbaudin, fhallal, fleitner, gregraka, hakhande, i.maximets, jamsmith, kchamart, kmehta, kthakre, maxime.coquelin, mburns, morazi, msufiyan, nlevinki, smooney, ykulkarn, yrachman
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1902631 Environment:
Last Closed: 2023-11-14 13:54:00 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1902631, 1916832, 1917817    
Bug Blocks:    

Description Vadim Khitrin 2023-10-17 12:26:33 UTC
+++ This bug was initially created as a clone of Bug #1902631 +++

Description of problem:

Live-migration of DPDK VMs from an OSP 13 compute node to an OSP 16.2 compute node fails. Cold migration works fine.

(overcloud) [stack@undercloud ~]$ openstack server list --long | grep dpdk-inst2
| 1102df92-4228-4ebe-855a-02fe4b0fec96 | dpdk-inst2 | ACTIVE | None       | Running     | dpdk=192.168.24.250 | rhel7      | ca78bb6a-0630-42ac-9b7c-e52f5d4e81a5 |             |           | nova              | overcloud-computeovsdpdk-0.localdomain | 


The instance does end up on the destination compute node (16.2), but it seems to have failed due to unsupported features.
~~~
[root@overcloud-computeovsdpdk-1 ~]# tail -n8  /var/log/libvirt/qemu/instance-0000000e.log
2020-11-27T08:11:40.854290Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
char device redirected to /dev/pts/2 (label charserial0)
2020-11-27T08:11:42.158333Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-11-27T08:11:49.771254Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-11-27T08:11:49.771301Z qemu-kvm: Failed to load virtio-net:virtio
2020-11-27T08:11:49.771312Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-11-27T08:11:49.771683Z qemu-kvm: load of migration failed: Operation not permitted
2020-11-27 08:11:49.987+0000: shutting down, reason=failed
~~~
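The two masks in the "Features ... unsupported" line can be compared directly to see which negotiated feature the destination backend cannot provide. A minimal Python sketch (the masks are copied from the log above; the bit-to-name mapping follows the virtio specification, and reading bit 14 as VIRTIO_NET_F_HOST_UFO is our interpretation, not something the log states):

```python
# Masks copied verbatim from the qemu-kvm log above.
requested = 0x130AFE7A2  # features carried in the migrated virtio-net state
allowed   = 0x178BFA7E6  # features the destination backend offers

# Bits set in the incoming state but absent on the destination.
missing = requested & ~allowed
print(hex(missing))  # 0x4000 -> bit 14 (VIRTIO_NET_F_HOST_UFO per the virtio spec)

for bit in range(missing.bit_length()):
    if missing >> bit & 1:
        print("unsupported feature bit:", bit)
```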


nova-compute debug logs on the source node
~~~
2020-11-30 06:31:40.245 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Current 50 elapsed 9 steps [(0, 50), (300, 95), (600, 140), (900, 185), (1200, 230), (1500, 275), (1800, 320), (2100, 365), (2400, 410), (2700, 455), (3000, 500)] update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:501
2020-11-30 06:31:40.246 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:513
2020-11-30 06:31:40.256 7 ERROR nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2020-11-30T06:31:31.524866Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
2020-11-30T06:31:32.395046Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-11-30T06:31:39.931517Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-11-30T06:31:39.931562Z qemu-kvm: Failed to load virtio-net:virtio
2020-11-30T06:31:39.931573Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-11-30T06:31:39.931948Z qemu-kvm: load of migration failed: Operation not permitted: libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-11-30T06:31:31.524866Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
2020-11-30 06:31:40.257 7 DEBUG nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9144
2020-11-30 06:31:40.281 7 INFO nova.compute.manager [req-afa6db20-2e13-4fcf-bed6-606de5f0fc36 - - - - -] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] During sync_power_state the instance has a pending task (migrating). Skip.
2020-11-30 06:31:40.749 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] VM running on src, migration failed _log /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:419
2020-11-30 06:31:40.750 7 DEBUG nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Fixed incorrect job type to be 4 _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8958
2020-11-30 06:31:40.750 7 ERROR nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Migration operation has aborted
~~~


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
While upgrading osp13 compute node, migrate the workload to osp16.2 compute node

Actual results:
Workload migration fails, which prevents the compute node from being upgraded to 16.2

Expected results:
Workload should migrate to the osp16.1 compute node to perform the upgrade
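The failure mode above can be summarised as a simple compatibility check: live migration is only viable when every feature bit negotiated by the guest on the source host is still offered by the destination backend. A hypothetical sketch (this helper is ours, not Nova or libvirt code):

```python
def can_live_migrate(src_negotiated: int, dst_allowed: int) -> bool:
    """Live migration requires the destination to offer every feature
    bit already negotiated by the guest on the source host."""
    return src_negotiated & ~dst_allowed == 0

# Masks from the qemu-kvm log in this report:
print(can_live_migrate(0x130AFE7A2, 0x178BFA7E6))  # False -> only cold migration works
```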

Comment 1 Artom Lifshitz 2023-10-18 15:25:45 UTC
Thanks for the bug report; a couple of questions:

> Live-migration of DPDK VMs from compute node 13 to 16. fails. Cold migration works fine.

Please tell me that you mean RHEL 7 to RHEL 8, and that all OSP containers have been upgraded to OSP 16. Running both OSP 13 and OSP 16 compute containers at the same time in the same deployment has never been supported.

> Expected results:
> Workload should migrate to osp16.1 compute node to perform upgrade

You filed this bug under documentation. As part of [1], of which this BZ is a double-duplicate (this BZ is a duplicate of 1902631, which is itself a duplicate of 1916869), we already documented that live migration of OVS-DPDK instances during FFU is not supported [2]. What is your expectation in filing this BZ? We're not going to do any more code fixes for 16.1 or 16.2...

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1916869
[2] https://gitlab.cee.redhat.com/rhci-documentation/docs-Red_Hat_Enterprise_Linux_OpenStack_Platform/-/merge_requests/6001/diffs

Comment 2 Vadim Khitrin 2023-10-25 11:02:46 UTC
(In reply to Artom Lifshitz from comment #1)
> Thanks for the bug report, couple of questions:
> 
> > Live-migration of DPDK VMs from compute node 13 to 16. fails. Cold migration works fine.
> 
> Please tell me that you mean RHEL 7 to RHEL 8, and that all OSP containers
> have been upgraded to OSP 16. Running both OSP 13 and OSP 16 compute
> containers at the same time in the same deployment has never been supported.

Yes, this is the case. As you mentioned, I cloned this bug from the original bug.
After updating the operating system to RHEL 8, we cannot live-migrate workloads from the RHEL 7 host.

> 
> > Expected results:
> > Workload should migrate to osp16.1 compute node to perform upgrade
> 
> You filed this bug under documentation. As part of [1], of which this BZ is
> a double-duplicate of (this BZ is a duplicate of 1902631, which is itself a
> duplicate of 1916869), we already documented that live migration of OVS-DPDK
> instances during FFU is not supported [2]. What is your expectation in
> filing this BZ? We're not going to do any more code fixes for 16.1 or 16.2...

We have cloned this bug so that it can be mentioned in the 16.2.6 release notes if it is not already present.
Our team asked the openvswitch team about this issue and was told they did not expect this to work (we also asked in the Nova channel and were told that the worst-case scenario, cold migration, is still supported).

> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1916869
> [2]
> https://gitlab.cee.redhat.com/rhci-documentation/docs-
> Red_Hat_Enterprise_Linux_OpenStack_Platform/-/merge_requests/6001/diffs