Bug 1902631 - [RHOSP 13 to 16.1 Upgrades][OvS-DPDK] DPDK vms fail to live-migrate between 13->16.1 upgrade
Summary: [RHOSP 13 to 16.1 Upgrades][OvS-DPDK] DPDK vms fail to live-migrate between ...
Keywords:
Status: CLOSED DUPLICATE of bug 1916869
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Maxime Coquelin
QA Contact: nlevinki
URL:
Whiteboard:
Depends On: 1916832 1917817
Blocks: 2244628
 
Reported: 2020-11-30 09:03 UTC by Yadnesh Kulkarni
Modified: 2023-10-17 12:28 UTC (History)
17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1917817 2244628 (view as bug list)
Environment:
Last Closed: 2021-01-27 13:51:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-29796 0 None None None 2023-10-17 12:28:02 UTC
Red Hat Knowledge Base (Solution) 5571081 0 None None None 2020-12-22 14:28:21 UTC
Red Hat Knowledge Base (Solution) 5675781 0 None None None 2020-12-31 13:22:13 UTC

Description Yadnesh Kulkarni 2020-11-30 09:03:51 UTC
Description of problem:

Live migration of DPDK VMs from an OSP 13 compute node to an OSP 16.1 compute node fails. Cold migration works fine.

(overcloud) [stack@undercloud ~]$ openstack server list --long | grep dpdk-inst2
| 1102df92-4228-4ebe-855a-02fe4b0fec96 | dpdk-inst2 | ACTIVE | None       | Running     | dpdk=192.168.24.250 | rhel7      | ca78bb6a-0630-42ac-9b7c-e52f5d4e81a5 |             |           | nova              | overcloud-computeovsdpdk-0.localdomain | 


The instance does end up on the destination compute node (16.1), but the migration appears to have failed due to unsupported virtio features.
~~~
[root@overcloud-computeovsdpdk-1 ~]# tail -n8  /var/log/libvirt/qemu/instance-0000000e.log
2020-11-27T08:11:40.854290Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
char device redirected to /dev/pts/2 (label charserial0)
2020-11-27T08:11:42.158333Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-11-27T08:11:49.771254Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-11-27T08:11:49.771301Z qemu-kvm: Failed to load virtio-net:virtio
2020-11-27T08:11:49.771312Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-11-27T08:11:49.771683Z qemu-kvm: load of migration failed: Operation not permitted
2020-11-27 08:11:49.987+0000: shutting down, reason=failed
~~~
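
For what it's worth, the two masks in the log line above can be diffed directly to see which negotiated feature the destination vhost-user backend rejects. A minimal sketch in bash, using the values copied from the log (the bit name is taken from the virtio spec and is an interpretation, not something the log itself states):
~~~
# Feature mask carried in the incoming migration stream vs. the mask the
# destination backend allows (values copied from the qemu log above).
src=0x130afe7a2
dst=0x178bfa7e6

# Bits that the source negotiated but the destination does not allow.
diff=$(( src & ~dst ))
printf 'unsupported bits: 0x%x\n' "$diff"

for bit in $(seq 0 35); do
    (( (diff >> bit) & 1 )) && echo "bit $bit not allowed on destination"
done
~~~
With these values the only offending bit is 14, which is VIRTIO_NET_F_HOST_UFO in the virtio spec, i.e. the UFO offload negotiated by the guest on the OSP 13 source is refused by the 16.1 vhost-user backend.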


nova-compute debug logs on the source node
~~~
2020-11-30 06:31:40.245 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Current 50 elapsed 9 steps [(0, 50), (300, 95), (600, 140), (900, 185), (1200, 230), (1500, 275), (1800, 320), (2100, 365), (2400, 410), (2700, 455), (3000, 500)] update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:501
2020-11-30 06:31:40.246 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:513
2020-11-30 06:31:40.256 7 ERROR nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2020-11-30T06:31:31.524866Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
2020-11-30T06:31:32.395046Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-11-30T06:31:39.931517Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-11-30T06:31:39.931562Z qemu-kvm: Failed to load virtio-net:virtio
2020-11-30T06:31:39.931573Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-11-30T06:31:39.931948Z qemu-kvm: load of migration failed: Operation not permitted: libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-11-30T06:31:31.524866Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu52667eae-a7,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu52667eae-a7,server
2020-11-30 06:31:40.257 7 DEBUG nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9144
2020-11-30 06:31:40.281 7 INFO nova.compute.manager [req-afa6db20-2e13-4fcf-bed6-606de5f0fc36 - - - - -] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] During sync_power_state the instance has a pending task (migrating). Skip.
2020-11-30 06:31:40.749 7 DEBUG nova.virt.libvirt.migration [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] VM running on src, migration failed _log /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:419
2020-11-30 06:31:40.750 7 DEBUG nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Fixed incorrect job type to be 4 _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8958
2020-11-30 06:31:40.750 7 ERROR nova.virt.libvirt.driver [-] [instance: 1102df92-4228-4ebe-855a-02fe4b0fec96] Migration operation has aborted
~~~
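
In case it helps others triaging the same failure: the lines above come from the containerized nova-compute log on the source node. A small sketch, assuming the default containerized log layout (paths may differ per deployment):
~~~
# On the OSP 13 source compute node, pull out the live-migration failure lines
# from the containerized nova-compute log.
grep -E 'Live Migration failure|load of migration failed' \
    /var/log/containers/nova/nova-compute.log
~~~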


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
While upgrading an OSP 13 compute node, live-migrate the workload to an already upgraded OSP 16.1 compute node.

Actual results:
Workload migration fails, which prevents the compute node from being upgraded to 16.1.

Expected results:
The workload should migrate to the OSP 16.1 compute node so that the source compute node can be upgraded.

Additional info:

Comment 6 Ketan Mehta 2020-12-04 11:10:29 UTC
Hello,

I've encountered a similar problem during an attempt to live-migrate a DPDK instance from an older compute node (on RHOSP 13) to an upgraded compute node (RHOSP 16.1).

Here is the error snippet:

~~~
2020-12-04 10:00:21.134 8 ERROR nova.virt.libvirt.driver [-] [instance: b129442b-e162-490f-b30c-7e1d99dce35b] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2020-12-04T10:00:09.889529Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu25fdafef-e9,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu25fdafef-e9,server
2020-12-04T10:00:12.080859Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-12-04T10:00:20.705320Z qemu-kvm: Features 0x130afe7a2 unsupported. Allowed features: 0x178bfa7e6
2020-12-04T10:00:20.705355Z qemu-kvm: Failed to load virtio-net:virtio
2020-12-04T10:00:20.705364Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2020-12-04T10:00:20.705722Z qemu-kvm: load of migration failed: Operation not permitted: libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-12-04T10:00:09.889529Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu25fdafef-e9,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu25fdafef-e9,server
~~~

The instance was rolled back to the source compute node. I've halted the upgrade process for now in order to complete the migration first.

So, if any specific logs are needed, I can provide that information.

Qemu-kvm version:

On destination node: (RHEL 8.2, within the nova_libvirt container)

qemu-kvm-common-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-block-curl-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-core-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64

On source node:

qemu-kvm-common-rhev-2.12.0-48.el7_9.1.x86_64
qemu-kvm-rhev-2.12.0-48.el7_9.1.x86_64
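
For anyone collecting the same data, a rough sketch of how these versions can be gathered on each side (the container runtimes and the nova_libvirt container name are assumptions based on the default deployment):
~~~
# Destination (OSP 16.1, RHEL 8): qemu-kvm lives inside the nova_libvirt container.
sudo podman exec nova_libvirt rpm -qa 'qemu-kvm*'

# Source (OSP 13, RHEL 7): same check, but the container runtime is docker.
sudo docker exec nova_libvirt rpm -qa 'qemu-kvm*'
~~~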

Comment 9 Yariv 2020-12-07 10:04:50 UTC
(In reply to Ketan Mehta from comment #6)

Please refer to this note:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#overcloud-storage

You have two options: use Ceph or external Ceph in place of LVM. In that use case the upgrade will pass, but only without workloads; it is tested but not a supported use case.

For OVS+DPDK we have a BZ with a workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1895887

Comment 68 smooney 2021-01-27 13:51:47 UTC
Sorry, I meant to update this on Friday.

To avoid a proliferation of bugs, I'm going to close this as a duplicate of the existing DDF bug https://bugzilla.redhat.com/show_bug.cgi?id=1916869. On our internal call we confirmed the assertion that the work required to make this scenario work is excessive and would be an unreasonable amount of technical debt to maintain, given that the VM would eventually have to be hard rebooted anyway.

For that reason, this will be addressed by a documentation update noting that live migration with OVS-DPDK is not supported during FFU, due to the changes required to ensure VMs only negotiate valid offloads when using OVS-DPDK.

All approaches we explored either require a VM reboot or a regression of the offload negotiation fix followed by an eventual VM reboot, so it is our view that it is better to take that reboot up front in the form of a cold migration.
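
For completeness, a minimal sketch of the cold-migration workaround referred to above (instance name taken from the original report; exact openstackclient syntax varies between releases):
~~~
# Cold-migrate the DPDK instance instead of live-migrating it: the guest is
# restarted on the destination and renegotiates its virtio features there.
openstack server migrate dpdk-inst2

# Once the instance reaches VERIFY_RESIZE, confirm the migration
# (newer clients also accept "openstack server resize confirm").
openstack server resize --confirm dpdk-inst2
~~~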

*** This bug has been marked as a duplicate of bug 1916869 ***

Comment 69 Red Hat Bugzilla 2023-09-15 00:52:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

