Bug 1527532 - Unable to live migrate vm in DPDK environment
Summary: Unable to live migrate vm in DPDK environment
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Sahid Ferdjaoui
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On: 1542107 1543165 1543166
Blocks:
 
Reported: 2017-12-19 12:37 UTC by Eyal Dannon
Modified: 2019-09-09 14:26 UTC
CC List: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-05 07:16:46 UTC
Target Upstream Version:
Embargoed:


Attachments:
Compute-0 sosreport (10.19 MB, application/x-xz), 2017-12-19 12:37 UTC, Eyal Dannon
Compute-1 sosreport (10.35 MB, application/x-xz), 2017-12-19 12:39 UTC, Eyal Dannon


Links:
Launchpad 1622270, last updated 2017-12-22 14:09:51 UTC

Description Eyal Dannon 2017-12-19 12:37:40 UTC
Created attachment 1370007 [details]
Compute-0 sosreport

Description of problem:
I'm trying to live migrate an instance in an OSPd10 DPDK environment using guide [1].

On the first try, the migration failed with the following warning:

2017-12-19 08:17:26.293 20611 WARNING nova.virt.libvirt.driver [req-1c5f59ca-b9ff-4522-90eb-0dfe20f52b89 - - - - -] couldn't obtain the vcpu count from domain id: 5956b817-567f-4eb5-a6c5-bf640d7ae8f1, exception: Requested operation is not valid: cpu affinity is not supported

So I removed the isolcpus parameter from the grub configuration and disabled the tuned service.
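
For reference, a rough sketch of those steps on RHEL 7 (the sed pattern and the grub.cfg path are assumptions; adjust to the actual isolcpus value and boot layout):

[root@compute-0 ~]# sed -i 's/ isolcpus=[0-9,-]*//' /etc/default/grub   # drop isolcpus from GRUB_CMDLINE_LINUX
[root@compute-0 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg              # regenerate the grub config
[root@compute-0 ~]# systemctl stop tuned && systemctl disable tuned     # stop and disable the tuned service
[root@compute-0 ~]# reboot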

On the second try, the migration failed and the instance disappeared from virsh.

2017-12-19 10:03:01.180 2698 ERROR oslo_messaging.rpc.server     raise exception.InstanceNotFound(instance_id=instance_name)
2017-12-19 10:03:01.180 2698 ERROR oslo_messaging.rpc.server InstanceNotFound: Instance instance-00000009 could not be found.

Huge pages have to be assigned to the DPDK instance, so the flavor carries the huge-page property:

[stack@undercloud-0 ~]$ openstack flavor show m1.test | grep properties
| properties                 | hw:mem_page_size='large'             |
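
For reference, that property is set with the standard flavor command:

[stack@undercloud-0 ~]$ openstack flavor set m1.test --property hw:mem_page_size=large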


The instance goes into ERROR state:
[stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+------+--------+--------------------+--------------------------------------+
| ID                                   | Name | Status | Networks           | Image Name                           |
+--------------------------------------+------+--------+--------------------+--------------------------------------+
| a6047aed-e1fb-48c0-994f-65130aa19363 | test | ERROR  | mgmt=10.35.141.171 | rhel-guest-image-7.3-36.x86_64.qcow2 |
+--------------------------------------+------+--------+--------------------+--------------------------------------+

Moreover, virsh displays an empty list on both computes:

[root@compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------

[root@compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------

But the instance directory exists on the destination host:
[root@compute-0 ~]# ll /var/lib/nova/instances/a6047aed-e1fb-48c0-994f-65130aa19363/

The instance files are present on the hypervisor as shown above. When using --block-migration, I get the following error:
$ openstack server migrate a6047aed-e1fb-48c0-994f-65130aa19363 --block-migration  --live compute-0.localdomain
compute-1.localdomain is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-ff46ea00-5dd0-4096-9839-1172b5b9faa4)
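
A quick way to check whether the two computes actually share an instance store (a sketch, assuming the default instances path; shared storage would show an nfs or similar network filesystem here, while local disk shows xfs/ext4):

[root@compute-0 ~]# df -hT /var/lib/nova/instances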


sosreports for both compute nodes are attached.

Thanks!


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Migrating_VMs_from_an_Overcloud_Compute_Node


Version-Release number of selected component (if applicable):
OSPd10
openstack-tripleo-heat-templates-5.3.3-1.el7ost.noarch
libvirt-3.2.0-14.el7_4.4.x86_64
openstack-nova-compute-14.0.8-5.el7ost.noarch


How reproducible:
Always

Steps to Reproduce:
1. Boot an instance.
2. Use openstack server migrate to live migrate it to the other host.

Actual results:
Migration fails

Expected results:
The instance should migrate to the second compute node with CPU pinning and an active tuned profile.

Additional info:
This migration works in an OSP12 DPDK environment.

Comment 1 Eyal Dannon 2017-12-19 12:39:22 UTC
Created attachment 1370008 [details]
Compute-1 sosreport

Comment 2 Sahid Ferdjaoui 2017-12-20 14:57:36 UTC
I can't find the bugzilla, but I think there is a known issue with the 'openstack server migrate' command when using the '--block-migration' option.

Can you try with the command "nova live-migration --block-migrate"?

Also, can you provide a sosreport of the source host, since that is where the error message returned by libvirt/QEMU will appear?

Comment 3 Sahid Ferdjaoui 2017-12-20 14:59:22 UTC
Oh, I see compute-1 is the source host, but there is not a lot of information regarding an error related to a DPDK context. Please try with the nova command.

Comment 4 Eyal Dannon 2017-12-25 08:01:55 UTC
Hi,

[stack@undercloud-0 ~]$ nova live-migration  --block-migrate test

The instance still goes into ERROR state:
| OS-EXT-STS:vm_state                  | error

2017-12-25 07:58:33.407 2698 ERROR oslo_messaging.rpc.server InstanceNotFound: Instance instance-0000000b could not be found.
2017-12-25 07:58:33.407 2698 ERROR oslo_messaging.rpc.server
2017-12-25 07:58:45.182 2698 INFO nova.compute.manager [-] [instance: 6e269ae4-bfc7-4dd9-903a-8cd94e480271] VM Stopped (Lifecycle Event)

[root@compute-0 ~]# ll /var/lib/nova/instances/6e269ae4-bfc7-4dd9-903a-8cd94e480271/
total 56644
-rw-------. 1 root root        0 Dec 25 07:58 console.log
-rw-r--r--. 1 root root 57999360 Dec 25 07:58 disk
-rw-r--r--. 1 nova nova       78 Dec 25 07:58 disk.info

Is there any additional info I could provide? Would you like to take a look at the setup?

Thanks,

Comment 5 Sahid Ferdjaoui 2018-01-05 12:48:51 UTC
It seems that something went wrong during the post-live-migration step, but the logs you reported do not include DEBUG output, so we can't investigate the root cause any further.

Can you configure nova.conf with debug enabled and reproduce the case?
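
A minimal sketch of enabling debug on the compute nodes (assumes crudini is available; otherwise set "debug = True" under [DEFAULT] by hand):

[root@compute-1 ~]# crudini --set /etc/nova/nova.conf DEFAULT debug True
[root@compute-1 ~]# systemctl restart openstack-nova-compute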

Comment 7 Sahid Ferdjaoui 2018-01-10 13:51:48 UTC
Hmm, so the instance is crashing on the destination host.

...
2018-01-10T13:36:07.894624Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/1 (label charserial1)
2018-01-10T13:36:11.366484Z qemu-kvm: Not a migration stream
2018-01-10T13:36:11.366722Z qemu-kvm: load of migration failed: Invalid argument
2018-01-10 13:36:11.596+0000: shutting down, reason=crashed

I'm discussing this with dgilbert and continuing the investigation...

Comment 8 Sahid Ferdjaoui 2018-01-10 16:16:49 UTC
[root@compute-1 ~]# ovs-vsctl show
e333b920-a3df-4a7f-9256-0fb90824e9c8
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port int-br-link
            Interface int-br-link
                type: patch
                options: {peer=phy-br-link}
        Port "vhue13713ea-58"
            tag: 8
            Interface "vhue13713ea-58"
                type: dpdkvhostuser
    Bridge br-link
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
        Port br-link
            Interface br-link
                type: internal
        Port phy-br-link
            Interface phy-br-link
                type: patch
                options: {peer=int-br-link}
    ovs_version: "2.6.1"

[root@compute-1 ~]# ovs-vsctl list interface vhue13713ea-58
_uuid               : 58955a74-51c3-4c7a-b92c-ce4e81aba761
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : []
external_ids        : {attached-mac="fa:16:3e:09:79:34", iface-id="e13713ea-58e2-4f2e-9e21-8923accdd0c4", iface-status=active, vm-uuid="1719f566-7903-4cff-8f75-3d2801f78f66"}
ifindex             : 0
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : up
lldp                : {}
mac                 : []
mac_in_use          : "00:00:00:00:00:00"
mtu                 : 1496
mtu_request         : 1496
name                : "vhue13713ea-58"
ofport              : 9
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {"rx_1024_to_1518_packets"=1, "rx_128_to_255_packets"=26, "rx_1523_to_max_packets"=0, "rx_1_to_64_packets"=16, "rx_256_to_511_packets"=4, "rx_512_to_1023_packets"=0, "rx_65_to_127_packets"=363, rx_bytes=38586, rx_dropped=0, rx_errors=0, rx_packets=409, tx_bytes=45541, tx_packets=468}
status              : {}
type                : dpdkvhostuser

[root@compute-1 ~]# cat /var/log/libvirt/qemu/instance-00000008.log 
2018-01-10 15:38:19.809+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-12-19-04:58:04, x86-041.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-16.el7_4.13), hostname: compute-1.localdomain
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000008,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-8-instance-00000008/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client-IBRS,ss=on,hypervisor=on,tsc_adjust=on,pdpe1gb=on,mpx=off,xsavec=off,xgetbv1=off -m 4096 -realtime mlock=off -smp 6,sockets=3,cores=1,threads=2 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/8-instance-00000008,share=yes,size=4294967296,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-5,memdev=ram-node0 -uuid 1719f566-7903-4cff-8f75-3d2801f78f66 -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.8-5.el7ost,serial=e7b3bfa8-30e2-42ce-95c6-58637aa201a5,uuid=1719f566-7903-4cff-8f75-3d2801f78f66,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-8-instance-00000008/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1719f566-7903-4cff-8f75-3d2801f78f66/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhue13713ea-58 -netdev vhost-user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:09:79:34,bus=pci.0,addr=0x3 -add-fd set=0,fd=27 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.100.120.112:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
2018-01-10T15:38:19.954121Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/1 (label charserial1)
2018-01-10 16:01:17.447+0000: initiating migration
2018-01-10T16:01:17.453777Z qemu-kvm: Failed to read msg header. Read -1 instead of 12. Original request 6.
2018-01-10T16:01:17.453998Z qemu-kvm: vhost_set_log_base failed: Input/output error (5)
2018-01-10T16:01:17.454060Z qemu-kvm: Failed to set msg fds.
2018-01-10T16:01:17.454076Z qemu-kvm: vhost_set_vring_addr failed: Invalid argument (22)
2018-01-10T16:01:17.454090Z qemu-kvm: Failed to set msg fds.
2018-01-10T16:01:17.454111Z qemu-kvm: vhost_set_vring_addr failed: Invalid argument (22)
2018-01-10T16:01:17.454125Z qemu-kvm: Failed to set msg fds.
2018-01-10T16:01:17.454138Z qemu-kvm: vhost_set_features failed: Invalid argument (22)
2018-01-10 16:01:17.697+0000: shutting down, reason=crashed

Based on the errors, it seems that we are hitting "Issue 2" of bug 1450680. I'm marking this as a duplicate, even though we have not configured the interface to use 2 queues and there is no traffic in the guest.
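
For later readers: OVS 2.6 also supports dpdkvhostuserclient ports, where QEMU acts as the vhost-user server and OVS reconnects to its socket; whether switching modes avoids this particular failure depends on the resolution of bug 1450680. A sketch with a placeholder port name and socket path:

[root@compute-1 ~]# ovs-vsctl add-port br-int vhu-example -- set Interface vhu-example \
    type=dpdkvhostuserclient options:vhost-server-path=/var/run/openvswitch/vhu-example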

*** This bug has been marked as a duplicate of bug 1450680 ***

Comment 14 Sahid Ferdjaoui 2018-04-05 07:16:46 UTC
I was unable to reproduce the issue with the latest puddle; I suspect an issue in the OVS/DPDK configuration. Please re-open if necessary.

