Bug 1298776

Summary: DPDK Live migration using virsh introduced >500ms downtime
Product: Red Hat Enterprise Linux 7
Reporter: Peter Xu <peterx>
Component: qemu-kvm-rhev
Assignee: Peter Xu <peterx>
Status: CLOSED NOTABUG
QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.3
CC: berrange, dgilbert, dyuan, hhuang, huding, jean-mickael.guerin, jsuchane, juzhang, knoel, lhuang, mgandolf, peterx, pezhang, rbalakri, samuel.gauthier, thibaut.collet, vincent.jardin, virt-maint, weliao, xfu, zpeng
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-02 02:44:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1175463, 1193173, 1305606, 1313485    
Attachments:
  /var/log/libvirt/qemu/migrate_vm.log for both hosts (flags: none)
  All the scripts used to verify the bz (with mig_mon client_rr and server_rr commands) (flags: none)

Comment 7 Peter Xu 2016-01-16 04:41:50 UTC
Created attachment 1115384 [details]
/var/log/libvirt/qemu/migrate_vm.log for both hosts

Comment 20 Jiri Denemark 2016-08-24 03:42:26 UTC
Libvirt doesn't set any downtime unless explicitly asked to, so the QEMU default is applied here.

The default speed set by libvirt is INT64_MAX on x86_64, which is 8P if I counted it correctly.
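
(For reference, a minimal sketch of pinning both values explicitly from the libvirt side, assuming the domain is named migrate_vm; migrate-setmaxdowntime is typically issued while the migration job is running:

  # virsh migrate-setmaxdowntime migrate_vm 100
  # virsh migrate-setspeed migrate_vm 1000

where 100 is the tolerated downtime in milliseconds and 1000 is the bandwidth cap in MiB/s.)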

Comment 21 Dr. David Alan Gilbert 2016-08-24 19:17:17 UTC
I think if we're benchmarking downtime then it's best to set the bandwidth to something sensible;  I'm not sure it makes a difference but it feels right to do it.
I *think* qemu's default downtime is 300ms, so while it doesn't get you 500ms it does get you most of it!
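
(For runs driven directly through the telnet monitor rather than libvirt, the equivalent HMP knobs would be something like the following; the values are only illustrative:

  (qemu) migrate_set_downtime 0.1
  (qemu) migrate_set_speed 1G

where the downtime is given in seconds (0.1 = 100ms) and the speed in bytes per second.)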

Comment 22 Peter Xu 2016-08-25 03:40:35 UTC
(In reply to Dr. David Alan Gilbert from comment #21)
> I think if we're benchmarking downtime then it's best to set the bandwidth
> to something sensible;  I'm not sure it makes a difference but it feels
> right to do it.

Yes, that makes sense. My old tests didn't take these parameters into account (they all used the defaults). That might be why libvirt got different results (libvirt sets the speed to MAX; thanks Jiri for providing this info).

From now on I will use sensible values for these two parameters.

> I *think* qemu's default downtime is 300ms, so while it doesn't get you
> 500ms it does get you most of it!

The problem is why I was getting 500ms even though I set the downtime to 100ms.

One thing I want to do is enhance my mig_mon tool to at least use host time for measuring downtime, rather than the time inside the migrating guest, to avoid the possibility that guest time is not stable in some way.
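
(As a rough host-clock-only cross-check, and not a replacement for the tool, something like the following flags any gap between ICMP replies longer than 100ms during the switchover; the guest IP and the 50ms interval are just examples:

  # ping -D -i 0.05 1.2.3.4 | awk -F'[][]' '$2 ~ /[0-9]/ { if (t && ($2 - t) > 0.1) printf "gap %.0f ms\n", ($2 - t) * 1000; t = $2 }'

It only measures gaps between received replies using host timestamps, so it is coarse, but it cannot be fooled by guest clock jumps.)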

One question that is totally unrelated to this bz: do we support postcopy for vhost-user migration? I played with it a bit and got this:

qemu-kvm: postcopy_ram_discard_range MADV_DONTNEED: Invalid argument
qemu-kvm: load of migration failed: Operation not permitted
qemu-kvm: socket_writev_buffer: Got err=32 for (131788/18446744073709551615)

QEMU parameter is:

$qemu -enable-kvm -m 1024 \
      -monitor telnet::333${index},server,nowait \
      -chardev socket,id=char0,path=/usr/local/var/run/openvswitch/vhost-user1  \
      -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
      -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
      -object memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on \
      -spice port=590${index},disable-ticketing \
      -numa node,memdev=mem -mem-prealloc \
      /root/remote/vm1.img \

Please just drop a hint if there is a quick answer (guest memory is on huge pages, and share is enabled). Otherwise I'll check it out after I figure out why precopy is hitting these 500ms downtimes (I hope the spike goes away after I enhance my tool).

Comment 23 Dr. David Alan Gilbert 2016-08-25 10:22:49 UTC
(In reply to Peter Xu from comment #22)
> (In reply to Dr. David Alan Gilbert from comment #21)
> > I think if we're benchmarking downtime then it's best to set the bandwidth
> > to something sensible;  I'm not sure it makes a difference but it feels
> > right to do it.
> 
> Yes, that makes sense. My old tests didn't take these parameters into
> account (they all used the defaults). That might be why libvirt got
> different results (libvirt sets the speed to MAX; thanks Jiri for
> providing this info).
> 
> From now on I will use sensible values for these two parameters.
> 
> > I *think* qemu's default downtime is 300ms, so while it doesn't get you
> > 500ms it does get you most of it!
> 
> The problem is why I was getting 500ms even though I set the downtime to
> 100ms.
> 
> One thing I want to do is enhance my mig_mon tool to at least use host time
> for measuring downtime, rather than the time inside the migrating guest, to
> avoid the possibility that guest time is not stable in some way.

Oh yes, I wouldn't trust guest time for that.

> One question that is totally unrelated to this bz: do we support postcopy
> for vhost-user migration? I played with it a bit and got this:
> 
> qemu-kvm: postcopy_ram_discard_range MADV_DONTNEED: Invalid argument
> qemu-kvm: load of migration failed: Operation not permitted
> qemu-kvm: socket_writev_buffer: Got err=32 for (131788/18446744073709551615)

I've not tried vhost-user, but....

> QEMU parameter is:
> 
> $qemu -enable-kvm -m 1024 \
>       -monitor telnet::333${index},server,nowait \
>       -chardev
> socket,id=char0,path=/usr/local/var/run/openvswitch/vhost-user1  \
>       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>       -object
> memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on \

We don't support huge page mapping in postcopy, so that's the most likely cause of that error.
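
(One way to isolate that, sketched here with assumed values: drop the hugepage/vhost-user bits and retry postcopy with plain anonymous guest RAM, e.g.

  $qemu -enable-kvm -m 1024 -monitor telnet::3331,server,nowait /root/remote/vm1.img

then enable the capability on both monitors and trigger the switch on the source:

  (qemu) migrate_set_capability postcopy-ram on
  (qemu) migrate -d tcp:1.2.4.11:4444
  (qemu) migrate_start_postcopy

with the destination started with -incoming tcp:0:4444. If that works, the MADV_DONTNEED failure really is the hugepage-backed RAM; note that vhost-user itself needs the shared file-backed memory, so this only isolates the postcopy side.)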

>       -spice port=590${index},disable-ticketing \
>       -numa node,memdev=mem -mem-prealloc \
>       /root/remote/vm1.img \
> 
> Please just drop a hint if there is a quick answer (guest memory is on huge
> pages, and share is enabled). Otherwise I'll check it out after I figure out
> why precopy is hitting these 500ms downtimes (I hope the spike goes away
> after I enhance my tool).

Comment 24 Jiri Denemark 2016-08-31 13:20:47 UTC
It looks like this issue can be reproduced even without libvirt and the investigation is ongoing there anyway... moving to qemu-kvm-rhev.

Comment 25 Peter Xu 2016-09-01 08:16:23 UTC
I enhanced my testing tool for measuring downtime:

https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c

And added a new way to measure the downtime in this commit:

https://github.com/xzpeter/clibs/commit/81e6570c04c4d934e5b6165287e6a246bd5fadb3

After using the new tool, the spikes are gone.
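
(To reproduce with the tool, a plain single-file build should be enough, assuming no extra dependencies; the repository may also ship its own Makefile:

  # git clone https://github.com/xzpeter/clibs
  # cd clibs/bsd/mig_mon
  # gcc -O2 -o mig_mon mig_mon.c

)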

----------------------------------------------

Here are the updated steps to run the test:

1. On both hosts, install the latest OVS (commit dd52de45b719da1e52cc6894e245198fda5a748e, 2016-08-10). This requires downloading dpdk-16.07.zip first, compiling DPDK (commenting out the *KNI* entries in .config), then compiling and installing OVS.

2. Install all the testing programs on host1 and in the guest (scripts will be uploaded later; mig_mon should be compiled from the source above).

3. Make sure each of the two hosts has a 10G card and that the two ports (p2p1, p2p2) are connected directly between the hosts. In this test, I am using p2p1 to connect to the OVS vswitch and p2p2 to carry the live migration traffic (the IP for p2p2 needs to be pre-configured; in my case 1.2.4.10/24 and 1.2.4.11/24 on the two hosts respectively).

4. Run "prepare_migration.sh" on each of the two hosts: this will setup OVS vswitchs on each host. Also, do the NFS mounting, etc.

5. Run "start_migration.sh" on host1, wait for guest to boot up

6. In the guest, run:

  # ./mig_mon server_rr

7. In host 1, run

  # ./mig_mon client_rr 1.2.3.4 30

  Here 1.2.3.4 is the guest IP and 30 (ms) is the interval between UDP packets (it is also the timeout for each UDP receive).

8. Hit enter in "start_migration.sh" to let the test continue. It does ping-pong migration between the two hosts while downtime is measured with mig_mon along the way (a sketch of what this boils down to on the monitor is below).

Using "server_rr" and "client_rr" command of mig_mon, no spike is observed (it will capture all spike > 30ms*2=60ms). Actually what I saw is that maximum downtime is 33ms. This satisfy our basic need.

Comment 26 Peter Xu 2016-09-01 08:20:56 UTC
So basically I am 99% sure that this bz is caused by incorrect measurement of downtime (i.e., sampling timestamps inside the moving guest, when I should sample the time on a stable host instead). The only thing left is to confirm the problem and figure out why the time shifted.

However, that's another story (and I am actually not sure we can provide very stable timing inside a migrating guest without the help of NTP or something similar). So if no one disagrees, I would like to mark this bz as NOTABUG.

Comment 27 Peter Xu 2016-09-01 08:24:18 UTC
Created attachment 1196620 [details]
All the scripts used to verify the bz (with mig_mon client_rr and server_rr commands)