Bug 1298776 - DPDK Live migration using virsh introduced >500ms downtime
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Peter Xu
QA Contact: Virtualization Bugs
Depends On:
Blocks: 1175463 1193173 1305606 1313485
Reported: 2016-01-14 21:38 EST by Peter Xu
Modified: 2016-09-01 22:44 EDT (History)
CC: 21 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-01 22:44:19 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
/var/log/libvirt/qemu/migrate_vm.log for both hosts (17.08 KB, application/x-gzip)
2016-01-15 23:41 EST, Peter Xu
All the scripts used to verify the bz (with mig_mon client_rr and server_rr commands) (1.74 KB, application/x-gzip)
2016-09-01 04:24 EDT, Peter Xu

Comment 7 Peter Xu 2016-01-15 23:41 EST
Created attachment 1115384 [details]
/var/log/libvirt/qemu/migrate_vm.log for both hosts
Comment 20 Jiri Denemark 2016-08-23 23:42:26 EDT
Libvirt doesn't set any downtime unless explicitly asked to, so the QEMU default is applied here.

The default speed set by libvirt is INT64_MAX on x86_64, which is 8P if I counted it correctly.
Comment 21 Dr. David Alan Gilbert 2016-08-24 15:17:17 EDT
I think if we're benchmarking downtime then it's best to set the bandwidth to something sensible;  I'm not sure it makes a difference but it feels right to do it.
I *think* qemu's default downtime is 300ms, so while it doesn't get you 500ms it does get you most of it!
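[For reference, both knobs can be driven over the monitor before the migration starts. A minimal sketch that only builds the JSON command lines, assuming the classic QMP commands migrate_set_downtime (value in seconds) and migrate_set_speed (value in bytes/s); the exact command names and units should be checked against the QEMU version in use:]

```python
import json

def qmp(execute, **arguments):
    """Serialize one QMP command as a JSON line."""
    cmd = {"execute": execute}
    if arguments:
        cmd["arguments"] = arguments
    return json.dumps(cmd)

# Cap the expected downtime at 100 ms (migrate_set_downtime takes seconds)
print(qmp("migrate_set_downtime", value=0.1))
# Cap the bandwidth at ~1 GiB/s instead of libvirt's INT64_MAX default
print(qmp("migrate_set_speed", value=1 << 30))
```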
Comment 22 Peter Xu 2016-08-24 23:40:35 EDT
(In reply to Dr. David Alan Gilbert from comment #21)
> I think if we're benchmarking downtime then it's best to set the bandwidth
> to something sensible;  I'm not sure it makes a difference but it feels
> right to do it.

Yes, that makes sense. My old tests didn't take these parameters into account (all runs used the defaults). That might be why libvirt got different results (libvirt sets the speed to MAX; thanks Jiri for providing this info).

From now on I will play with sensible values for these two.

> I *think* qemu's default downtime is 300ms, so while it doesn't get you
> 500ms it does get you most of it!

The problem is why I was getting 500ms even when I set the downtime to 100ms.

One thing I want to do is enhance my mig_mon tool to at least use the host time for measuring downtime, rather than the time in the migrating guest, to avoid the possibility that the guest time is unstable in some way.
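[The host-time idea boils down to: timestamp each successful reply from the guest on the host clock, and report the largest gap between consecutive replies. A minimal sketch; the timestamps below are hypothetical:]

```python
def max_downtime_ms(reply_times_ms):
    """Given host-clock timestamps (ms) of successful replies from the
    guest, the largest gap between consecutive replies bounds the
    downtime observed from outside the guest."""
    return max(b - a for a, b in zip(reply_times_ms, reply_times_ms[1:]))

# Replies every 30 ms, then 500 ms of silence during the switchover:
print(max_downtime_ms([0, 30, 60, 560, 590]))  # → 500
```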

One question that is totally unrelated to this bz: do we support postcopy for vhost-user migration? I played with it a bit and got this:

qemu-kvm: postcopy_ram_discard_range MADV_DONTNEED: Invalid argument
qemu-kvm: load of migration failed: Operation not permitted
qemu-kvm: socket_writev_buffer: Got err=32 for (131788/18446744073709551615)

The QEMU command line is:

$qemu -enable-kvm -m 1024 \
      -monitor telnet::333${index},server,nowait \
      -chardev socket,id=char0,path=/usr/local/var/run/openvswitch/vhost-user1  \
      -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
      -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
      -object memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on \
      -spice port=590${index},disable-ticketing \
      -numa node,memdev=mem -mem-prealloc \
      /root/remote/vm1.img \

Please just drop a hint if there is a quick answer (guest memory is on huge pages, with share enabled). Otherwise I'll look into it after I figure out why precopy is hitting these 500ms downtimes (I hope the spike goes away once I enhance my tool).
Comment 23 Dr. David Alan Gilbert 2016-08-25 06:22:49 EDT
(In reply to Peter Xu from comment #22)
> (In reply to Dr. David Alan Gilbert from comment #21)
> > I think if we're benchmarking downtime then it's best to set the bandwidth
> > to something sensible;  I'm not sure it makes a difference but it feels
> > right to do it.
> 
> Yes it sounds making sense. My old tests didn't take these parameters into
> account (all with default ones). That might be the reason why libvirt got
> different results (libvirt is setting speed to MAX, thanks Jiri for
> providing this info).
> 
> From now on I will play with sensible values for these two.
> 
> > I *think* qemu's default downtime is 300ms, so while it doesn't get you
> > 500ms it does get you most of it!
> 
> The problem is why I was getting 500ms even I set downtime to 100ms.
> 
> One thing I want to do is enhance my mig_mon tool to at least use host time
> for measuring downtime, rather than use the time in the migrating guest, to
> avoid the possiblility that guest time may not be stable in some way.

Oh yes, I wouldn't trust guest time for that.

> One question that is totally not related to this bz: do we support postcopy
> for vhost-user migration? I played with it a bit and I got this:
> 
> qemu-kvm: postcopy_ram_discard_range MADV_DONTNEED: Invalid argument
> qemu-kvm: load of migration failed: Operation not permitted
> qemu-kvm: socket_writev_buffer: Got err=32 for (131788/18446744073709551615)

I've not tried vhost-user, but....

> QEMU parameter is:
> 
> $qemu -enable-kvm -m 1024 \
>       -monitor telnet::333${index},server,nowait \
>       -chardev
> socket,id=char0,path=/usr/local/var/run/openvswitch/vhost-user1  \
>       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>       -object
> memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on \

We don't support huge page mapping in postcopy; so that's the most likely cause of that error.

>       -spice port=590${index},disable-ticketing \
>       -numa node,memdev=mem -mem-prealloc \
>       /root/remote/vm1.img \
> 
> Please just hint if there is quick answer (guest memory is on huge pages,
> and share enabled). Otherwise I'll check it out after I could figure out why
> precopy is getting these 500ms downs (I hope after I enhance my tool, the
> spike goes away).
Comment 24 Jiri Denemark 2016-08-31 09:20:47 EDT
It looks like this issue can be reproduced even without libvirt and the investigation is ongoing there anyway... moving to qemu-kvm-rhev.
Comment 25 Peter Xu 2016-09-01 04:16:23 EDT
I enhanced my testing tool for measuring downtime:

https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c

And added a new way to measure the downtime in this commit:

https://github.com/xzpeter/clibs/commit/81e6570c04c4d934e5b6165287e6a246bd5fadb3

After using the new tool, the spikes are gone.

----------------------------------------------

Here are the changed steps to run the test:

1. On the two hosts, install the latest OVS (dd52de45b719da1e52cc6894e245198fda5a748e, 2016-08-10). You need to download dpdk-16.07.zip first, compile DPDK (commenting out the *KNI* entries in .config), then compile and install OVS.

2. Install all the testing programs on host1 and in the guest (scripts will be uploaded later; mig_mon should be compiled from the source above).

3. Make sure each of the two hosts has a 10G card, with the two ports (p2p1, p2p2) connected directly. In this test, I am using p2p1 to connect to the OVS vswitch, and p2p2 to transfer the live migration data (I need to pre-configure IPs for p2p2; in my case, 1.2.4.10/24 and 1.2.4.11/24 on the two hosts respectively).

4. Run "prepare_migration.sh" on each of the two hosts: this will set up the OVS vswitch on each host, do the NFS mounting, etc.

5. Run "start_migration.sh" on host1 and wait for the guest to boot up.

6. In the guest, run:

  # ./mig_mon server_rr

7. On host1, run

  # ./mig_mon client_rr 1.2.3.4 30

  Here 1.2.3.4 is the guest IP, and 30 (ms) is the interval between UDP packets (and also the timeout for each UDP receive).

8. Hit enter in "start_migration.sh" to let the test continue. It will do ping-pong migration between the two hosts, while the downtime is measured with mig_mon along the way.

Using the "server_rr" and "client_rr" commands of mig_mon, no spike is observed (it will capture any spike > 30ms*2=60ms). Actually, the maximum downtime I saw was 33ms. This satisfies our basic need.
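[mig_mon itself is written in C (see the source linked above); the server_rr/client_rr round-trip scheme it implements can be sketched roughly as follows. Addresses, ports, and packet sizes here are illustrative, and the sketch binds to localhost rather than a real guest IP:]

```python
import socket
import threading
import time

def udp_echo_server(port, stop):
    """Guest side (server_rr): echo every datagram straight back.
    Bound to localhost for this sketch; a real guest would bind 0.0.0.0."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", port))
    s.settimeout(0.1)
    while not stop.is_set():
        try:
            data, addr = s.recvfrom(64)
            s.sendto(data, addr)
        except socket.timeout:
            pass
    s.close()

def client_rr(host, port, interval_ms, rounds):
    """Host side (client_rr): send a datagram every interval_ms, timestamp
    each reply on the host clock, and return the largest gap between
    consecutive replies, which approximates the guest's downtime."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(interval_ms / 1000.0)
    last = None
    max_gap = 0.0
    for seq in range(rounds):
        s.sendto(str(seq).encode(), (host, port))
        try:
            s.recvfrom(64)
        except socket.timeout:
            continue  # guest silent (e.g. mid-switchover); keep sending
        now = time.monotonic()
        if last is not None:
            max_gap = max(max_gap, now - last)
        last = now
        time.sleep(interval_ms / 1000.0)
    s.close()
    return max_gap
```

With a 30ms interval, a reply gap of two intervals (60ms) is the smallest spike the scheme can reliably distinguish from normal round-trip jitter, which is where the 30ms*2=60ms detection floor above comes from.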
Comment 26 Peter Xu 2016-09-01 04:20:56 EDT
So basically I am 99% sure that this bz is caused by incorrect measurement of the downtime (i.e., sampling timestamps in the migrating guest, when I should have sampled the time on a stable host). The only thing missing is confirming the problem, and why the time shifted.

However, that's another story (and I am actually not sure we can get very stable timing inside a migrating guest without the help of NTP or something alike). So if no one disagrees, I would like to mark this bz as NOTABUG.
Comment 27 Peter Xu 2016-09-01 04:24 EDT
Created attachment 1196620 [details]
All the scripts used to verify the bz (with mig_mon client_rr and server_rr commands)
