Bug 1329067

Summary: libvirt migration with --copy-storage-all limited to 32MB/s
Product: [Community] Virtualization Tools
Component: libvirt
Reporter: Jason Tibbitts <j>
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agedosier, berrange, clalancette, crobinso, itamar, jdenemar, j, laine, libvirt-maint, veillard, virt-maint
Type: Bug
Last Closed: 2020-04-17 12:33:21 UTC
Attachments:
  Log from the source machine
  Log from the target machine

Description Jason Tibbitts 2016-04-21 06:00:39 UTC
I'm migrating a bunch of VMs between hosts.  Everything is running the latest packages from the preview repository: libvirt 1.3.3-2 and qemu 2.6.0-0.1.rc2.

I've found that the storage copy portion of the migration is taking a good bit longer than I expect.  Running iftop shows things running at pretty much exactly 32MB/s, using about a third of a gigabit link.  My understanding is that 32MB/s is a common limit compiled into qemu for various things.

However, running virsh blockjob on the sending host gives the impression that it should be going a whole lot faster:

Block Copy: [ 82 %]    Bandwidth limit: 8796093022207 bytes/s (8.000 TiB/s)

I can't run blockjob on the receiving host:

error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)

But that's pretty much the same thing I get when I try to do most operations on either host involved in the migration, so I guess it's not surprising.

Also of note is that the final portion of the migration which runs after the storage has been copied seems to saturate the link without issue.  So it's just the block copy operation that runs slowly.

Comment 1 Cole Robinson 2016-04-21 14:37:34 UTC
Were you passing any other options to 'virsh migrate' ? Please give the full command

Comment 2 Jason Tibbitts 2016-04-21 15:49:08 UTC
Of course I should have provided that.  All I was doing was:

virsh migrate --live --persistent --copy-storage-all --compressed --verbose --desturi qemu+ssh://root.uh.edu/system VMNAME

I tried without --compressed, but it didn't seem to make any difference.

Comment 3 Jason Tibbitts 2016-04-26 03:29:25 UTC
Just to make sure it isn't something related to a crappy IO subsystem on the destination host, I tried running two migrations in parallel (with the same source and same destination machines), and they go faster than one.  So this really does seem to be some internal limit on the transfer speed rather than something imposed by network or disk transfer rates.

Just to be sure, I have some 2TB SSDs and 10G Ethernet cards on the way, which should certainly remove network and disk IO from the equation.  Also note that CPU usage on both machines is only a couple of percent, and the smallest one has 96GB of RAM, almost entirely free.

Comment 4 Cole Robinson 2016-04-26 11:52:45 UTC
(In reply to Jason Tibbitts from comment #3)
> Just to make sure it isn't something related to a crappy IO subsystem on the
> destination host, I tried running two migrations in parallel (with the
> same source and same destination machines) and they go faster than one.  So
> this really does seem to be some internal limit on the transfer speed rather
> than something imposed by network or disk transfer rates.
> 
> Just to be sure, I have some 2TB SSDs and 10G Ethernet cards on the way,
> which should certainly remove network and disk IO from the equation.  Also
> note that CPU usage on both machines is only a couple of percent, and the
> smallest one has 96GB of RAM, almost entirely free.

jirka do you know anything about this?

Comment 5 Jiri Denemark 2016-04-29 14:36:04 UTC
Hmm, what does virsh migrate-getspeed report? And could you provide debug logs from libvirtd gathered during your migration attempt? http://wiki.libvirt.org/page/DebugLogs should explain how to get them.
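
For anyone reproducing this, the wiki's debug-log setup boils down to two settings in libvirtd.conf. This is a hedged sketch: the filter set shown is illustrative, not the exact one from the wiki, and it writes to a local example file rather than the real /etc/libvirt/libvirtd.conf.

```shell
# Sketch of enabling libvirtd debug logs (filter set is illustrative).
# In practice you would edit /etc/libvirt/libvirtd.conf and restart libvirtd;
# here we append to a stand-in copy to show the two relevant settings.
conf=./libvirtd.conf.example
cat >> "$conf" <<'EOF'
log_filters="1:qemu 1:libvirt 3:security 3:event"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
EOF
grep -c '^log_' "$conf"   # 2: both settings are in place
```

After a `systemctl restart libvirtd` on both hosts, the migration attempt should leave a detailed trace in the configured log file.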

Comment 6 Jason Tibbitts 2016-04-30 01:20:29 UTC
So I set up a fresh machine with 3x 2TB SSDs, 96GB of RAM, and 2x E5330 CPUs (quad-core, 2.4GHz), which should be plenty.  I tried migrating two existing VMs to it.

A single migration runs at about 400Mb/s; two in parallel run at a total of about 700Mb/s.  Load on the destination machine stays under 0.3.  400Mb/s is actually a little faster than it was running when I first filed this ticket, which makes me wonder whether it could be related to I/O speed somehow.  Are the writes on the receiver synchronous?  If it were truly held up on disk I/O, I would expect the machine to show more load.

migrate-getspeed gives me the expected absurdly large value: 8796093022207
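
That value is not an arbitrary tuning: 8796093022207 is exactly 2^43 - 1 bytes/s, which looks like an effectively-unlimited sentinel rather than a real bandwidth cap (virsh renders it as 8 TiB/s). A quick sanity check with plain shell arithmetic:

```shell
# Verify that the reported "limit" is 2^43 - 1 bytes/s, i.e. unlimited
# for all practical purposes, matching the 8 TiB/s that virsh prints.
limit=8796093022207
echo $(( limit + 1 == 1 << 43 ))    # 1: the value is exactly 2^43 - 1
echo $(( (limit + 1) >> 40 ))       # 8: about 8 TiB/s
```

So the configured migration speed cannot be what is holding the copy to ~400Mb/s.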

I aborted those migrations and set up logging as indicated on the wiki page. I then started a single migration, waited for it to complete, and then turned logging off and restarted libvirtd on both the source and target machines.

Let me see if the logs are small enough to attach here.

Comment 7 Jason Tibbitts 2016-04-30 01:21:16 UTC
Created attachment 1152469 [details]
Log from the source machine.

Comment 8 Jason Tibbitts 2016-04-30 01:22:11 UTC
Created attachment 1152470 [details]
Log from the target machine

Comment 9 Jason Tibbitts 2016-04-30 01:56:27 UTC
Note that the migration starts at 01:03 and takes about 13 minutes.

Just for fun, I dd'd the logical volume holding the VM's storage to a file and scp'd it from the source to the target host.  The source has kinetic storage and was reading and writing on the same array so the dd itself took ten minutes.    

The scp took just under seven minutes and the load hit about 1.1 on the source, 2 on the target.  Obviously a good portion of that was the encryption overhead.

Comment 10 Jason Tibbitts 2016-08-31 20:41:29 UTC
I just wanted to note that I'm still seeing this issue.  I'm migrating from an F23 machine to an F24 machine; both machines have libvirt 1.3.3.2 and qemu 2.6.0.

As before, iftop reports about 400Mb/s of traffic, iotop reports around 50M/s write to a three-drive RAID1 array, and I can scp between the machines at line rate.  All drives are 2TB Samsung Pro SSDs, which should be able to do somewhat better than that.

I tried using migrate-setspeed to adjust the speed to a more realistic value.  I even tried setting it to something lower than the speed at which it's currently running.  None of that appeared to make any difference.
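
For reference, migrate-setspeed takes its --bandwidth argument in MiB/s, so capping the copy to roughly a gigabit would look something like the following (VMNAME is a placeholder; only the unit conversion is actually run here):

```shell
# migrate-setspeed expects MiB/s; compute a cap matching a 1 Gbit/s link.
gbit=1
mib_per_s=$(( gbit * 1000 * 1000 * 1000 / 8 / 1024 / 1024 ))
echo "virsh migrate-setspeed VMNAME --bandwidth $mib_per_s"   # ~119 MiB/s
```

If a cap well below the observed ~400Mb/s has no effect, that suggests the block-copy path is not honoring the configured bandwidth at all.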

I'm not sure if there's any way to actually get into qemu (or the monitor) to see what its idea of the nbd parameters is.

Comment 11 Fedora End Of Life 2016-11-25 07:25:15 UTC
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not
able to fix it before Fedora 23 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Jason Tibbitts 2016-12-06 22:46:45 UTC
Just set up an F25 server, with all SSD and 10G Ethernet.  Migrating from an F24 machine (also with SSD and 10G Ethernet) is still limited to about 400Mbps.

Comment 13 Jason Tibbitts 2017-03-11 00:16:17 UTC
I don't know what changed or when, but I just needed to do a migration (F25->F25, 10G Ethernet, SSDs on the source machine and 4-disk RAID1 on the target), and now things are moving at between 800Mbps and 1Gbps.  Far faster than it was.

The memory migration is still much faster (4-5Gbps as far as I can tell; it happens quickly), but at this point I can't rule out disk write bandwidth of the target as being the limiting factor of the storage migration.  It's certainly more than twice as fast as it was, which saves me a lot of time.

It doesn't look like libvirt has updated in F25 since December.  There were a couple of qemu updates in January (from 2.7.0 to 2.7.1, and then from 2.7.1-1 to -2).  Maybe one of those had something to do with it.

Comment 14 Cole Robinson 2017-05-03 19:12:23 UTC
Thanks for following up. I don't see anything obvious in the qemu 2.7.1 changelog over 2.7.0, but I could be missing something. It could also be a change in a dependent library. I'll move this to the upstream tracker, and we can close it in a while if it doesn't reappear.

Comment 15 Daniel Berrangé 2020-04-17 12:33:21 UTC
Closing old bug, since reproducing was difficult and no further reports have arrived of this problem.