I'm migrating a bunch of VMs between hosts. Everything is running the latest packages from the preview repository: libvirt 1.3.3-2 and qemu 2.6.0-0.1.rc2. I've found that the storage copy portion of the migration is taking a good bit longer than I expect. Running iftop shows things running at pretty much exactly 32MB/s, using about a third of a gigabit link. My understanding is that 32MB/s is a common limit compiled into qemu for various things.

However, virsh blockjob run on the sending host gives the impression that it should be going a whole lot faster:

  Block Copy: [ 82 %]    Bandwidth limit: 8796093022207 bytes/s (8.000 TiB/s)

I can't run blockjob on the receiving host:

  error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)

But that's pretty much the same thing I get when I try most operations on either host involved in the migration, so I guess it's not surprising.

Also of note: the final portion of the migration, which runs after the storage has been copied, seems to saturate the link without issue. So it's just the block copy operation that runs slowly.
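As a sanity check on the numbers above (my arithmetic, not output from any tool): 32 MB/s works out to 256 Mbit/s, roughly a quarter to a third of a gigabit link, and the reported bandwidth limit is exactly 2^43 - 1 bytes/s, i.e. just under 8 TiB/s, which reads like an "effectively unlimited" sentinel rather than a real cap.

```shell
# Quick arithmetic checks on the figures quoted above.
mb_per_s=32
echo "$((mb_per_s * 8)) Mbit/s"        # 32 MB/s on the wire is 256 Mbit/s
echo "$(( (1 << 43) - 1 )) bytes/s"    # 2^43 - 1 = 8796093022207, just under 8 TiB/s
```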
Were you passing any other options to 'virsh migrate'? Please give the full command.
Of course, I should have provided that. All I was doing was:

  virsh migrate --live --persistent --copy-storage-all --compressed --verbose --desturi qemu+ssh://root.uh.edu/system VMNAME

I tried without --compressed, but it didn't seem to make any difference.
Just to make sure it isn't something related to a crappy IO subsystem on the destination host, I tried running two migrations in parallel (with the same source and same destination machines), and they go faster than one. So this really does seem to be some internal limit on the transfer speed rather than something imposed by network or disk transfer rates.

Just to be sure, I have some 2TB SSDs and 10G Ethernet cards on the way, which should certainly remove network and disk IO from the equation. Also note that CPU usage on both machines is only a couple of percent, and the smallest one has 96GB of RAM, almost entirely free.
(In reply to Jason Tibbitts from comment #3)
> Just to make sure it isn't something related to a crappy IO subsystem on the
> destination host, I tried running two migrigrations in parallel (with the
> same source and same destination machines) and they go faster than one. So
> this really does seem to be some internal limit on the transfer speed rather
> than something imposed by network or disk transfer rates.
>
> Just to be sure, I have some 2TB SSDs and 10G Ethernet cards on the way,
> which should certainly remove network and disk IO from the equation. Also
> note that CPU usage on both machines is only a couple of percent, and the
> smallest one has 96GB of RAM, almost entirely free.

Jirka, do you know anything about this?
Hmm, what does virsh migrate-getspeed report? And could you provide debug logs from libvirtd gathered during your migration attempt? http://wiki.libvirt.org/page/DebugLogs should explain how to get them.
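For the record, the wiki's recipe boils down to setting two options in /etc/libvirt/libvirtd.conf on both hosts and restarting libvirtd (a sketch from memory; the exact filter list the wiki recommends may differ):

```
# /etc/libvirt/libvirtd.conf -- enable debug logging, then restart libvirtd
log_filters="1:libvirt 1:qemu 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```

Remember to comment these back out (and restart libvirtd again) once the logs are captured, since level-1 logging is very verbose.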
So I set up a fresh machine with 3x 2TB SSDs, 96GB of RAM, and 2x E5330 CPUs (quad core, 2.4GHz), which should be plenty. I tried migrating two existing VMs to it. A single migration runs at about 400Mb/s. Two in parallel run at a total of about 700Mb/s. Load on the destination machine stays under 0.3.

400Mb/s is actually a little faster than it was running when I first filed this ticket. That makes me wonder if it could perhaps be related to I/O speed somehow. Are the writes on the receiver synchronous? If it were truly held up on disk I/O, I would expect the machine to show more load.

migrate-getspeed gives me the expected absurdly large value: 8796093022207

I aborted those migrations and set up logging as indicated on the wiki page. I then started a single migration, waited for it to complete, and then turned logging off and restarted libvirtd on both the source and target machines. Let me see if the logs are small enough to attach here.
Created attachment 1152469 [details]
Log from the source machine.
Created attachment 1152470 [details]
Log from the target machine.
Note that the migration starts at 01:03 and takes about 13 minutes.

Just for fun, I dd'd the logical volume holding the VM's storage to a file and scp'd it from the source to the target host. The source has kinetic storage and was reading and writing on the same array, so the dd itself took ten minutes. The scp took just under seven minutes, and the load hit about 1.1 on the source and 2 on the target. Obviously a good portion of that was the encryption overhead.
I just wanted to note that I'm still seeing this issue. I'm migrating from an F23 machine to an F24 machine; both machines have libvirt 1.3.3.2 and qemu 2.6.0. As before, iftop reports about 400Mb/s of traffic, iotop reports around 50M/s of writes to a three-drive RAID1 array, and I can scp between the machines at line rate. All drives are 2TB Samsung Pro SSDs, which should be able to do somewhat better than that.

I tried using migrate-setspeed to adjust the speed to a more real-world value. I even tried to set it to something lower than the speed at which it's currently running. None of that appeared to make any difference. I'm not sure if there's any way to actually get into qemu (or the monitor) to see what its idea of the nbd parameters is.
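One small consistency check (my arithmetic, not from the tools themselves): the iftop and iotop numbers describe the same stream, since 400 Mbit/s on the wire is exactly 50 MB/s of writes. So whatever is throttling the copy, both ends agree on the rate.

```shell
# 400 Mbit/s of network traffic and ~50 MB/s of disk writes are the same rate.
mbit=400
echo "$((mbit / 8)) MB/s"    # 50 MB/s
```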
This message is a reminder that Fedora 23 is nearing its end of life. Approximately 4 (four) weeks from now, Fedora will stop maintaining and issuing updates for Fedora 23. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '23'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 23 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Just set up an F25 server, with all SSD and 10G Ethernet. Migrating from an F24 machine (also with SSD and 10G Ethernet) is still limited to about 400Mbps.
I don't know what changed or when, but I just needed to do a migration (F25->F25, 10G ethernet, SSDs on the source machine and 4-disk RAID1 on the target), and now things are moving around at between 800Mbps and 1Gbps. Far faster than it was. The memory migration is still much faster (4-5Gbps as far as I can tell; it happens quickly), but at this point I can't rule out disk write bandwidth of the target as being the limiting factor of the storage migration. It's certainly more than twice as fast as it was, which saves me a lot of time.

It doesn't look like libvirt has updated in F25 since December. There were a couple of qemu updates in January (from 2.7.0 to 2.7.1, and then from 2.7.1-1 to -2). Maybe one of those had something to do with it.
Thanks for following up. I don't really see anything obvious in the qemu 2.7.1 changelog over 2.7.0, but I could be missing something. It could also be some change in a dependent library. I'll move this to the upstream tracker, and we can close it in a while if it doesn't reappear.
Closing old bug, since reproducing was difficult and no further reports of this problem have arrived.