Created attachment 977192
libvirtd+qemu logs on source and destination hosts.

Description of problem:
When migrating a VM with --copy-storage-all or --copy-storage-inc to a host with insufficient disk space, no error message is produced.

Version-Release number of selected component (if applicable):
libvirt-1.2.8-11.el7.x86_64
qemu-kvm-rhev-2.1.2-17.el7.x86_64
kernel-3.10.0-220.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare 2 hosts (hostA is the migration source and hostB is the migration destination).

2. On hostA, prepare a qcow2 image with an OS installed inside (let's say /home/img/rhel7.qcow2), which is 9.0GB in size.

[root@hostA ~]# qemu-img info /home/img/rhel7.qcow2
image: /home/img/r7.qcow2
file format: qcow2
virtual size: 9.0G (9663676416 bytes)
disk size: 9.0G
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: true

3. On hostB, follow either 3.1 OR 3.2.

3.1 Prepare a 10GB blank image in the corresponding dir, but make sure hostB does not have enough disk space to receive a 9GB file.

[root@hostB ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
dev/sda1         40G   35G  2.4G  94% /

3.2 Prepare a 5GB blank image in the corresponding dir on hostB.

[root@hostB]# qemu-img create -f qcow2 /home/img/rhel7.qcow2 5G
Formatting '/home/img/rhel7.qcow2', fmt=qcow2 size=5368709120 encryption=off cluster_size=65536 lazy_refcounts=off

4. On hostA, create a VM (name=r7) with the following XML and start it.

#virsh create r7.xml
#virsh define r7.xml
#virsh start r7

cat r7.xml
<domain type='kvm'>
  <name>r7</name>
  <uuid>7d5a69ad-cf68-49b8-a94a-c8b7dca6afbd</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='x86_64' machine='rhel6.2.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/img/rhel7.qcow2'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:1a:9f:14'/>
      <source network='default'/>
      <model type='rtl8139'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='spice' autoport='yes'/>
    <graphics type='vnc' port='-1' autoport='yes' listen='0.0.0.0'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <sound model='ich6'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </sound>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='8192' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <redirdev bus='usb' type='spicevmc'>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
    </redirdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
</domain>

5. On hostA, start the migration. Either the --copy-storage-all or the --copy-storage-inc parameter reproduces this issue.

5.1 [root@hostA ~]#virsh migrate --live r7 qemu+ssh://<hostB's IP>/system --verbose --copy-storage-all
5.2 [root@hostA ~]#virsh migrate --live r7 qemu+ssh://<hostB's IP>/system --verbose --copy-storage-inc

6. On hostB, wait for the migration to end, then check the info of /home/img/rhel7.qcow2.

[root@hostB ~]# ll /home/img -h | grep rhel7.qcow2
-rw-r--r--. 1 qemu qemu 5.0G Jan 6 17:15 rhel7.qcow2

Actual Results:
1. No error message is shown to indicate that "--copy-storage-all"/"--copy-storage-inc" failed.
2. A broken VM r7 is running on hostB, and the good VM r7 on hostA was shut off.

Expected Results:
1. An error message is produced in the source host terminal, such as "copy storage failed, there is not enough disk space in /home/img/rhel7.qcow2".
2. When the copy fails, the source VM on hostA should not be shut down, and the destination VM on hostB should not be started. The system should always roll back to the previous "good" state after a failure.

Additional info:
In the destination host's qemu log, we can see:
...
nbd.c:nbd_trip():L1142: writing to file failed
block I/O error in device 'drive-ide0-0-0': No space left on device (28)

All source and destination host logs are attached. Please check them if required.
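For reference, the sizes quoted in steps 2 and 3 already show why the copy cannot succeed in either variant. The short Python sketch below only redoes that arithmetic; the helper name `shortfall` is made up for illustration, and the 2.4G "Avail" figure from df -h is assumed to be in GiB and is approximate.

```python
GIB = 1024 ** 3


def shortfall(required_bytes: int, available_bytes: int) -> int:
    """Return how many bytes are missing (0 if the data fits)."""
    return max(0, required_bytes - available_bytes)


# Values taken from the report above.
SOURCE_IMAGE_BYTES = 9663676416    # "virtual size" from qemu-img info on hostA
DEST_FREE_BYTES = int(2.4 * GIB)   # "Avail" from df -h on hostB (assumed GiB, rounded)
DEST_IMAGE_BYTES = 5368709120      # size= from qemu-img create in step 3.2

# Step 3.1: the filesystem on hostB cannot hold the copied data.
case_3_1 = shortfall(SOURCE_IMAGE_BYTES, DEST_FREE_BYTES)
# Step 3.2: the pre-created destination image is smaller than the source.
case_3_2 = shortfall(SOURCE_IMAGE_BYTES, DEST_IMAGE_BYTES)

print(f"3.1 is short by about {case_3_1 / GIB:.1f} GiB")
print(f"3.2 is short by {case_3_2 / GIB:.1f} GiB")
```

In both cases the shortfall is several GiB, so a failure with ENOSPC on the destination is the expected outcome; the bug is that the failure is not reported.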
Patch proposed upstream: https://www.redhat.com/archives/libvir-list/2015-January/msg00169.html
D'oh! The patch mentioned in comment 1 is for a different bug, which I meant to update. Please ignore it. The patch I proposed for this bug can be found here: https://www.redhat.com/archives/libvir-list/2015-January/msg00230.html
Another try: https://www.redhat.com/archives/libvir-list/2015-February/msg00358.html
And another try: https://www.redhat.com/archives/libvir-list/2015-February/msg00461.html
I've just pushed the patches upstream:

commit 80c5f10e865cda0302519492f197cb020bd14a07
Author:     Michal Privoznik <mprivozn>
AuthorDate: Tue Feb 10 16:25:27 2015 +0100
Commit:     Michal Privoznik <mprivozn>
CommitDate: Thu Feb 19 14:12:38 2015 +0100

    qemuMigrationDriveMirror: Listen to events

    https://bugzilla.redhat.com/show_bug.cgi?id=1179678

    When migrating with storage, libvirt iterates over domain disks and
    instruct qemu to migrate the ones we are interested in (shared, RO and
    source-less disks are skipped). The disks are migrated in series. No
    new disk is transferred until the previous one hasn't been quiesced.
    This is checked on the qemu monitor via 'query-jobs' command. If the
    disk has been quiesced, it practically went from copying its content
    to mirroring state, where all disk writes are mirrored to the other
    side of migration too. Having said that, there's one inherent error in
    the design. The monitor command we use reports only active jobs. So if
    the job fails for whatever reason, we will not see it anymore in the
    command output. And this can happen fairly simply: just try to migrate
    a domain with storage. If the storage migration fails (e.g. due to
    ENOSPC on the destination) we resume the host on the destination and
    let it run on partly copied disk.

    The proper fix is what even the comment in the code says: listen for
    qemu events instead of polling. If storage migration changes state an
    event is emitted and we can act accordingly: either consider disk
    copied and continue the process, or consider disk mangled and abort
    the migration.

    Signed-off-by: Michal Privoznik <mprivozn>

commit 76c61cdca20c106960af033e5d0f5da70177af0f
Author:     Michal Privoznik <mprivozn>
AuthorDate: Tue Feb 10 16:24:45 2015 +0100
Commit:     Michal Privoznik <mprivozn>
CommitDate: Thu Feb 19 14:12:38 2015 +0100

    qemuProcessHandleBlockJob: Take status into account

    Upon BLOCK_JOB_COMPLETED event delivery, we check if the job has
    completed (in qemuMonitorJSONHandleBlockJobImpl()). For better image,
    the event looks something like this:

    "timestamp": {"seconds": 1423582694, "microseconds": 372666}, "event":
    "BLOCK_JOB_COMPLETED", "data": {"device": "drive-virtio-disk0", "len":
    8412790784, "offset": 409993216, "speed": 8796093022207, "type":
    "mirror", "error": "No space left on device"}}

    If "len" does not equal "offset" it's considered an error, and we can
    clearly see "error" field filled in. However, later in the event
    processing this case was handled no differently to case of job being
    aborted via separate API. It's time that we start differentiate these
    two because of the future work.

    Signed-off-by: Michal Privoznik <mprivozn>

commit c37943a0687a8fdb08e6eda8ae4b9f4f43f4f2ed
Author:     Michal Privoznik <mprivozn>
AuthorDate: Tue Feb 10 15:32:59 2015 +0100
Commit:     Michal Privoznik <mprivozn>
CommitDate: Thu Feb 19 14:12:38 2015 +0100

    qemuProcessHandleBlockJob: Set disk->mirrorState more often

    Currently, upon BLOCK_JOB_* event, disk->mirrorState is not updated
    each time. The callback code handling the events checks if a blockjob
    was started via our public APIs prior to setting the mirrorState.
    However, some block jobs may be started internally (e.g. during
    storage migration), in which case we don't bother with setting
    disk->mirror (there's nothing we can set it to anyway), or other
    fields. But it will come handy if we update the mirrorState in these
    cases too. The event wasn't delivered just for fun - we've started the
    job after all.

    So, in this commit, the mirrorState is set to whatever job status
    we've obtained. Of course, there are some actions on some statuses
    that we want to perform. But instead of if {} else if {} else {} ...
    enumeration, let's move to switch().

    Signed-off-by: Michal Privoznik <mprivozn>

v1.2.12-155-g80c5f10
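To illustrate the event handling described in the commit messages above, here is a minimal Python sketch; it is not libvirt's actual C code. The names `MirrorState`, `classify_block_job_event` and `should_abort_migration` are hypothetical. The sample event is the one quoted in the commit message, with an opening brace added so that it forms a complete object. BLOCK_JOB_READY and BLOCK_JOB_CANCELLED are other QMP block job events that the commit messages do not name explicitly; mapping them to "ready" and "aborted" is an assumption of this sketch.

```python
from enum import Enum


class MirrorState(Enum):
    """Hypothetical stand-in for the per-disk mirror state kept by libvirt."""
    READY = "ready"          # bulk copy finished, writes are now mirrored
    COMPLETED = "completed"  # job finished and everything was copied
    FAILED = "failed"        # job ended with an error (e.g. ENOSPC)
    ABORTED = "aborted"      # job was cancelled on request


def classify_block_job_event(event: dict) -> MirrorState:
    """Map a QMP block job event to a mirror state.

    Follows the rule from the commit message: a BLOCK_JOB_COMPLETED event
    whose "len" differs from "offset", or which carries an "error" field,
    is a failure and must not be treated like a user-requested abort.
    """
    name = event.get("event")
    data = event.get("data", {})
    if name == "BLOCK_JOB_READY":
        return MirrorState.READY
    if name == "BLOCK_JOB_CANCELLED":
        return MirrorState.ABORTED
    if name == "BLOCK_JOB_COMPLETED":
        if data.get("error") or data.get("len") != data.get("offset"):
            return MirrorState.FAILED
        return MirrorState.COMPLETED
    raise ValueError(f"not a block job event handled here: {name!r}")


def should_abort_migration(state: MirrorState) -> bool:
    """A failed or aborted disk copy means the migration must not proceed."""
    return state in (MirrorState.FAILED, MirrorState.ABORTED)


# Event from the commit message above.
SAMPLE_EVENT = {
    "timestamp": {"seconds": 1423582694, "microseconds": 372666},
    "event": "BLOCK_JOB_COMPLETED",
    "data": {
        "device": "drive-virtio-disk0",
        "len": 8412790784,
        "offset": 409993216,
        "speed": 8796093022207,
        "type": "mirror",
        "error": "No space left on device",
    },
}

state = classify_block_job_event(SAMPLE_EVENT)
print(state, should_abort_migration(state))
```

The point of the sketch is the distinction the second commit introduces: a job that completed with an error is reported as failed, separately from an abort, so the migration code can stop before resuming the guest on a partly copied disk.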
I can reproduce this.
Tested with build libvirt-1.2.17-7.el7.x86_64

1: Prepare a guest with one storage volume larger than the target's remaining disk space.
2: Do storage migration.

# virsh migrate rhel7 --live qemu+ssh://$target_ip/system --verbose --migrate-disks vda,vdb --copy-storage-all
error: cannot allocate 31457280000 bytes in file '/var/lib/libvirt/images/1.img': No space left on device

# virsh migrate rhel7 --live qemu+ssh://$target_ip/system --verbose --copy-storage-all
error: cannot allocate 31457280000 bytes in file '/var/lib/libvirt/images/1.img': No space left on device

Check the guest on the source machine:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 13    rhel7                          running

--copy-storage-inc is not supported for now, according to bug 1249587.
Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2202.html