Bug 1115572

Summary:	drive-mirror with "mode":"existing" fails poorly if destination is not large enough
Product:	Red Hat Enterprise Linux 7	Reporter:	Eric Blake <eblake>
Component:	qemu-kvm-rhev	Assignee:	Ademar Reis <areis>
Status:	CLOSED DUPLICATE	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.0	CC:	amit.shah, berrange, cfergeau, crobinso, dwmw2, dyuan, hhuang, itamar, juzhang, pbonzini, rjones, scottt.tw, shyu, virt-maint
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	1114793	Environment:
Last Closed:	2014-07-04 02:16:26 UTC	Type:	Bug
Bug Depends On:	1114793
Bug Blocks:

Description Eric Blake 2014-07-02 15:50:50 UTC

Cloning to RHEL. Libvirt would really like qemu to error out if the destination is not big enough. It may be possible for libvirt to do the sanity check itself if qemu is unpatched (at which point we would want to reassign this bug to libvirt), although it feels better to get it done at the bottom of the stack.

Also, the less-than-ideal error reporting highlights a design issue - if libvirt misses the BLOCK_JOB_ERROR event (such as across a libvirtd restart), the job just silently disappears, and libvirt has no idea if it succeeded or failed.  I've raised some of these concerns on the upstream list:
https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg00268.html

where it was suggested that libvirt may need to start using rerror= and werror= settings to make sure the job sticks around rather than disappearing after errors, so we may need fixes in both programs after all.

+++ This bug was initially created as a clone of Bug #1114793 +++

Description of problem:
https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg07377.html
I tested on F20 with fedora-virt-preview, but suspect RHEL/RHEV may benefit from cloning this bug. When doing a block copy into an existing file, it would be nice if qemu automatically resized the destination to be large enough, or at a bare minimum failed up front if the size is wrong. Instead, the current behavior is to start the job silently and successfully, then fail once the destination runs out of space; if management misses the 'BLOCK_JOB_COMPLETED with error' event, there is NO indication that the job failed or why.
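
The up-front failure asked for above could look roughly like the following check, done by whatever layer starts the job. This is only a sketch: the helper name dest_large_enough is hypothetical, it is not something qemu or libvirt provides, and it assumes a raw destination whose usable size equals its file size. The required byte count would have to come from the source's virtual size (for example the "virtual-size" field of `qemu-img info --output=json`).

```shell
#!/bin/sh
# Hypothetical helper (not part of qemu or libvirt):
# succeed only if the file "$2" has an apparent size of at least "$1" bytes.
dest_large_enough() {
    required=$1
    dest=$2
    # A missing destination is reported separately from one that is too small.
    [ -f "$dest" ] || return 2
    # wc -c prints the byte count; tr strips padding some implementations add.
    actual=$(wc -c < "$dest" | tr -d ' ')
    [ "$actual" -ge "$required" ]
}
```

For the reproducer below, `dest_large_enough 10485760 copy.img` would return non-zero for the empty file created by `touch`, since the 10M source is 10485760 bytes.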

Version-Release number of selected component (if applicable):
qemu-kvm-2.0.0-7.fc20.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run the following script:
#!/bin/sh
cd /tmp

rm -f base.img snap1.img snap2.img copy.img
virsh destroy testvm1 2>/dev/null

# base.img <- snap1.img <- snap2.img
qemu-img create -f raw base.img 10M
qemu-img create -f qcow2 -b base.img -o backing_fmt=raw snap1.img
qemu-img create -f qcow2 -b snap1.img -o backing_fmt=qcow2 snap2.img
# set up blank space to hold the copy
touch copy.img
# cp base.img copy.img # uncomment this to see expected results

virsh create /dev/stdin <<EOF
<domain type='kvm'>
 <name>testvm1</name>
 <memory unit='MiB'>256</memory>
 <vcpu>1</vcpu>
 <os>
   <type arch='x86_64'>hvm</type>
 </os>
 <devices>
   <disk type='file' device='disk'>
     <driver name='qemu' type='qcow2'/>
     <source file='$PWD/snap2.img'/>
     <target dev='vda' bus='virtio'/>
   </disk>
   <graphics type='vnc'/>
 </devices>
</domain>
EOF

# check for events
virsh event testvm1 block-job --loop --timeout 10 &
pid=$!
sleep 1
# run the blockcopy
virsh blockcopy testvm1 vda --wait --verbose --raw /tmp/copy.img --reuse-external
echo job started
sleep 5
virsh blockjob testvm1 vda --abort
wait $pid
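
As a possible alternative to the commented-out `cp base.img copy.img` line in the script above, the destination could be pre-sized as a sparse file before the copy starts. This is a sketch only: the helper name presize_dest is hypothetical, it assumes that a raw destination merely needs an apparent size at least as large as the source's virtual size (10M here), and it has not been verified against the affected qemu build. `truncate` is the GNU coreutils tool.

```shell
#!/bin/sh
# Hypothetical helper: set the apparent size of "$1" to "$2"
# (a size accepted by GNU truncate, e.g. 10M for 10 MiB).
# The file is created if missing and stays sparse where the filesystem allows.
presize_dest() {
    truncate -s "$2" "$1"
}
```

In the script this would replace `touch copy.img` with `presize_dest copy.img 10M`.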


Actual results:
Block Copy: [  0 %]event 'block-job' for domain testvm1: Block Copy for /tmp/snap2.img failed

Now in mirroring phase
job started
event loop timed out
events received: 1

error: Requested operation is not valid: No active operation on device: drive-virtio-disk0



Expected results:
Block Copy: [  0 %]event 'block-job' for domain testvm1: Block Copy for /tmp/snap2.img ready
Block Copy: [100 %]
Now in mirroring phase
job started
event 'block-job' for domain testvm1: Block Copy for /tmp/snap2.img completed

event loop timed out
events received: 2


Additional info:

--- Additional comment from Cole Robinson on 2014-07-02 08:40:19 MDT ---

Fedora qemu bugs have much less visibility than those filed against RHEL. Since your mention of this issue on the mailing list didn't get a response yet, I'd suggest cloning or fully moving this issue to RHEL where resources are more likely to be allocated.

Comment 1 juzhang 2014-07-04 02:16:26 UTC


*** This bug has been marked as a duplicate of bug 1114962 ***