Bug 1211689

Summary: atomic live snapshots are not atomic with dataplane-backed devices
Product: Red Hat Enterprise Linux 7
Reporter: Paolo Bonzini <pbonzini>
Component: qemu-kvm-rhev
Assignee: Fam Zheng <famz>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.1
CC: amureini, areis, famz, huding, jcody, juzhang, knoel, michal.skrivanek, pbonzini, pezhang, stefanha, virt-maint, xfu
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.3.0-24.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-04 16:37:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1101577

Description Paolo Bonzini 2015-04-14 15:30:24 UTC
The idea is to add a bdrv_pause/bdrv_resume pair, where:

- bdrv_pause checks whether there is an operation blocker for PAUSE; if there is, it fails.
- Otherwise, bdrv_pause invokes a notifier list if this is the outermost call; nested calls do nothing.
- bdrv_resume does the same, but does not need to check a blocker.
- Dataplane registers pause/resume notifiers for its BlockBackend, which respectively disconnect/reconnect the ioeventfd.
- Callers of bdrv_drain/bdrv_drain_all are a good place to add pause/resume calls. Because bdrv_pause_all() makes little sense (e.g. when doing something related to the CPU, we do not care about pausing block job targets), bdrv_drain_all() should in many cases be replaced by bdrv_drain().
- For transactions, bdrv_drain_all() should probably be replaced by bdrv_drain() invocations in the prepare callback, so that the callback can also invoke bdrv_pause() before bdrv_drain().

Optionally:
- Block jobs can register pause/resume notifiers for their targets, which pause/resume the job, or they can block PAUSE on their targets. In particular, drive-backup should block PAUSE on its target; everything else can register notifiers. (A rough sketch of the proposed API follows below.)
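
For illustration, here is a minimal C sketch of how such a pause/resume pair might look in the QEMU block layer. It follows the proposal above, not the code that was eventually merged: the pause_count field, the pause/resume notifier lists and the BLOCK_OP_TYPE_PAUSE op-blocker type are hypothetical additions, while bdrv_op_is_blocked() and notifier_list_notify() are existing helpers.

#include "qemu/osdep.h"
#include "qemu/notify.h"
#include "qapi/error.h"
#include "block/block_int.h"

/*
 * Hypothetical additions to BlockDriverState (not in the shipped code):
 *     int          pause_count;
 *     NotifierList pause_notifiers;
 *     NotifierList resume_notifiers;
 */

int bdrv_pause(BlockDriverState *bs, Error **errp)
{
    /* Fail if a user (e.g. a backup target) has blocked PAUSE. */
    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_PAUSE, errp)) {
        return -EBUSY;
    }
    /* Only the outermost call fires the notifiers; nested calls just count. */
    if (bs->pause_count++ == 0) {
        notifier_list_notify(&bs->pause_notifiers, bs);
    }
    return 0;
}

void bdrv_resume(BlockDriverState *bs)
{
    assert(bs->pause_count > 0);
    /* Only the outermost resume fires the notifiers again. */
    if (--bs->pause_count == 0) {
        notifier_list_notify(&bs->resume_notifiers, bs);
    }
}

Dataplane would then register a pause notifier on its BlockBackend that disconnects the ioeventfd (and a resume notifier that reconnects it), so the guest cannot submit new requests between bdrv_pause() and the subsequent bdrv_drain().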

-------------------------------------------------------------------------------

IRC conversation:

<bonzini> stefanha: iirc we never implemented "stop ioeventfd while the main context is acquiring an additional aiocontext", have we?
<stefanha> bonzini: The necessary AioContexts need to be acquired for the entire duration of the transaction,
<bonzini> stefanha: yes, but the VM can still write in the meanwhile
<stefanha> bonzini: I think AioContext acquire/release is oversimplifying the problem,
<stefanha> there are actually multiple mechanisms we need.
<stefanha> bonzini: In this case I think a device callback might make sense,
<bonzini> stefanha: yes, like a pause/restart notifier on AioContext
<stefanha> so the monitor can tell the device associated with a BlockBackend to stop submitting requests.
<bonzini> stefanha: exactly, pausing between acquire and bdrv_drain, and restarting before release
<stefanha> Yes
<stefanha> bonzini: there are two different concepts here:
<stefanha> 1. AioContext acquire/release is about thread-safety
<stefanha> 2. Device pause/resume is about quiescing I/O from the guest
<stefanha> I still think #1 is useful
<bonzini> stefanha: yes it definitely is. the important point for me is that thread-safety can be made as fine-grained as you want
<bonzini> stefanha: but pause/resume cannot be made fine-grained
<bonzini> stefanha: so far we had coarse-grained critical sections, so we could coalesce the two
<bonzini> stefanha: (could, but didn't since we never implemented #2 :D)
<stefanha> bonzini: We need to be careful about pause/resume
<stefanha> There are also timers and blockjobs
<stefanha> both could cause additional I/O
<stefanha> So even if we stop the device from submitting requests, we also need to pause blockjobs and maybe also timers.
<stefanha> We should do the same for QMP 'transaction'.
<bonzini> stefanha: it's in qemu-devel, i don't recall you merged it
<bonzini> stefanha: timers should be included in bdrv_drain_all, or disabled by other notifiers
<bonzini> stefanha: block jobs are interesting
<stefanha> e.g. drive-backup
<stefanha> It hooks into the write callback
<stefanha> The write request is paused until drive-backup has stashed away the data,
<stefanha> but we might hit a deadlock there if the blockjob is paused...
<stefanha> This is hard
* stefanha goes to write fizzbuzz instead
<bonzini> stefanha: perhaps pause/resume should have blockers too
<bonzini> stefanha: drive-backup can block pause/resume on the destination
<stefanha> bonzini: Yes, we can use op blockers
<bonzini> stefanha: and transaction can limit the pause/resume to things it has to operate on
<bonzini> stefanha: but streaming, for example, is compatible with pause/resume, because its writes do not change disk contents
<stefanha> So it's more nuanced than just 'writes are forbidden'
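
To make the op-blocker idea from the conversation concrete, here is a rough sketch of how drive-backup could block PAUSE on its target, again assuming a hypothetical BLOCK_OP_TYPE_PAUSE value (bdrv_op_block(), bdrv_op_unblock() and error_setg() are existing QEMU helpers). Jobs whose writes do not change guest-visible disk contents, such as block-stream, would instead register pause/resume notifiers and simply pause themselves.

/* Hypothetical sketch: inside backup job setup, once the job owns 'target'. */
Error *blocker = NULL;

error_setg(&blocker, "backup target cannot be paused");
bdrv_op_block(target, BLOCK_OP_TYPE_PAUSE, blocker);   /* hypothetical op type */

/* ... job runs; bdrv_pause(target, ...) now fails with the message above ... */

/* On job completion or cancellation: */
bdrv_op_unblock(target, BLOCK_OP_TYPE_PAUSE, blocker);
error_free(blocker);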

Comment 4 Miroslav Rezanina 2015-09-18 11:54:15 UTC
Fix included in qemu-kvm-rhev-2.3.0-24.el7

Comment 5 Pei Zhang 2015-09-22 02:25:25 UTC
Hi Paolo,
From KVM QE POV, could you please tell me how to verify this bug? Thanks.

Comment 6 Pei Zhang 2015-09-22 03:34:13 UTC
KVM QE tested:
Do three live snapshots in a transaction, and make the third one fail. qemu-kvm-rhev-2.3.0-23.el7.x86_64 and qemu-kvm-rhev-2.3.0-24.el7.x86_64 give the same result: all of the live snapshots fail.

Versions: 
kernel:3.10.0-318.el7.x86_64
qemu-kvm-rhev:
(1)qemu-kvm-rhev-2.3.0-23.el7.x86_64 
(2)qemu-kvm-rhev-2.3.0-24.el7.x86_64


Steps:
1. Boot the guest with two data disks, with dataplane enabled on all disks:
# /usr/libexec/qemu-kvm -name rhel6.7 -machine pc-i440fx-rhel7.2.0,accel=kvm \
-cpu SandyBridge -m 2G,slots=256,maxmem=40G -numa node \
-smp 4,sockets=2,cores=2,threads=1 \
-uuid 82b1a01e-5f6c-4f5f-8d27-3855a74e6b6b \
-netdev tap,id=hostnet0 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:54:00:5c:88:6d \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \
-spice port=5900,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \
-qmp tcp:0:5555,server,nowait \
-object iothread,id=iothread0 \
-drive file=/home/rhel6.7_virtio.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,iothread=iothread0 \
-object iothread,id=iothread1 \
-drive file=/home/data1.qcow2,format=qcow2,if=none,id=drive-virtio-blk1,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-blk1,id=virtio-blk1,iothread=iothread1 \
-object iothread,id=iothread2 \
-drive file=/home/data2.qcow2,format=qcow2,if=none,id=drive-virtio-blk2,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-blk2,id=virtio-blk2,iothread=iothread2


2. Do 3 live snapshots in a transaction, and make the third one fail (invalid path):
{ "execute": "transaction", "arguments": { "actions": [ { 'type': 'blockdev-snapshot-sync', 'data' : { "device": "drive-virtio-blk0",
"snapshot-file": "/home/snapshotA.qcow2", "format": "qcow2" } }, { 'type': 'blockdev-snapshot-sync', 'data' : { "device": "drive-virtio-blk1", "snapshot-file": "/home/snapshotB.qcow2", "mode": "absolute-paths", "format": "qcow2" } }, { 'type': 'blockdev-snapshot-sync', 'data' : { "device": "drive-virtio-blk2", "snapshot-file": "/homex/snapshotC.qcow2", "format": "qcow2" } }] } }


3. All 3 live snapshots fail:
{"error": {"class": "GenericError", "desc": "Could not create file: No such file or directory"}}

# ls /home/
snapshotA.qcow2
snapshotB.qcow2
...

(qemu) info block
drive-virtio-blk0: /home/rhel6.7_virtio.qcow2 (qcow2)
    Cache mode:       writeback

drive-virtio-blk1: /home/data1.qcow2 (qcow2)
    Cache mode:       writeback

drive-virtio-blk2: /home/data2.qcow2 (qcow2)
    Cache mode:       writeback

...

Comment 8 juzhang 2015-09-28 02:29:18 UTC
According to comment 6 and comment 7, setting this issue to verified.

Comment 10 errata-xmlrpc 2015-12-04 16:37:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2546.html