Bug 612089

Summary: virtio-blk: Possible loss of flushes after werror/rerror stop
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: low
Reporter: Kevin Wolf <kwolf>
Assignee: Kevin Wolf <kwolf>
QA Contact: Virtualization Bugs <virt-bugs>
CC: chellwig, juzhang, mkenneth, tburke, virt-maint
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2010-07-16 08:57:28 UTC

Description Kevin Wolf 2010-07-07 10:01:18 UTC
Quoting an email thread:

> > 2. If you have several ENOSPACE events coming from several VCPU at once,
> >      will you be able to keep the ordering once the VM is resumed
> >     (assuming one of them is a barrier)
>
> I don't think this is related to VCPUs, but rather to the host block
> layer sending responses. The flush will probably be dropped in this
> scenario.

Dor suggested inserting a flush after restarting any requests, just in case. That sounds doable and could help a bit, even though it's not entirely correct behaviour.

Comment 2 chellwig@redhat.com 2010-07-16 05:50:20 UTC
Analysis:

If we stop the VM on an error in virtio_blk_handle_rw_error we put the request we stop on in front
of the s->rq queue, which we then serialize in virtio_blk_save.

Once the VM is restarted we de-serialize them again in virtio_blk_load, and finally resume them in virtio_blk_dma_restart_bh.  We do this for all requests, including the flush requests.

So we definitely do handle flushes on VM abort / restart.  Now the question is whether we handle the requests (flush or not) in the right order.

 - in virtio_blk_handle_rw_error we add the newly failing request to the front of the queue,
   effectively reversing the order of the requests.
 - in virtio_blk_save we serialize the requests in the order found in the queue
 - in virtio_blk_load each deserialized request is added to the front of the s->rq queue, effectively reversing the order in which it was serialized
 - virtio_blk_dma_restart_bh then handles the requests in order.

So I think we are fine, but the way the s->rq queue is manipulated smells like this happened more by accident than by intention.
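
To make the double reversal easier to see, here is a minimal standalone C sketch (illustrative only, not the qemu-kvm code; the list and helper names are made up). Prepending the failed requests reverses their order once, and prepending again while loading reverses it back, which is why the restart bottom half ends up seeing the original submission order:

#include <stdio.h>
#include <stdlib.h>

struct req {
    int id;
    struct req *next;
};

/* Prepend to the head, the way virtio_blk_handle_rw_error and
 * virtio_blk_load manipulate s->rq according to the analysis above. */
static void push_front(struct req **head, int id)
{
    struct req *r = malloc(sizeof(*r));
    r->id = id;
    r->next = *head;
    *head = r;
}

int main(void)
{
    struct req *rq = NULL, *loaded = NULL, *r;

    /* Requests 1, 2, 3 fail in submission order; each prepend puts the
     * newest failure first, so the queue reads 3 -> 2 -> 1. */
    for (int id = 1; id <= 3; id++)
        push_front(&rq, id);

    /* "Save" walks the queue in that order (3, 2, 1); "load" prepends
     * each deserialized request, reversing it back to 1 -> 2 -> 3. */
    for (r = rq; r; r = r->next)
        push_front(&loaded, r->id);

    /* The restart bottom half would now process 1, 2, 3 again. */
    for (r = loaded; r; r = r->next)
        printf("restart request %d\n", r->id);

    return 0;
}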

Comment 3 Kevin Wolf 2010-07-16 07:47:46 UTC
(In reply to comment #2)
> Analysis:
> 
> If we stop the VM on an error in virtio_blk_handle_rw_error we put the request
> we stop on in front
> of the s->rq queue, which we then serialize in virtio_blk_save.
> 
> Once the VM is restarted we de-serialize them again in virtio_blk_load, and
> finally resume them in virtio_blk_dma_restart_bh.  We do this for all requests,
> including the flush requests.
> 
> So we definitely do handle flushes on VM abort / restart.

Well, yes and no. If a flush fails, we put it into the queue, that's right. But that's not the whole truth: the flush may succeed while a prior write request fails. That should not be a problem as long as we signal the failure back to the guest, but when we restart the write request ourselves, the write succeeds from the guest's POV even though it has moved across the barrier. So I think we need to flush once again after the restarted write request.

I think this is actually a pretty common case with RHEV: Writes won't succeed because of ENOSPC, but a flush doesn't need more space and still works.
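
A hedged sketch of the scenario and of the extra flush proposed here (all names, such as queued_req, restart_request and issue_flush, are hypothetical stand-ins, not qemu-kvm functions): a write fails with ENOSPC and is parked on the error queue while a later flush succeeds because it needs no space; resubmitting the queued requests and then issuing one more flush keeps the restarted write from silently ending up behind that already-completed barrier.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical request record, not a qemu-kvm structure. */
struct queued_req {
    int id;
    bool is_write;
    struct queued_req *next;
};

static void restart_request(struct queued_req *r)
{
    printf("resubmitting %s %d\n", r->is_write ? "write" : "flush", r->id);
}

static void issue_flush(void)
{
    printf("extra flush after the restarted writes\n");
}

/* Resume everything that was parked on the error queue, then flush once
 * more so a restarted write cannot end up ordered after a flush that the
 * guest already saw complete. */
static void restart_queued_requests(struct queued_req *rq)
{
    bool restarted_write = false;
    struct queued_req *r;

    for (r = rq; r; r = r->next) {
        restart_request(r);
        if (r->is_write)
            restarted_write = true;
    }

    if (restarted_write)
        issue_flush();
}

int main(void)
{
    /* Write 7 failed with ENOSPC and was queued; the later flush
     * already succeeded because it needed no additional space. */
    struct queued_req w = { .id = 7, .is_write = true, .next = NULL };

    restart_queued_requests(&w);
    return 0;
}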

Comment 4 chellwig@redhat.com 2010-07-16 08:34:36 UTC
(In reply to comment #3)
> Well, yes and no. If a flush fails, we put it into the queue, that's right. But
> that's not the whole truth: the flush may succeed while a prior write request
> fails. That should not be a problem as long as we signal the failure back to
> the guest, but when we restart the write request ourselves, the write succeeds
> from the guest's POV even though it has moved across the barrier. So I think
> we need to flush once again after the restarted write request.

I/O ordering is a host problem.  Linux guests will never start a flush request before the queue has been drained, to avoid dealing with the complex problem of out-of-order completions for cache flushes.  While I can't verify this for other guests, I can't see any other way to deal with the typical error scenarios.
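
For illustration, a tiny sketch of the guest-side ordering described here (generic pseudo-logic under that assumption, not Linux kernel code): the flush is only submitted once every earlier write has completed, so a write the host has stopped and queued can never still be outstanding when the flush that follows it is issued.

#include <stdio.h>

#define NR_WRITES 3

static int writes_in_flight;

static void submit_write(int id)
{
    writes_in_flight++;
    printf("guest submits write %d\n", id);
}

static void complete_write(int id)
{
    writes_in_flight--;
    printf("write %d completes in the guest\n", id);
}

static void submit_flush(void)
{
    /* The drain: the guest never issues the flush while earlier writes
     * are still pending, so their ordering never depends on the host. */
    if (writes_in_flight == 0)
        printf("guest submits flush (queue drained)\n");
}

int main(void)
{
    int id;

    for (id = 0; id < NR_WRITES; id++)
        submit_write(id);
    for (id = 0; id < NR_WRITES; id++)
        complete_write(id);

    submit_flush();
    return 0;
}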

Comment 5 Kevin Wolf 2010-07-16 08:57:28 UTC
As Christoph explained on IRC, the guest not only does drain the queue before flushes, it has to in order to get the right semantics even during normal operation. Therefore this is not a problem; closing.