Bug 1382165 - virDomainGetBlockJobInfo: Adjust job reporting based on QEMU stats & the "ready" field of `query-block-jobs`
Summary: virDomainGetBlockJobInfo: Adjust job reporting based on QEMU stats & the "rea...
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: All
OS: Linux
unspecified
unspecified
Target Milestone: ---
Assignee: Peter Krempa
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-10-05 22:50 UTC by Kashyap Chamarthy
Modified: 2019-01-31 12:11 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-31 12:11:54 UTC
Embargoed:


Attachments (Terms of Use)

Description Kashyap Chamarthy 2016-10-05 22:50:17 UTC
Description of problem
----------------------

[Thanks to Eric Blake for the discussion on #virt, OFTC IRC, and for
exposing this bug.]

When virDomainBlockRebase() is invoked to start a copy job, then
aborting the said copy operation with virDomainBlockJobAbort() + flag
VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT can result in a potential race
condition (due to the way the virDomainGetBlockJobInfo() reports the job
status) where the pivot operation fails.


Race condition window
---------------------

libvirt finds cur==end AND sends a pivot request, all in the window
before QEMU would have sent "ready": true field [emitted as part of the
QMP `query-block-jobs` command's response, indicating that the job has
actually completed], however the pivot request fails because it requires
"ready": true.

So Eric suggests:

    QEMU 2.0 or 1.x probably had a synchronous setup where you could
    never observer cur==end on a non-ready job. But I don't remember
    enough history to point to when QEMU switched jobs to be a bit more
    asynchronous.  Maybe there was no qemu regression - maybe it was
    BECAUSE of other block-job additions in 2.2 that offset==len was no
    longer reliable.  I don't know that for sure.

    But what it DOES sound like is that IF qemu reports "ready": false,
    offset==len is not reliable, and libvirt should be taught to fudge
    that.

    And hopefully, QEMU too old to report "ready:" at all is reliable
    with regards to offset==len, because that's all we have to go by.

* * *

The "ready" boolean field (for `query-block-jobs`) was introduced via
QEMU commit ef6dbf1 (available in QEMU v2.2.0 or above):

    http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ef6dbf1e4 --
    blockjob: Add "ready" field

And, libvirt was fixed to use the above field in this commit (available
in libvirt v1.2.18 or above):

    libvirt.org/git/?p=libvirt.git;a=commitdiff;h=eae5924 -- qemu:
    Update state of block job to READY only if it actually is ready


Additional Info
---------------

[Related IRC conversation snippet from #virt, OFTC.  Formatted for
readability.]

[eblake]

  Hmm, the libvirt commit (988218c - "virDomainGetBlockJobInfo: Fix
  corner case when qemu reports no info") you pointed to only fudges the
  0==0 case.  It is NOT fudging the cur==end, "ready": false case when
  cur != 0.
  
  I first have to be convinced that the race is a real problem - if QEMU
  reports len==offset and "ready": false, how much data can be lost?

[kashyap] Data is not lost, but the pivot fails in case of
          blockRebase().

[eblake]

  So the race IS biting us. Libvirt finds cur==end AND sends a pivot
  request, all in the window before QEMU would have sent ready:true.
  And the pivot request fails because it requires "ready" :true.

[kashyap] eblake: Exactly.

[eblake]

  It would be nice if QEMU were fixed to never send cur==end with
  "ready": false.  But since it does, it sounds like libvirt needs a BZ
  to fudge (easiest fudge is cur--) if "ready" :false

Comment 1 Kashyap Chamarthy 2016-10-05 23:28:12 UTC
Related Info
------------
The below recent commit mangles the libvirt cur/end values for the case
of cur == end == 0.  (Noting Eric's emphasis from Description that : "It
is NOT fudging the cur==end, "ready": false case when cur != 0")

  commit 988218ca3f0de571d97d02365db9ffa0845c18da
  Author: Michal Privoznik <mprivozn>
  Date:   Fri Sep 2 09:45:44 2016 +0200
  
      virDomainGetBlockJobInfo: Fix corner case when qemu reports no info
      
      https://bugzilla.redhat.com/show_bug.cgi?id=1372613
      
      Apparently, some management applications use the following code
      pattern when waiting for a block job to finish:
      
        while (1) {
          virDomainGetBlockJobInfo(dom, disk, info, flags);
      
          if (info.cur == info.end)
              break;
      
          sleep(1);
        }
      
      Problem with this approach is in its corner cases. In case of
      QEMU, libvirt merely pass what has been reported on the monitor.
      However, if the block job hasn't started yet, qemu reports cur ==
      end == 0 which tricks mgmt apps into thinking job is complete.
      
      The solution is to mangle cur/end values as described here [1].
      
      1: https://www.redhat.com/archives/libvir-list/2016-September/msg00017.html
      
      Signed-off-by: Michal Privoznik <mprivozn>


[1] http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=988218c

Comment 2 Peter Krempa 2016-10-06 11:39:49 UTC
The block job ready state is exposed via the domain XML for disk entries that have active copy jobs.

Additionally libvirt provides async events that are triggered once the block job reaches ready state and successfully pivots as it is reported by qemu.

This allows apps to figure out the precise moment when the pivot operation is safe and allow refresh the state in case there was potential downtime of the management app.

As of such polling virDomainGetBlockJobInfo for other than progress information is not necessary.

Comment 3 Kashyap Chamarthy 2016-10-06 12:49:34 UTC
Re-posting the response from Peter from the mailing list thread where he notes how the READY event is exposed via XML:

  https://www.redhat.com/archives/libvir-list/2016-October/msg00227.html

---------------------------------------------------------------------
We expose the state of the copy job in the XML and forward the READY
event from qemu to the users.

A running copy job exposes itself in the xml as:

    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/>
      <backingStore/>
      <mirror type='file' file='/tmp/ble.img' format='raw' job='copy'>
        <format type='raw'/>
        <source file='/tmp/ble.img'/>
      </mirror>
      [...]
    </disk>

While the ready copy job is exposed as:

    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/>
      <backingStore/>
      <mirror type='file' file='/tmp/ble.img' format='raw' job='copy' ready='yes'>
        <format type='raw'/>
        <source file='/tmp/ble.img'/>
      </mirror>
      [...]
    </disk>
---------------------------------------------------------------------

Comment 4 Kashyap Chamarthy 2016-10-06 14:51:55 UTC
[Given Eric's comments on the upstream mailing list thread, re-opening for further debate.]

https://www.redhat.com/archives/libvir-list/2016-October/msg00249.html

---------------------------------------------------------------------
On Thu, Oct 06, 2016 at 09:25:26AM -0500, Eric Blake wrote:
> On Thu, Oct 06, 2016 at 01:34:59PM +0200, Peter Krempa wrote:
>> On Thu, Oct 06, 2016 at 12:26:50 +0200, Kashyap Chamarthy wrote:

>> QMP `query-block-jobs` command's response, indicating that the job has
>> actually completed], however the pivot request fails because it requires
>> "ready": true.
> 
> The problem is that you are polling the block job info which correctly
> reports that all data was properly copied and you are inferring the
> block job state from that data.

But the problem here is that qemu is NOT accurately reporting data - it
is reporting cur==end with the promise that they are only equal if the
job is stable, WHEN THE JOB IS NOT STABLE.

> 
> I'm against deliberately reporting false data in the block info
> structure.

We are NOT falsifying any information, any more than we are falsifying
information by changing cur/end to 0/1 when ready:false and qemu
reported 0/0.  (see commit 988218ca).

> 
> The application should register handlers for the block job events and
> act only if it receives such event. Additionally you may want to check
> that the state is correct in the XML. The current block job information
> structure can't be extended unfortunately.

Yes, changing Nova to use event handlers is a good idea.  But I'm ALSO
in favor of fixing libvirt to work around the qemu bug, by intentionally
munging the output to state cur<end (even if qemu reported cur==end) if
qemu reports ready:false.
---------------------------------------------------------------------

Comment 5 Kashyap Chamarthy 2016-10-07 10:02:58 UTC
Following from mailing list discussion, the (reluctant?) conclusion is to fix virDomainGetBlockJobInfo() "to quit reporting cur==end when ready:false."

https://www.redhat.com/archives/libvir-list/2016-October/msg00287.html

---------------------------------------------------------------------
On Fri, Oct 07, 2016 at 09:09:06AM +0200, Peter Krempa wrote:
> On Thu, Oct 06, 2016 at 15:51:38 -0500, Eric Blake wrote:

[...]

> > The existing virDomainGetBlockJobInfo() can't be extended, but it can be
> > fixed to quit reporting cur==end when ready:false.
> 
> Yes, I agree about this one (although I don't really like it [1]), but
> this one will actually fix software not listening for events without any
> change.
> 
> Any new API would not help since the apps would need to change anyways
> thus can use the current correct approach right away even with older
> libvirt versions.
> 
> Peter
> 
> [1]: I'm expecting users to start complaining: "Why is this last byte of
> my image taking so long to copy after the rest copied pretty quickly".
---------------------------------------------------------------------

Comment 6 Kashyap Chamarthy 2016-11-02 11:42:35 UTC
(In reply to Kashyap Chamarthy from comment #5)

[...]

> > [1]: I'm expecting users to start complaining: "Why is this last byte of
> > my image taking so long to copy after the rest copied pretty quickly".

Eric Blake writes[1]:

<quote>
And our response is "We never promised that cur and end are bytes, only
rough status indicators. And we don't know why qemu is taking so long -
move the bug report to them" - if the user can even see this state long
enough for it to bother them.  (Nova is hitting it, because it is a
software-triggered reaction time, not a human in the mix; my
understanding is that it is still at most a fraction of a second where
we'd even have to do the fudging).
</quote>

[1] https://www.redhat.com/archives/libvir-list/2016-October/msg00347.html
---------------------------------------------------------------------

Comment 11 Peter Krempa 2019-01-29 16:16:14 UTC
commit 38757744c2e06ace3dc95f31ac1ab0c9c45ba6b7
Author: Peter Krempa <pkrempa>
Date:   Mon Jan 21 16:01:57 2019 +0100

    lib: domain: Emphasise that users should wait for block job READY state via events
    
    The transition to the ready state is best observed by events as it's
    ansynchronous and does not hint users to do polling. As currently only
    the qemu driver supports block copy and block commit and the ready state
    event was introduced by qemu 1.3 we can fully switch to the new
    approach.

Comment 12 Peter Krempa 2019-01-31 12:11:54 UTC
commit 73ce3911aa0404f070468ae00c78369a5b1ac071
Author: Peter Krempa <pkrempa>
Date:   Tue Jan 29 17:17:29 2019 +0100

    qemu: blockjob: Don't report block job progress at 100% if job isn't ready
    
    Some clients poll virDomainGetBlockJobInfo rather than wait for the
    VIR_DOMAIN_BLOCK_JOB_READY event. In some cases qemu can get to 100% and
    still not reach the synchronised phase. Initiating a pivot in that case
    will fail.
    
    Given that computers are interacting here, the error that the job
    can't be finalized yet is not handled very well by those specific
    implementations.
    
    Our docs now correctly state to use the event. We already do a similar
    output adjustment in case when the progress is not available from qemu
    as in that case we'd report 0 out of 0, which some apps also incorrectly
    considered as 100% complete.
    
    In this case we subtract 1 from the progress if the ready state is not
    signalled by qemu if the progress was at 100% otherwise.

v5.0.0-180-g73ce3911aa


Note You need to log in before you can comment on or make changes to this bug.