Bug 1372613 - Improve live block device job status reporting via virDomainBlockJobInfo()
Summary: Improve live block device job status reporting via virDomainBlockJobInfo()
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Michal Privoznik
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1443493
TreeView+ depends on / blocked
 
Reported: 2016-09-02 08:23 UTC by Kashyap Chamarthy
Modified: 2017-05-01 14:49 UTC (History)
5 users (show)

Fixed In Version: libvirt-2.3.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1443493 (view as bug list)
Environment:
Last Closed: 2016-09-14 10:47:41 UTC
Embargoed:


Attachments (Terms of Use)

Description Kashyap Chamarthy 2016-09-02 08:23:49 UTC
In OpenStack Nova, we're trying to analyze a potential race
condition[0].  The operation flow is something like: perform a live
shallow blockRebase(), check for progress with blockJobInfo(), followed
by a blockJobAbort() (QMP 'block-job-cancel') to get a live
point-in-time snapshot, then convert the resulting shallow copy with a
backing file into a flattenned image with (`qemu-img convert`), and Nova
uploads that into the image repository (Glance).  Nova uses some
wrappers around the above APIs, and calls them this way, in its
_live_snapshot() method[1]:

            [...]
            dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)

            while dev.wait_for_job():
                time.sleep(0.5)

            dev.abort_job()
            [...]

Looking at a failed blockRebase() operation from libvirt debug log, the
root cause turns out to be libvirt issuing QMP command
'block-job-cancel' (blockJobAbort()) when the block device job status
is "ready": false, which results in a corrupted destination file.  (In
a successful case, QMP command 'block-job-cancel' should be issued when
the job status is: "ready": true).  

My libvirtd log analysis in the Nova bug is here[2] from a failed case,
which has QMP traffic back-n-forth.

    - - -

So, I'm trying to understand how libvirt reports the "cur" and "end"
values.  I've read the virDomainBlockJobInfo() struct, it wasn't crystal
clear.  It states:

/*
 * The following fields provide an indication of block job progress.  @cur
 * indicates the current position and will be between 0 and @end.  @end is
 * the final cursor position for this operation and represents completion.
 * To approximate progress, divide @cur by @end.
 */

Now, if a job hasn't started, what would libvirt report?  Talking to
Michal Privoznik on IRC:

[mprivozn]

    I wonder if libvirt should report something else than start=0 end=0
    in order to tell the caller that job hasn't been started yet.  I
    suspect that libvirt does not report correctly whether job has
    started already, which in turns forces Nova to use some workarounds.

[kashyap]

    So, if the job hasn't started yet, what should libvirt report? 

[mprivozn]

    That's the question. We can't change the [virDomainBlockJobInfo]
    struct (otherwise we won't be ABI compatible), so we can't really
    add a bolean there 'bool job_started'.

    Or we can introduce new "job type" which wouldn't really be a job
    type, but we will fill status.job with it to say explicitly job
    hasn't started yet


    - - -

I realize that a copy job has two phases, as clearly stated in the API
documenation of virDomainBlockRebase()[3]:

  "[...] The job transitions to the second phase when the job info
  states cur == end, and remains alive to mirror all further changes to
  both source and destination.  The user must call
  virDomainBlockJobAbort() to end the mirroring while choosing whether
  to revert to source or pivot to the destination. [...]"

Comment 1 Kashyap Chamarthy 2016-09-02 08:29:21 UTC
Eric Blake's response from the upstream libvirt mailing list thread, when I asked for any thoughts on how to improve things here:

http://www.redhat.com/archives/libvir-list/2016-September/msg00017.html

Re-posting it here:

On 09/01/2016 08:57 AM, Kashyap Chamarthy wrote:
> So, I'm trying to understand how libvirt reports the "cur" and "end"
> values.  I've read the virDomainBlockJobInfo() struct, it wasn't crystal
> clear.  It states:
> 
> /*
>  * The following fields provide an indication of block job progress.  @cur
>  * indicates the current position and will be between 0 and @end.  @end is
>  * the final cursor position for this operation and represents completion.
>  * To approximate progress, divide @cur by @end.
>  */

Libvirt is (more or less) reporting numbers from qemu.  What's more,
@end need not be the same between calls; qemu is free to change the end
value as it comes up with more work to do (or sees that less work
remains than initially estimated), all that REALLY matters is that the
ratio between the two numbers converges, and is < 1 while busy, and == 1
when complete.  Except that there are cases in qemu where a block job
really has 0 work to do.

Prior to qemu exposing the "ready":true/false flag, libvirt had to guess
whether equal numbers really meant done, or if it merely meant nearly
done.  But now that we have "ready", I think the sanest course of action
for libvirt is to fudge the numbers from qemu.  After all, we've already
documented (both in libvirt and in qemu) that @end is not fixed, so much
as a moving target (it just doesn't move very much on operations where
we have a good grasp on how much work remains from the start, like a
deep copy; but is more prone to move on operations like live commit that
are influenced by how active the guest is at writing data that we are
attempting to commit at the same time).

> 
> Now, if a job hasn't started, what would libvirt report?  Talking to
> Michal Privoznik on IRC:
> 
> [mprivozn]
> 
>     I wonder if libvirt should report something else than start=0 end=0
>     in order to tell the caller that job hasn't been started yet.  I
>     suspect that libvirt does not report correctly whether job has
>     started already, which in turns forces Nova to use some workarounds.
> 
> [kashyap]
> 
>     So, if the job hasn't started yet, what should libvirt report? 
> 
> [mprivozn]
> 
>     That's the question. We can't change the [virDomainBlockJobInfo]
>     struct (otherwise we won't be ABI compatible), so we can't really
>     add a bolean there 'bool job_started'.
> 
>     Or we can introduce new "job type" which wouldn't really be a job
>     type, but we will fill status.job with it to say explicitly job
>     hasn't started yet
> 
> 
> So, any other thoughts here, on how to proceed here?

My preference would be:

If qemu doesn't report anything (because the job is not started yet),
then libvirt should report cur=0, end=1 (the job still has 100% to go).

If qemu reports 0/0 and "done":false, then libvirt should report cur=0,
end=1 (that is, we fudge the end to be larger, because the job is not
done yet).

If qemu reports 0/0 and "done":true (because the job was really a
no-op), then libvirt should report cur=1, end=1 (the job is 100% complete).

If qemu reports 0/0 and lacks "done" (older qemu), then libvirt just has
to guess.  I'm not sure which guess is most appropriate; maybe libvirt
itself will have to set up a timer and report 0/1 the first time, and
only report 1/1 after a minimum time has elapsed, to make sure qemu has
had a chance to do something about the job.  Or maybe we don't worry
about it, and just have libvirt report 0/0 because we really don't know
any better.

If qemu reports a/b, where a < b and b > 0, use those numbers as is.  We
don't even have to check "done".

If qemu reports a/a, where a > 0, then also check "done". If
"done":false is present, report a-1/a (the job is not quite done); if
"done" is absent or "done":true is present, report a/a (the job is done).


> I realize that a copy job has two phases, as clearly stated in the API
> documenation of virDomainBlockRebase()[3]:
> 
>   "[...] The job transitions to the second phase when the job info
>   states cur == end, and remains alive to mirror all further changes to
>   both source and destination.  The user must call
>   virDomainBlockJobAbort() to end the mirroring while choosing whether
>   to revert to source or pivot to the destination. [...]"

We may also want to enhance the libvirt documentation to mention cur ==
non-zero end as the phase change point, mentioning that 0/0 is a special
case (hopefully reserved for older qemu).

Comment 2 Michal Privoznik 2016-09-14 10:47:41 UTC
I've just pushed the patches upstream:

commit 988218ca3f0de571d97d02365db9ffa0845c18da
Author:     Michal Privoznik <mprivozn>
AuthorDate: Fri Sep 2 09:45:44 2016 +0200
Commit:     Michal Privoznik <mprivozn>
CommitDate: Wed Sep 14 12:44:42 2016 +0200

    virDomainGetBlockJobInfo: Fix corner case when qemu reports no info
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1372613
    
    Apparently, some management applications use the following code
    pattern when waiting for a block job to finish:
    
      while (1) {
        virDomainGetBlockJobInfo(dom, disk, info, flags);
    
        if (info.cur == info.end)
            break;
    
        sleep(1);
      }
    
    Problem with this approach is in its corner cases. In case of
    QEMU, libvirt merely pass what has been reported on the monitor.
    However, if the block job hasn't started yet, qemu reports cur ==
    end == 0 which tricks mgmt apps into thinking job is complete.
    
    The solution is to mangle cur/end values as described here [1].
    
    1: https://www.redhat.com/archives/libvir-list/2016-September/msg00017.html
    
    Signed-off-by: Michal Privoznik <mprivozn>

commit 5d213b34de442490c66ee54b17d15e68cfeb2174
Author:     Michal Privoznik <mprivozn>
AuthorDate: Fri Sep 2 08:38:19 2016 +0200
Commit:     Michal Privoznik <mprivozn>
CommitDate: Wed Sep 14 12:44:42 2016 +0200

    qemuDomainGetBlockJobInfo: Move info translation into separate func
    
    Even though we merely just pass to users whatever qemu provided
    on the monitor, we still do some translation. For instance we
    turn bytes into mebibytes, or fix job type if needed. However, in
    the future there is more fixing to be done so this code deserves
    its own function.
    
    Signed-off-by: Michal Privoznik <mprivozn>

Comment 3 Nir Soffer 2017-04-19 07:08:26 UTC
Michal, is this fix available in 7.3?

We need this for bug 1442266.

Comment 4 Peter Krempa 2017-04-19 07:56:35 UTC
No. Also this is the upstream bugzilla.

Comment 5 Michal Privoznik 2017-05-01 14:49:40 UTC
Clearing up needinfo per comment 4.


Note You need to log in before you can comment on or make changes to this bug.