In OpenStack Nova, we're trying to analyze a potential race condition[0]. The operation flow is something like: perform a live shallow blockRebase(), check for progress with blockJobInfo(), followed by a blockJobAbort() (QMP 'block-job-cancel') to get a live point-in-time snapshot, then convert the resulting shallow copy with a backing file into a flattenned image with (`qemu-img convert`), and Nova uploads that into the image repository (Glance). Nova uses some wrappers around the above APIs, and calls them this way, in its _live_snapshot() method[1]: [...] dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True) while dev.wait_for_job(): time.sleep(0.5) dev.abort_job() [...] Looking at a failed blockRebase() operation from libvirt debug log, the root cause turns out to be libvirt issuing QMP command 'block-job-cancel' (blockJobAbort()) when the block device job status is "ready": false, which results in a corrupted destination file. (In a successful case, QMP command 'block-job-cancel' should be issued when the job status is: "ready": true). My libvirtd log analysis in the Nova bug is here[2] from a failed case, which has QMP traffic back-n-forth. - - - So, I'm trying to understand how libvirt reports the "cur" and "end" values. I've read the virDomainBlockJobInfo() struct, it wasn't crystal clear. It states: /* * The following fields provide an indication of block job progress. @cur * indicates the current position and will be between 0 and @end. @end is * the final cursor position for this operation and represents completion. * To approximate progress, divide @cur by @end. */ Now, if a job hasn't started, what would libvirt report? Talking to Michal Privoznik on IRC: [mprivozn] I wonder if libvirt should report something else than start=0 end=0 in order to tell the caller that job hasn't been started yet. I suspect that libvirt does not report correctly whether job has started already, which in turns forces Nova to use some workarounds. [kashyap] So, if the job hasn't started yet, what should libvirt report? [mprivozn] That's the question. We can't change the [virDomainBlockJobInfo] struct (otherwise we won't be ABI compatible), so we can't really add a bolean there 'bool job_started'. Or we can introduce new "job type" which wouldn't really be a job type, but we will fill status.job with it to say explicitly job hasn't started yet - - - I realize that a copy job has two phases, as clearly stated in the API documenation of virDomainBlockRebase()[3]: "[...] The job transitions to the second phase when the job info states cur == end, and remains alive to mirror all further changes to both source and destination. The user must call virDomainBlockJobAbort() to end the mirroring while choosing whether to revert to source or pivot to the destination. [...]"
Eric Blake's response from the upstream libvirt mailing list thread, when I asked for any thoughts on how to improve things here: http://www.redhat.com/archives/libvir-list/2016-September/msg00017.html Re-posting it here: On 09/01/2016 08:57 AM, Kashyap Chamarthy wrote: > So, I'm trying to understand how libvirt reports the "cur" and "end" > values. I've read the virDomainBlockJobInfo() struct, it wasn't crystal > clear. It states: > > /* > * The following fields provide an indication of block job progress. @cur > * indicates the current position and will be between 0 and @end. @end is > * the final cursor position for this operation and represents completion. > * To approximate progress, divide @cur by @end. > */ Libvirt is (more or less) reporting numbers from qemu. What's more, @end need not be the same between calls; qemu is free to change the end value as it comes up with more work to do (or sees that less work remains than initially estimated), all that REALLY matters is that the ratio between the two numbers converges, and is < 1 while busy, and == 1 when complete. Except that there are cases in qemu where a block job really has 0 work to do. Prior to qemu exposing the "ready":true/false flag, libvirt had to guess whether equal numbers really meant done, or if it merely meant nearly done. But now that we have "ready", I think the sanest course of action for libvirt is to fudge the numbers from qemu. After all, we've already documented (both in libvirt and in qemu) that @end is not fixed, so much as a moving target (it just doesn't move very much on operations where we have a good grasp on how much work remains from the start, like a deep copy; but is more prone to move on operations like live commit that are influenced by how active the guest is at writing data that we are attempting to commit at the same time). > > Now, if a job hasn't started, what would libvirt report? Talking to > Michal Privoznik on IRC: > > [mprivozn] > > I wonder if libvirt should report something else than start=0 end=0 > in order to tell the caller that job hasn't been started yet. I > suspect that libvirt does not report correctly whether job has > started already, which in turns forces Nova to use some workarounds. > > [kashyap] > > So, if the job hasn't started yet, what should libvirt report? > > [mprivozn] > > That's the question. We can't change the [virDomainBlockJobInfo] > struct (otherwise we won't be ABI compatible), so we can't really > add a bolean there 'bool job_started'. > > Or we can introduce new "job type" which wouldn't really be a job > type, but we will fill status.job with it to say explicitly job > hasn't started yet > > > So, any other thoughts here, on how to proceed here? My preference would be: If qemu doesn't report anything (because the job is not started yet), then libvirt should report cur=0, end=1 (the job still has 100% to go). If qemu reports 0/0 and "done":false, then libvirt should report cur=0, end=1 (that is, we fudge the end to be larger, because the job is not done yet). If qemu reports 0/0 and "done":true (because the job was really a no-op), then libvirt should report cur=1, end=1 (the job is 100% complete). If qemu reports 0/0 and lacks "done" (older qemu), then libvirt just has to guess. I'm not sure which guess is most appropriate; maybe libvirt itself will have to set up a timer and report 0/1 the first time, and only report 1/1 after a minimum time has elapsed, to make sure qemu has had a chance to do something about the job. Or maybe we don't worry about it, and just have libvirt report 0/0 because we really don't know any better. If qemu reports a/b, where a < b and b > 0, use those numbers as is. We don't even have to check "done". If qemu reports a/a, where a > 0, then also check "done". If "done":false is present, report a-1/a (the job is not quite done); if "done" is absent or "done":true is present, report a/a (the job is done). > I realize that a copy job has two phases, as clearly stated in the API > documenation of virDomainBlockRebase()[3]: > > "[...] The job transitions to the second phase when the job info > states cur == end, and remains alive to mirror all further changes to > both source and destination. The user must call > virDomainBlockJobAbort() to end the mirroring while choosing whether > to revert to source or pivot to the destination. [...]" We may also want to enhance the libvirt documentation to mention cur == non-zero end as the phase change point, mentioning that 0/0 is a special case (hopefully reserved for older qemu).
I've just pushed the patches upstream: commit 988218ca3f0de571d97d02365db9ffa0845c18da Author: Michal Privoznik <mprivozn> AuthorDate: Fri Sep 2 09:45:44 2016 +0200 Commit: Michal Privoznik <mprivozn> CommitDate: Wed Sep 14 12:44:42 2016 +0200 virDomainGetBlockJobInfo: Fix corner case when qemu reports no info https://bugzilla.redhat.com/show_bug.cgi?id=1372613 Apparently, some management applications use the following code pattern when waiting for a block job to finish: while (1) { virDomainGetBlockJobInfo(dom, disk, info, flags); if (info.cur == info.end) break; sleep(1); } Problem with this approach is in its corner cases. In case of QEMU, libvirt merely pass what has been reported on the monitor. However, if the block job hasn't started yet, qemu reports cur == end == 0 which tricks mgmt apps into thinking job is complete. The solution is to mangle cur/end values as described here [1]. 1: https://www.redhat.com/archives/libvir-list/2016-September/msg00017.html Signed-off-by: Michal Privoznik <mprivozn> commit 5d213b34de442490c66ee54b17d15e68cfeb2174 Author: Michal Privoznik <mprivozn> AuthorDate: Fri Sep 2 08:38:19 2016 +0200 Commit: Michal Privoznik <mprivozn> CommitDate: Wed Sep 14 12:44:42 2016 +0200 qemuDomainGetBlockJobInfo: Move info translation into separate func Even though we merely just pass to users whatever qemu provided on the monitor, we still do some translation. For instance we turn bytes into mebibytes, or fix job type if needed. However, in the future there is more fixing to be done so this code deserves its own function. Signed-off-by: Michal Privoznik <mprivozn>
Michal, is this fix available in 7.3? We need this for bug 1442266.
No. Also this is the upstream bugzilla.
Clearing up needinfo per comment 4.