Description of problem ---------------------- [Thanks to Eric Blake for the discussion on #virt, OFTC IRC, and for exposing this bug.] When virDomainBlockRebase() is invoked to start a copy job, then aborting the said copy operation with virDomainBlockJobAbort() + flag VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT can result in a potential race condition (due to the way the virDomainGetBlockJobInfo() reports the job status) where the pivot operation fails. Race condition window --------------------- libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent "ready": true field [emitted as part of the QMP `query-block-jobs` command's response, indicating that the job has actually completed], however the pivot request fails because it requires "ready": true. So Eric suggests: QEMU 2.0 or 1.x probably had a synchronous setup where you could never observer cur==end on a non-ready job. But I don't remember enough history to point to when QEMU switched jobs to be a bit more asynchronous. Maybe there was no qemu regression - maybe it was BECAUSE of other block-job additions in 2.2 that offset==len was no longer reliable. I don't know that for sure. But what it DOES sound like is that IF qemu reports "ready": false, offset==len is not reliable, and libvirt should be taught to fudge that. And hopefully, QEMU too old to report "ready:" at all is reliable with regards to offset==len, because that's all we have to go by. * * * The "ready" boolean field (for `query-block-jobs`) was introduced via QEMU commit ef6dbf1 (available in QEMU v2.2.0 or above): http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ef6dbf1e4 -- blockjob: Add "ready" field And, libvirt was fixed to use the above field in this commit (available in libvirt v1.2.18 or above): libvirt.org/git/?p=libvirt.git;a=commitdiff;h=eae5924 -- qemu: Update state of block job to READY only if it actually is ready Additional Info --------------- [Related IRC conversation snippet from #virt, OFTC. Formatted for readability.] [eblake] Hmm, the libvirt commit (988218c - "virDomainGetBlockJobInfo: Fix corner case when qemu reports no info") you pointed to only fudges the 0==0 case. It is NOT fudging the cur==end, "ready": false case when cur != 0. I first have to be convinced that the race is a real problem - if QEMU reports len==offset and "ready": false, how much data can be lost? [kashyap] Data is not lost, but the pivot fails in case of blockRebase(). [eblake] So the race IS biting us. Libvirt finds cur==end AND sends a pivot request, all in the window before QEMU would have sent ready:true. And the pivot request fails because it requires "ready" :true. [kashyap] eblake: Exactly. [eblake] It would be nice if QEMU were fixed to never send cur==end with "ready": false. But since it does, it sounds like libvirt needs a BZ to fudge (easiest fudge is cur--) if "ready" :false
Related Info ------------ The below recent commit mangles the libvirt cur/end values for the case of cur == end == 0. (Noting Eric's emphasis from Description that : "It is NOT fudging the cur==end, "ready": false case when cur != 0") commit 988218ca3f0de571d97d02365db9ffa0845c18da Author: Michal Privoznik <mprivozn> Date: Fri Sep 2 09:45:44 2016 +0200 virDomainGetBlockJobInfo: Fix corner case when qemu reports no info https://bugzilla.redhat.com/show_bug.cgi?id=1372613 Apparently, some management applications use the following code pattern when waiting for a block job to finish: while (1) { virDomainGetBlockJobInfo(dom, disk, info, flags); if (info.cur == info.end) break; sleep(1); } Problem with this approach is in its corner cases. In case of QEMU, libvirt merely pass what has been reported on the monitor. However, if the block job hasn't started yet, qemu reports cur == end == 0 which tricks mgmt apps into thinking job is complete. The solution is to mangle cur/end values as described here [1]. 1: https://www.redhat.com/archives/libvir-list/2016-September/msg00017.html Signed-off-by: Michal Privoznik <mprivozn> [1] http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=988218c
The block job ready state is exposed via the domain XML for disk entries that have active copy jobs. Additionally libvirt provides async events that are triggered once the block job reaches ready state and successfully pivots as it is reported by qemu. This allows apps to figure out the precise moment when the pivot operation is safe and allow refresh the state in case there was potential downtime of the management app. As of such polling virDomainGetBlockJobInfo for other than progress information is not necessary.
Re-posting the response from Peter from the mailing list thread where he notes how the READY event is exposed via XML: https://www.redhat.com/archives/libvir-list/2016-October/msg00227.html --------------------------------------------------------------------- We expose the state of the copy job in the XML and forward the READY event from qemu to the users. A running copy job exposes itself in the xml as: <disk type='file' device='cdrom'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/> <backingStore/> <mirror type='file' file='/tmp/ble.img' format='raw' job='copy'> <format type='raw'/> <source file='/tmp/ble.img'/> </mirror> [...] </disk> While the ready copy job is exposed as: <disk type='file' device='cdrom'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/systemrescuecd-x86-4.8.0.iso'/> <backingStore/> <mirror type='file' file='/tmp/ble.img' format='raw' job='copy' ready='yes'> <format type='raw'/> <source file='/tmp/ble.img'/> </mirror> [...] </disk> ---------------------------------------------------------------------
[Given Eric's comments on the upstream mailing list thread, re-opening for further debate.] https://www.redhat.com/archives/libvir-list/2016-October/msg00249.html --------------------------------------------------------------------- On Thu, Oct 06, 2016 at 09:25:26AM -0500, Eric Blake wrote: > On Thu, Oct 06, 2016 at 01:34:59PM +0200, Peter Krempa wrote: >> On Thu, Oct 06, 2016 at 12:26:50 +0200, Kashyap Chamarthy wrote: >> QMP `query-block-jobs` command's response, indicating that the job has >> actually completed], however the pivot request fails because it requires >> "ready": true. > > The problem is that you are polling the block job info which correctly > reports that all data was properly copied and you are inferring the > block job state from that data. But the problem here is that qemu is NOT accurately reporting data - it is reporting cur==end with the promise that they are only equal if the job is stable, WHEN THE JOB IS NOT STABLE. > > I'm against deliberately reporting false data in the block info > structure. We are NOT falsifying any information, any more than we are falsifying information by changing cur/end to 0/1 when ready:false and qemu reported 0/0. (see commit 988218ca). > > The application should register handlers for the block job events and > act only if it receives such event. Additionally you may want to check > that the state is correct in the XML. The current block job information > structure can't be extended unfortunately. Yes, changing Nova to use event handlers is a good idea. But I'm ALSO in favor of fixing libvirt to work around the qemu bug, by intentionally munging the output to state cur<end (even if qemu reported cur==end) if qemu reports ready:false. ---------------------------------------------------------------------
Following from mailing list discussion, the (reluctant?) conclusion is to fix virDomainGetBlockJobInfo() "to quit reporting cur==end when ready:false." https://www.redhat.com/archives/libvir-list/2016-October/msg00287.html --------------------------------------------------------------------- On Fri, Oct 07, 2016 at 09:09:06AM +0200, Peter Krempa wrote: > On Thu, Oct 06, 2016 at 15:51:38 -0500, Eric Blake wrote: [...] > > The existing virDomainGetBlockJobInfo() can't be extended, but it can be > > fixed to quit reporting cur==end when ready:false. > > Yes, I agree about this one (although I don't really like it [1]), but > this one will actually fix software not listening for events without any > change. > > Any new API would not help since the apps would need to change anyways > thus can use the current correct approach right away even with older > libvirt versions. > > Peter > > [1]: I'm expecting users to start complaining: "Why is this last byte of > my image taking so long to copy after the rest copied pretty quickly". ---------------------------------------------------------------------
(In reply to Kashyap Chamarthy from comment #5) [...] > > [1]: I'm expecting users to start complaining: "Why is this last byte of > > my image taking so long to copy after the rest copied pretty quickly". Eric Blake writes[1]: <quote> And our response is "We never promised that cur and end are bytes, only rough status indicators. And we don't know why qemu is taking so long - move the bug report to them" - if the user can even see this state long enough for it to bother them. (Nova is hitting it, because it is a software-triggered reaction time, not a human in the mix; my understanding is that it is still at most a fraction of a second where we'd even have to do the fudging). </quote> [1] https://www.redhat.com/archives/libvir-list/2016-October/msg00347.html ---------------------------------------------------------------------
commit 38757744c2e06ace3dc95f31ac1ab0c9c45ba6b7 Author: Peter Krempa <pkrempa> Date: Mon Jan 21 16:01:57 2019 +0100 lib: domain: Emphasise that users should wait for block job READY state via events The transition to the ready state is best observed by events as it's ansynchronous and does not hint users to do polling. As currently only the qemu driver supports block copy and block commit and the ready state event was introduced by qemu 1.3 we can fully switch to the new approach.
commit 73ce3911aa0404f070468ae00c78369a5b1ac071 Author: Peter Krempa <pkrempa> Date: Tue Jan 29 17:17:29 2019 +0100 qemu: blockjob: Don't report block job progress at 100% if job isn't ready Some clients poll virDomainGetBlockJobInfo rather than wait for the VIR_DOMAIN_BLOCK_JOB_READY event. In some cases qemu can get to 100% and still not reach the synchronised phase. Initiating a pivot in that case will fail. Given that computers are interacting here, the error that the job can't be finalized yet is not handled very well by those specific implementations. Our docs now correctly state to use the event. We already do a similar output adjustment in case when the progress is not available from qemu as in that case we'd report 0 out of 0, which some apps also incorrectly considered as 100% complete. In this case we subtract 1 from the progress if the ready state is not signalled by qemu if the progress was at 100% otherwise. v5.0.0-180-g73ce3911aa