Bug 1119385
| Summary: | The default behavior of abort block job with pivot flag isn't synchronous | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Eric Blake <eblake> |
| Component: | libvirt | Assignee: | Eric Blake <eblake> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.6 | CC: | dyuan, eblake, mjenner, mzhan, pkrempa, rbalakri, shyu, xuhj, zhwang |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.10.2-41.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1119173 | | |
| : | 1119387 (view as bug list) | Environment: | |
| Last Closed: | 2014-10-14 04:23:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1119173 | | |
| Bug Blocks: | 1119387 | | |
Description
Eric Blake
2014-07-14 16:25:49 UTC
Comment 3, Shanzhi Yu:

Reproduced this bug with libvirt-0.10.2-40.el6.x86_64:

    # time virsh blockjob rhel6 vda --pivot

    real 0m0.231s
    user 0m0.024s
    sys 0m0.014s

Verified with libvirt-0.10.2-41.el6.x86_64:

    # time virsh blockjob rhel6 vda --pivot

    real 0m32.676s
    user 0m0.022s
    sys 0m0.011s

virDomainBlockJobAbort with the VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag became synchronous with this fix, so I would set this bug to VERIFIED status. Eric, please confirm this before I change it to VERIFIED.

Comment 4, Eric Blake:

(In reply to Shanzhi Yu from comment #3)
> Reproduce this bug with libvirt-0.10.2-40.el6.x86_64
>
> # time virsh blockjob rhel6 vda --pivot
>
> real 0m0.231s
> user 0m0.024s
> sys 0m0.014s
>
> Verify with libvirt-0.10.2-41.el6.x86_64
>
> # time virsh blockjob rhel6 vda --pivot
>
> real 0m32.676s
> user 0m0.022s
> sys 0m0.011s
>
> virDomainBlockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag became
> sync with this fix. So I would set this bug to VERIFIED status.

What you've listed merely shows that the command took longer (30 seconds longer!). But since libvirt polls for job completion every 50 ms at the API level and every 500 ms at the virsh level, the only explanation for it taking this much longer is that you must have done something to artificially slow qemu down so that it could not complete that fast. I'd feel a bit more comfortable knowing what you did to slow qemu down, so that we are sure libvirt isn't waiting too long.

Verifying the bug might be easier if you can turn libvirt debug logs on and prove that, pre-patch, the completion event occurs after control is returned to the user, while post-patch libvirt polls job status before returning control to the user.
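[Editorial note] The synchronous behavior described above (poll the job status every 50 ms until the job is gone, only then return control to the caller) can be sketched in pure Python. The `FakeBlockJob` class is a stand-in for illustration only, not part of the libvirt API:

```python
import time

POLL_INTERVAL = 0.05  # 50 ms, matching the API-level polling interval described above


class FakeBlockJob:
    """Stand-in for a qemu block job; reports 'active' for a fixed number of polls."""

    def __init__(self, polls_until_done):
        self.polls_until_done = polls_until_done

    def info(self):
        """Mimics a job-status query: truthy while the job is still active."""
        self.polls_until_done -= 1
        return self.polls_until_done > 0


def pivot_and_wait(job, timeout=5.0):
    """Post-patch (synchronous) behavior: after requesting the pivot, poll the
    job status until it disappears instead of returning immediately (pre-patch)."""
    deadline = time.monotonic() + timeout
    while job.info():  # job still active?
        if time.monotonic() > deadline:
            raise TimeoutError("block job did not complete")
        time.sleep(POLL_INTERVAL)
    return True  # control returns only after the job has completed


print(pivot_and_wait(FakeBlockJob(polls_until_done=3)))  # → True
```

This is why the post-patch `virsh blockjob --pivot` run above takes as long as the job itself does: the wall-clock time of the command now includes the wait for completion.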
Another idea: if you are artificially slowing down qemu's job completion, then both pre-patch and post-patch should show roughly the same amount of time between requesting the pivot and when an event is delivered (alas, the example Python script that libvirt ships for listening to events was not coded to display blockjob events as of the RHEL build, and it also pre-dates 'virsh event', so you'd have to do a bit of work coding your own event listener), but with a difference in whether the virsh command returns immediately or blocks until after the event.

Comment 5, Shanzhi Yu:

(In reply to Eric Blake from comment #4)
> (In reply to Shanzhi Yu from comment #3)
> > virDomainBlockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag became
> > sync with this fix. So I would set this bug to VERIFIED status.
>
> What you've listed merely shows that the command took longer (30 seconds
> longer!). But since libvirt is polling for the job completion every 50 ms
> at the API level and every 500 ms at the virsh level, the only explanation
> for being this much longer is that you must have been doing something to
> artificially slow down qemu to not complete that fast. I'd feel a bit more
> comfortable knowing what you did to slow qemu down, so that we are sure
> libvirt isn't waiting too long.

Yes. After the blockcopy job is done (in the mirroring phase), I log into the guest and dirty the filesystem with "dd if=/dev/zero of=/mnt/test.img bs=1M count=400", then I run "blockjob --pivot".

> Verifying the bug might be easier if you can turn libvirt debug logs on, and
> prove that pre-patch, the completion event is occurring after control is
> returned to the user, while post-patch libvirt is polling job status before
> returning control to the user.
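[Editorial note] The hand-coded event listener suggested in comment 4 amounts to registering a callback for the block-job event and blocking until it fires. Real code would use libvirt-python's `conn.domainEventRegisterAny(dom, libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_JOB, callback, None)` plus `libvirt.virEventRunDefaultImpl()` in a loop; the sketch below simulates that pattern with a plain queue so it stays self-contained:

```python
import threading
import queue

# Simulated event bus standing in for libvirt's event loop.  All names here
# are stand-ins; only the comments refer to the real libvirt-python API.
events = queue.Queue()


def on_block_job(disk, job_type, status):
    """Callback (like a VIR_DOMAIN_EVENT_ID_BLOCK_JOB handler): record the
    event so the waiting thread can compare its arrival time with the moment
    the pivot command returned."""
    events.put((disk, job_type, status))


def fake_qemu_finishes_pivot():
    """Stand-in for qemu delivering a job-completed event once the mirror drains."""
    on_block_job("vda", "copy", "completed")


# Simulate the event arriving 100 ms after the pivot was requested.
threading.Timer(0.1, fake_qemu_finishes_pivot).start()

# The listener blocks here.  Pre-patch, virsh returns *before* this event is
# delivered (async); post-patch, virsh returns only *after* it (sync).
disk, job_type, status = events.get(timeout=2.0)
print(disk, status)  # → vda completed
```

Comparing the timestamp of the delivered event against the return of `virsh blockjob --pivot` is exactly the pre-patch/post-patch distinction Eric describes.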
After I turn debug logs on, with 0.10.2-41.el6.x86_64 I can see the line

    2014-07-27 09:39:40.104+0000: 33017: debug : qemuMonitorBlockJob:3046 : mon=0x7fbea4008610, device=drive-virtio-disk0, base=(null), bandwidth=0M, info=0x7fbeb95bb940, mode=1, modern=1

while with 0.10.2-40.el6.x86_64 I can't find any "qemuMonitorBlockJob" line. And there are "query-block-jobs" entries in the post-patch libvirt debug log, while pre-patch there are no such entries.

Should this also prove the patch works? If not, can you show me more details on how to verify this bug? Thanks.

> Another idea: if you are artificially slowing down qemu's job completion,
> then both pre-patch and post-patch should show roughly the same amount of
> time between requesting the pivot and when an event is delivered (alas, the
> example python script that libvirt ships for listening to events was not
> coded to display blockjob events as of the RHEL build, and that also
> pre-dates 'virsh event'; so you'd have to do a bit of work coding your own
> event listener), but with a difference of whether the virsh command returns
> immediately or blocks until after the event.

Comment 6, Eric Blake:

(In reply to Shanzhi Yu from comment #5)
> (In reply to Eric Blake from comment #4)
> > (In reply to Shanzhi Yu from comment #3)
> > > virDomainBlockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag became
> > > sync with this fix. So I would set this bug to VERIFIED status.
> >
> > What you've listed merely shows that the command took longer (30 seconds
> > longer!). But since libvirt is polling for the job completion every 50 ms
> > at the API level and every 500 ms at the virsh level, the only explanation
> > for being this much longer is that you must have been doing something to
> > artificially slow down qemu to not complete that fast. I'd feel a bit more
> > comfortable knowing what you did to slow qemu down, so that we are sure
> > libvirt isn't waiting too long.
>
> Yes.
> After the blockcopy job is done (in the mirroring phase), I log into the guest
> and dirty the filesystem with "dd if=/dev/zero of=/mnt/test.img bs=1M count=400",
> then I run "blockjob --pivot".

Okay, so you were using an I/O-intensive dd operation in the guest as something that would dirty the file system fast enough to cause a noticeable delay in the command's completion. Nice trick; I'll have to use it myself when testing further patches.

> > Verifying the bug might be easier if you can turn libvirt debug logs on, and
> > prove that pre-patch, the completion event is occurring after control is
> > returned to the user, while post-patch libvirt is polling job status before
> > returning control to the user.
>
> After I turn debug logs on, with 0.10.2-41.el6.x86_64 I can see the line
> "2014-07-27 09:39:40.104+0000: 33017: debug : qemuMonitorBlockJob:3046 :
> mon=0x7fbea4008610, device=drive-virtio-disk0, base=(null), bandwidth=0M,
> info=0x7fbeb95bb940, mode=1, modern=1" while with 0.10.2-40.el6.x86_64 I
> can't find any "qemuMonitorBlockJob" line.
>
> And there are "query-block-jobs" entries in the post-patch libvirt debug log,
> while pre-patch there are no such entries.
>
> Should this also prove the patch works? If not, can you show me more
> details on how to verify this bug?

Those log entries are good: they prove that libvirt was polling for job completion. I think you have verified the bug.

Shanzhi Yu:

Eric, thanks. Changing this bug to VERIFIED status.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html