Bug 1227551
| Summary: | [blockcommit]Wrong blockcommit process with gluster based disk when the network connection was broken | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pzhang> | |
| Component: | libvirt | Assignee: | Peter Krempa <pkrempa> | |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 7.2 | CC: | dyuan, mzhan, pkrempa, rbalakri, shyu, xuzhang, yanyang | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | libvirt-1.2.17-3.el7 | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1235004 | Environment: | ||
| Last Closed: | 2015-11-19 06:39:45 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1200862, 1235004 | |||
| Bug Blocks: | ||||
Issue in step 4 is fixed with:
commit e7d3ff8464ed4833fa9c9bd9ef1f613f04434b31
Author: Peter Krempa <pkrempa>
Date: Fri Jun 19 15:43:02 2015 +0200
virsh: blockcopy: Report error if the copy job fails
When the block job would fail while watching it using the "--wait"
option for blockcopy, virsh would rather unhelpfully report:

  $ virsh blockcopy vm hdc /tmp/raw.img --granularity 4096 --verbose --wait
  Now in mirroring phase

Add a special case when the block job vanishes while waiting for it to
finish to improve the message:

  $ virsh blockcopy vm hdc /tmp/raw.img --granularity 8192 --verbose --wait
  error: Block Copy unexpectedly failed

v1.2.16-265-ge7d3ff8
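The special case described in this commit can be illustrated with a small model. This is a minimal Python sketch, not virsh's C code: `wait_for_copy`, `make_poll` and the `poll` callable are hypothetical names, the job is simulated by a list of `(cur, end)` tuples, and `None` stands for "no block job exists any more". The completion test (`cur == end`) is deliberately simplistic.

```python
from typing import Callable, Optional, Tuple

JobInfo = Optional[Tuple[int, int]]  # (cur, end), or None when no job exists


def wait_for_copy(poll: Callable[[], JobInfo]) -> str:
    """Poll a simulated block copy job until it reaches a final state.

    Returns "mirroring" once cur == end, or "failed" when the job
    disappears before reaching that point.
    """
    while True:
        info = poll()
        if info is None:
            # The job vanished while we were still waiting for it:
            # report a failure instead of assuming success.
            return "failed"
        cur, end = info
        if end > 0 and cur == end:
            return "mirroring"


def make_poll(states):
    """Build a poll() that replays a fixed list of simulated states."""
    it = iter(states)
    return lambda: next(it)


# Job runs to completion: the copy enters the mirroring phase.
ok = wait_for_copy(make_poll([(2, 100), (50, 100), (100, 100)]))
# Job vanishes at 4 %, e.g. because the storage became unreachable.
broken = wait_for_copy(make_poll([(2, 100), (4, 100), None]))
print(ok, broken)
```

The point of the sketch is only that a vanished job and a finished job are separate outcomes and need separate messages.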
I've filed https://bugzilla.redhat.com/show_bug.cgi?id=1235004 to track the issue described in comment 5, thus I'm moving this one to POST.

Peter,
The patch described in comment #2 only fixes the blockcopy issue; the blockcommit and blockpull issues are not fixed. blockcommit and blockpull always report:
Block Commit: [100 %] Now in synchronized phase

Repro steps:

Scenario 1: disconnect from the gluster server while a blockcopy job is running -- PASSED
1. # virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [ 2 %]
2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP
3. Check the blockcopy status after a few seconds:
# virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [ 4 %]error: Block Copy unexpectedly failed

Scenario 2: disconnect from the gluster server while a blockcommit job is running -- FAILED
1. # virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [ 13 %]
2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP
3. Check the blockcommit status after a while:
# virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [100 %]
Now in synchronized phase
# virsh dumpxml vm3 | grep mirror

Scenario 3: disconnect from the gluster server while a blockpull job is running -- FAILED
1. # virsh blockpull vm3 vda --wait --verbose
Block Pull: [ 20 %]
2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP
3. Check the blockpull status after a while:
# virsh blockpull vm3 vda --wait --verbose
Block Pull: [100 %]
Pull complete

Indeed, I'll follow up to fix all other uses.

Fixed upstream:
commit faa143912381aa48e33839b194b32cc14d574589
Author: Peter Krempa <pkrempa>
Date: Mon Jul 13 17:04:49 2015 +0200
virsh: Refactor block job waiting in cmdBlockCopy
Similarly to the refactor of cmdBlockCommit in a previous commit this
does the same change for cmdBlockCopy.
commit 7408403560f7d054da75acaab855a95c51a92e2b
Author: Peter Krempa <pkrempa>
Date: Mon Jul 13 17:04:49 2015 +0200
virsh: Refactor block job waiting in cmdBlockCommit
Reuse the vshBlockJobWait infrastructure to refactor cmdBlockCommit to
use the common code. This additionally fixes a bug when working with
new qemus, where when doing an active commit with --pivot the pivoting
would fail, since qemu reaches 100% completion but the job doesn't
switch to synchronized phase right away.
commit 2e7827636476fdf976f17cd234b636973dedffc0
Author: Peter Krempa <pkrempa>
Date: Mon Jul 13 17:04:49 2015 +0200
virsh: Refactor block job waiting in cmdBlockPull
Introduce helper function that will provide logic for waiting for block
job completion so the 3 open coded places can be unified and improved.
This patch introduces the whole logic and uses it to fix
cmdBlockJobPull. The vshBlockJobWait function provides common logic for
block job waiting that should be robust enough to work across all
previous versions of libvirt. Since virsh allows passing user-provided
strings as paths of block devices we can't reliably use block job events
for detection of block job states so the function contains a great deal
of fallback logic.
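To make the idea of one shared wait helper with fallback logic concrete, here is a minimal Python sketch. It is not the code of vshBlockJobWait: `block_job_wait`, `replay`, `JobStatus` and both callables are hypothetical, and the fallback policy (infer the outcome from the last observed progress when the job disappears without an event) is an assumption chosen for illustration that may differ from what virsh does.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class JobStatus(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    TIMEOUT = "timeout"


def block_job_wait(
    poll: Callable[[], Optional[Tuple[int, int]]],
    pending_event: Callable[[], Optional[JobStatus]],
    max_polls: int,
) -> JobStatus:
    """Wait for a simulated block job, preferring events over polling."""
    last = None
    for _ in range(max_polls):
        event = pending_event()
        if event is not None:
            # An event matched our disk: trust it.
            return event
        info = poll()
        if info is None:
            # Fallback (assumed policy): no event arrived and the job is
            # gone, so judge by the last progress we managed to observe.
            if last is not None and last[0] == last[1]:
                return JobStatus.COMPLETED
            return JobStatus.FAILED
        last = info
    return JobStatus.TIMEOUT


def replay(values, default=None):
    """Return a callable that yields the given values, then `default`."""
    it = iter(values)
    return lambda: next(it, default)


# An event reports failure on the second iteration.
by_event = block_job_wait(
    replay([(5, 100), (5, 100)]), replay([None, JobStatus.FAILED]), 10)
# No events at all; the job reaches 100 % and then disappears.
by_fallback = block_job_wait(
    replay([(50, 100), (100, 100), None]), replay([]), 10)
# No events; the job disappears at 10 %.
vanished = block_job_wait(replay([(10, 100), None]), replay([]), 10)
# The job never progresses and the poll budget runs out.
timed_out = block_job_wait(replay([], default=(10, 100)), replay([]), 3)
print(by_event, by_fallback, vanished, timed_out)
```

Keeping the event path and the polling path in one function is what lets the three commands (pull, commit, copy) share the same behaviour.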
commit eae59247c59aa02147b2b4a50177e8e877fdb218
Author: Peter Krempa <pkrempa>
Date: Wed Jul 15 15:11:02 2015 +0200
qemu: Update state of block job to READY only if it actually is ready
Few parts of the code looked at the current progress of and assumed that
a two phase blockjob is in the _READY state as soon as the progress
reached 100% (info.cur == info.end). In current versions of qemu this
assumption is invalid and qemu exposes a new flag 'ready' in the
query-block-jobs output that is set to true if the job is actually
finished.
This patch adds internal data handling for reading the 'ready' flag and
acting appropriately as long as the flag is present.
While this still doesn't fix the virsh client problem with two phase
block jobs and the --pivot option, it at least improves the error
message:

  $ virsh blockcommit --wait --verbose vm vda --base vda[1] --active --pivot
  Block commit: [100 %]error: failed to pivot job for disk vda
  error: internal error: unable to execute QEMU command 'block-job-complete': The active block job for device 'drive-virtio-disk0' cannot be completed

to

  $ virsh blockcommit --wait --verbose VM vda --base vda[1] --active --pivot
  Block commit: [100 %]error: failed to pivot job for disk vda
  error: block copy still active: disk 'vda' not ready for pivot yet

v1.2.17-142-gfaa1439
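The readiness check described in the last commit can be sketched as follows. This is an illustrative Python model, not libvirt's C implementation: `find_job` and `job_is_ready` are hypothetical helpers, the JSON replies are hand-written and trimmed to the fields used here, and falling back to the progress comparison when the flag is missing is an assumption based on the phrase "as long as the flag is present".

```python
import json
from typing import Optional


def find_job(reply: str, device: str) -> Optional[dict]:
    """Return the job entry for `device` from a query-block-jobs reply."""
    for job in json.loads(reply)["return"]:
        if job["device"] == device:
            return job
    return None


def job_is_ready(job: dict) -> bool:
    """Decide whether a two phase block job can be pivoted."""
    if "ready" in job:
        # Newer qemu: trust the explicit flag, even at 100 % progress.
        return bool(job["ready"])
    # Assumed fallback for replies without the flag: the old heuristic
    # that treats "offset reached len" as ready.
    return job["offset"] == job["len"]


# Hand-written sample replies (trimmed).
at_100_not_ready = (
    '{"return": [{"type": "commit", "device": "drive-virtio-disk0",'
    ' "len": 1000, "offset": 1000, "ready": false}]}'
)
at_100_ready = at_100_not_ready.replace("false", "true")
no_flag = (
    '{"return": [{"type": "commit", "device": "drive-virtio-disk0",'
    ' "len": 1000, "offset": 1000}]}'
)

for reply in (at_100_not_ready, at_100_ready, no_flag):
    job = find_job(reply, "drive-virtio-disk0")
    print(job_is_ready(job))
```

The first sample is the case the commit message is about: progress is at 100 % but the job has not yet switched to the synchronized phase, so a pivot attempt would be rejected.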
Verified with libvirt-1.2.17-3.el7.x86_64 and qemu-kvm-rhev-2.3.0-14.el7.x86_64

Scenario 1: disconnect from the gluster server while a blockcopy job is running -- PASSED
1. # virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [ 4 %]
2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP
3. Check the blockcopy status after a few seconds:
# virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [ 4 %]
Copy failed

Scenario 2: disconnect from the gluster server while a blockcommit job is running -- FAILED
1. # virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [ 5 %]
2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP
3. Check the blockcommit status after a while:
# virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [5 %]
Commit failed
# virsh dumpxml vm3 | grep mirror

Scenario 3: disconnect from the gluster server while a blockpull job is running -- FAILED
1. # virsh blockpull vm3 vda --wait --verbose
Block Pull: [ 10 %]
2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP
3. Check the blockpull status after a while:
# virsh blockpull vm3 vda --wait --verbose
Block Pull: [10 %]
Pull failed

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2015-2202.html
Description of problem:
When the network connection is broken during a blockcommit, libvirt reports wrong info about the result of the blockcommit, and a later blockcommit then fails.

Version-Release number of selected component (if applicable):
libvirt-1.2.15-2.el7.x86_64
qemu-kvm-rhev-2.3.0-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a healthy guest with its base image on gluster.
# virsh dumpxml gluster | grep disk -A 9
<disk type='network' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source protocol='gluster' name='gluster-vol1/r7q2.img'>
    <host name='$server_IP'/>
  </source>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>

2. Create snapshots for this guest.
# for i in {1..3}; do virsh snapshot-create-as gluster s$i --disk-only --diskspec vda,file=/tmp/s$i ; done
Domain snapshot s1 created
Domain snapshot s2 created
Domain snapshot s3 created
# virsh snapshot-list gluster
 Name                 Creation Time             State
------------------------------------------------------------
 s1                   2015-06-01 16:42:46 +0800 disk-snapshot
 s2                   2015-06-01 16:43:59 +0800 disk-snapshot
 s3                   2015-06-01 16:44:12 +0800 disk-snapshot

3. Do the blockcommit.
In terminal 1 (do blockcommit):
# virsh blockcommit gluster vda --active --verbose --wait
Block Commit: [30 %]
In terminal 2 (break the network connection to the gluster server):
# iptables -A OUTPUT -d $server_IP -j DROP

4. Check the result a few minutes later.
In terminal 1 (blockcommit finished):
# virsh blockcommit gluster vda --active --verbose --wait
Block Commit: [100 %]
Now in synchronized phase
# virsh blockjob gluster vda --info
No current block job for vda
Check the domain XML; it is not in the mirror phase.
# virsh dumpxml gluster | grep disk -A 15
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/tmp/s3'/>
  <backingStore type='file' index='1'>
    <format type='qcow2'/>
    <source file='/tmp/s2'/>
    <backingStore type='file' index='2'>
      <format type='qcow2'/>
      <source file='/tmp/s1'/>
      <backingStore type='network' index='3'>
        <format type='qcow2'/>
        <source protocol='gluster' name='gluster-vol1/r7q2.img'>
          <host name='$server_IP'/>
        </source>
        <backingStore/>
......

5. Recover the network connection and do the blockcommit again.
# iptables -D OUTPUT -d $server_IP -j DROP
# virsh blockcommit gluster vda --active --verbose --wait
error: internal error: unable to execute QEMU command 'block-commit': Error (Operation not permitted) flushing drive

Actual results:
In step 4, virsh reports that the blockcommit finished and is in the mirror phase, but actually it is not.
In step 5, the blockcommit fails again after the network connection is recovered.

Expected results:
In step 4, give correct info about the result of the blockcommit.
In step 5, the blockcommit succeeds after the network connection is recovered.

Additional info:
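The manual cross-check from step 4 (virsh claims the synchronized phase, but the domain XML shows no mirror) can be automated. The following is a minimal Python sketch under stated assumptions: `disk_has_mirror` is a hypothetical helper, and both XML documents are hand-written, trimmed samples rather than output captured from this bug. In practice the XML would come from `virsh dumpxml`.

```python
import xml.etree.ElementTree as ET


def disk_has_mirror(domain_xml: str, target_dev: str) -> bool:
    """Report whether the disk with the given target has a <mirror> child."""
    root = ET.fromstring(domain_xml)
    for disk in root.iter("disk"):
        target = disk.find("target")
        if target is not None and target.get("dev") == target_dev:
            # Compare with None: an Element without children is falsy.
            return disk.find("mirror") is not None
    raise LookupError(f"no disk with target dev {target_dev!r}")


# Hand-written sample: a disk that really is in the synchronized phase.
SYNCHRONIZED = """<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <source file='/tmp/s3'/>
      <mirror type='file' job='active-commit' ready='yes'>
        <source file='/tmp/s1'/>
      </mirror>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>"""

# Hand-written sample: the state seen in step 4, with no mirror at all.
NO_MIRROR = """<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <source file='/tmp/s3'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>"""

print(disk_has_mirror(SYNCHRONIZED, "vda"), disk_has_mirror(NO_MIRROR, "vda"))
```

If the client prints "Now in synchronized phase" while this check returns False for the same disk, the reported state and the actual state disagree, which is the symptom described in this report.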