Bug 1227551

Summary: [blockcommit]Wrong blockcommit process with gluster based disk when the network connection was broken
Product: Red Hat Enterprise Linux 7 Reporter: Pei Zhang <pzhang>
Component: libvirt Assignee: Peter Krempa <pkrempa>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.2 CC: dyuan, mzhan, pkrempa, rbalakri, shyu, xuzhang, yanyang
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-1.2.17-3.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1235004 (view as bug list) Environment:
Last Closed: 2015-11-19 06:39:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1200862, 1235004    
Bug Blocks:    

Description Pei Zhang 2015-06-03 01:53:13 UTC
Description of problem:
If the network connection to the gluster server is broken during a blockcommit, libvirt reports wrong information about the result of the blockcommit, and a subsequent blockcommit then fails.

Version-Release number of selected component (if applicable):
libvirt-1.2.15-2.el7.x86_64
qemu-kvm-rhev-2.3.0-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a healthy guest with its base image on gluster.
# virsh dumpxml gluster | grep disk -A 9
<disk type='network' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source protocol='gluster' name='gluster-vol1/r7q2.img'>
        <host name='$server_IP'/>
      </source>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

2. Create snapshots for this guest.
# for i in {1..3}; do virsh snapshot-create-as gluster s$i --disk-only --diskspec vda,file=/tmp/s$i ; done
Domain snapshot s1 created
Domain snapshot s2 created
Domain snapshot s3 created

# virsh snapshot-list gluster
 Name                 Creation Time             State
------------------------------------------------------------
 s1                   2015-06-01 16:42:46 +0800 disk-snapshot
 s2                   2015-06-01 16:43:59 +0800 disk-snapshot
 s3                   2015-06-01 16:44:12 +0800 disk-snapshot

3. Do a blockcommit.
In terminal 1 (do the blockcommit):
# virsh blockcommit gluster vda --active --verbose --wait
Block Commit: [30 %]

In terminal 2 (break the network connection to the gluster server):
#  iptables -A OUTPUT -d $server_IP -j DROP

4. Check the result a few minutes later.
In terminal 1 (the blockcommit reports that it finished):
# virsh blockcommit gluster vda --active --verbose --wait
Block Commit: [100 %]
Now in synchronized phase

# virsh blockjob gluster vda --info
No current block job for vda

Check the domain XML: the disk is not in the mirror phase.

# virsh dumpxml gluster | grep disk -A 15
<disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/s3'/>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/tmp/s2'/>
        <backingStore type='file' index='2'>
          <format type='qcow2'/>
          <source file='/tmp/s1'/>
          <backingStore type='network' index='3'>
            <format type='qcow2'/>
            <source protocol='gluster' name='gluster-vol1/r7q2.img'>
              <host name='$server_IP'/>
            </source>
            <backingStore/>
......
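
The check in step 4 can be scripted. The following is a minimal sketch, assuming the domain XML is supplied on standard input; has_mirror is a hypothetical helper name, not a virsh command. It only looks for a <mirror> element anywhere in the XML and does not tie it to a particular disk, so treat it as a rough indicator rather than a precise test.

```shell
# Hypothetical helper: read domain XML on stdin and report whether it
# contains a <mirror> element. Assumption: libvirt shows such an element
# on a disk that has a copy or active commit job in progress.
has_mirror() {
    if grep -q '<mirror[ />]'; then
        echo "mirror present"
    else
        echo "no mirror"
        return 1
    fi
}
```

Usage, with the domain name from this report: virsh dumpxml gluster | has_mirror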

5. Recover the network connection and do the blockcommit again.
#  iptables -D OUTPUT -d $server_IP -j DROP

# virsh blockcommit gluster vda --active --verbose --wait
error: internal error: unable to execute QEMU command 'block-commit': Error (Operation not permitted) flushing drive
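
When repeating the steps above, it is easy to leave the DROP rule from step 3 in place. The sketch below wraps the two iptables commands from steps 3 and 5 so that the rule is always removed again; with_blocked_server is a hypothetical helper name, and the sketch assumes it is run as root in terminal 2 while the blockcommit runs in terminal 1.

```shell
# Hypothetical helper: drop outgoing traffic to SERVER_IP while COMMAND
# runs, then remove the rule again regardless of COMMAND's exit status.
# Usage: with_blocked_server SERVER_IP COMMAND [ARGS...]
with_blocked_server() {
    server_ip=$1
    shift
    # Same rule as in step 3; give up if it cannot be added.
    iptables -A OUTPUT -d "$server_ip" -j DROP || return 1
    "$@"
    status=$?
    # Same rule removal as in step 5.
    iptables -D OUTPUT -d "$server_ip" -j DROP
    return "$status"
}
```

For example, with_blocked_server "$server_IP" sleep 300 keeps the connection broken for five minutes and then restores it.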


Actual results:
In step 4, it reports that the blockcommit finished and is in the mirror phase, but it actually is not.
In step 5, the blockcommit fails again after the network connection is recovered.

Expected results:
In step 4, correct information about the result of the blockcommit is given.
In step 5, the blockcommit succeeds after the network connection is recovered.


Additional info:

Comment 2 Peter Krempa 2015-06-23 17:38:25 UTC
Issue in step 4 is fixed with:
commit e7d3ff8464ed4833fa9c9bd9ef1f613f04434b31
Author: Peter Krempa <pkrempa>
Date:   Fri Jun 19 15:43:02 2015 +0200

    virsh: blockcopy: Report error if the copy job fails
    
    When the block job would fail while watching it using the "--wait"
    option for blockcopy, virsh would rather unhelpfully report:
    
    $ virsh blockcopy vm hdc /tmp/raw.img --granularity 4096 --verbose --wait
    
    Now in mirroring phase
    
    Add a special case when the block job vanishes while waiting for it to
    finish to improve the message:
    
    $ virsh blockcopy vm hdc /tmp/raw.img --granularity 8192 --verbose --wait
    error: Block Copy unexpectedly failed

v1.2.16-265-ge7d3ff8
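
The special case described in this commit (a job that vanishes while it is being waited for counts as a failure) can be approximated from outside virsh. This is only a sketch and not the code from the commit, which lives in virsh's C sources. It assumes that virsh blockjob --info prints "No current block job" when no job exists, as in step 4 of the description, and that a synchronized copy shows up in the domain XML as a <mirror> element with ready='yes'. The name wait_for_copy_ready is hypothetical, and the XML check is not restricted to the given disk.

```shell
# Hypothetical helper: poll a block copy job until it is synchronized,
# and treat a job that disappears before that point as a failure.
# Usage: wait_for_copy_ready DOMAIN DISK [INTERVAL_SECONDS]
wait_for_copy_ready() {
    domain=$1
    disk=$2
    interval=${3:-1}
    while :; do
        info=$(virsh blockjob "$domain" "$disk" --info 2>&1)
        case $info in
            *"No current block job"*)
                # The job vanished without ever becoming ready.
                echo "error: block job on $disk unexpectedly failed" >&2
                return 1
                ;;
        esac
        # Assumption: a synchronized job is marked ready='yes' in the XML.
        if virsh dumpxml "$domain" | grep -q "<mirror .*ready='yes'"; then
            echo "Now in mirroring phase"
            return 0
        fi
        sleep "$interval"
    done
}
```

Note that the sketch deliberately does not treat 100 % progress as success; only the ready marker or the disappearance of the job ends the loop.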

Comment 3 Peter Krempa 2015-06-23 17:50:31 UTC
I've filed https://bugzilla.redhat.com/show_bug.cgi?id=1235004 to track the issue described in comment 5 thus I'm moving this one to POST.

Comment 5 Yang Yang 2015-07-08 09:44:11 UTC
Peter,

The patch described in comment #2 only fixes the blockcopy issue; the blockcommit and blockpull issues are not fixed. blockcommit and blockpull always report success, for example:
Block Commit: [100 %]
Now in synchronized phase

Repro steps:
Scenario 1: disconnect from the gluster server while a blockcopy job is running -- PASSED
1.# virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [  2 %]

2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP

3. Check the blockcopy status after a few seconds
# virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [  4 %]error: Block Copy unexpectedly failed

Scenario 2: disconnect from the gluster server while a blockcommit job is running -- FAILED
1. # virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [ 13 %]

2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP

3. Check the blockcommit status after a while
# virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [100 %]
Now in synchronized phase

# virsh dumpxml vm3 | grep mirror

Scenario 3: disconnect from the gluster server while a blockpull job is running -- FAILED
1. # virsh blockpull vm3 vda --wait --verbose
Block Pull: [ 20 %]

2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP

3. Check the blockpull status after a while
# virsh blockpull vm3 vda --wait --verbose
Block Pull: [100 %]
Pull complete

Comment 6 Peter Krempa 2015-07-08 14:21:54 UTC
Indeed, I'll follow up to fix all other uses.

Comment 7 Peter Krempa 2015-07-21 13:44:18 UTC
Fixed upstream:

commit faa143912381aa48e33839b194b32cc14d574589
Author: Peter Krempa <pkrempa>
Date:   Mon Jul 13 17:04:49 2015 +0200

    virsh: Refactor block job waiting in cmdBlockCopy
    
    Similarly to the refactor of cmdBlockCommit in a previous commit this
    does the same change for cmdBlockCopy.

commit 7408403560f7d054da75acaab855a95c51a92e2b
Author: Peter Krempa <pkrempa>
Date:   Mon Jul 13 17:04:49 2015 +0200

    virsh: Refactor block job waiting in cmdBlockCommit
    
    Reuse the vshBlockJobWait infrastructure to refactor cmdBlockCommit to
    use the common code. This additionally fixes a bug when working with
    new qemus, where when doing an active commit with --pivot the pivoting
    would fail, since qemu reaches 100% completion but the job doesn't
    switch to synchronized phase right away.

commit 2e7827636476fdf976f17cd234b636973dedffc0
Author: Peter Krempa <pkrempa>
Date:   Mon Jul 13 17:04:49 2015 +0200

    virsh: Refactor block job waiting in cmdBlockPull
    
    Introduce helper function that will provide logic for waiting for block
    job completion so the 3 open coded places can be unified and improved.
    
    This patch introduces the whole logic and uses it to fix
    cmdBlockJobPull. The vshBlockJobWait function provides common logic for
    block job waiting that should be robust enough to work across all
    previous versions of libvirt. Since virsh allows passing user-provided
    strings as paths of block devices we can't reliably use block job events
    for detection of block job states so the function contains a great deal
    of fallback logic.

commit eae59247c59aa02147b2b4a50177e8e877fdb218
Author: Peter Krempa <pkrempa>
Date:   Wed Jul 15 15:11:02 2015 +0200

    qemu: Update state of block job to READY only if it actually is ready
    
    Few parts of the code looked at the current progress of and assumed that
    a two phase blockjob is in the _READY state as soon as the progress
    reached 100% (info.cur == info.end). In current versions of qemu this
    assumption is invalid and qemu exposes a new flag 'ready' in the
    query-block-jobs output that is set to true if the job is actually
    finished.
    
    This patch adds internal data handling for reading the 'ready' flag and
    acting appropriately as long as the flag is present.
    
    While this still doesn't fix the virsh client problem with two phase
    block jobs and the --pivot option, it at least improves the error
    message:
    
    $ virsh blockcommit  --wait --verbose vm vda  --base vda[1] --active --pivot
    Block commit: [100 %]error: failed to pivot job for disk vda
    error: internal error: unable to execute QEMU command 'block-job-complete': The active block job for device 'drive-virtio-disk0' cannot be completed
    
    to
    
    $ virsh blockcommit  --wait --verbose VM vda  --base vda[1] --active --pivot
    Block commit: [100 %]error: failed to pivot job for disk vda
    error: block copy still active: disk 'vda' not ready for pivot yet


v1.2.17-142-gfaa1439
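
The decision described in the last commit above (trust qemu's 'ready' flag when it is reported, instead of inferring readiness from 100 % progress) can be illustrated with a small sketch. The helper name job_is_ready and its argument layout are hypothetical; falling back to the cur == end comparison when the flag is absent is an assumption based on the commit message, not a copy of the libvirt code.

```shell
# Hypothetical helper modelling the readiness decision.
# Usage: job_is_ready CUR END [READY]
#   CUR, END: progress values of the block job (info.cur, info.end)
#   READY:    "true" or "false" as reported by query-block-jobs, or
#             omitted for a qemu that does not expose the flag
job_is_ready() {
    cur=$1
    end=$2
    ready=${3-}
    case $ready in
        true)  return 0 ;;               # qemu says the job is ready
        false) return 1 ;;               # 100 % alone is not enough
        *)     [ "$cur" -eq "$end" ] ;;  # no flag: old heuristic
    esac
}
```

With this logic, job_is_ready 100 100 false fails, which corresponds to the "not ready for pivot yet" case in the improved error message.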

Comment 9 Yang Yang 2015-08-04 10:08:33 UTC
Verified with libvirt-1.2.17-3.el7.x86_64 and qemu-kvm-rhev-2.3.0-14.el7.x86_64

Scenario 1: disconnect from the gluster server while a blockcopy job is running -- PASSED
1.# virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [  4 %]

2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP

3. Check the blockcopy status after a few seconds
# virsh blockcopy vm3 vda /tmp/vm3.copy --wait --verbose
Block Copy: [  4 %]
Copy failed

Scenario 2: disconnect from the gluster server while a blockcommit job is running -- FAILED
1. # virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [ 5 %]

2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP

3. Check the blockcommit status after a while
# virsh blockcommit vm3 vda --active --wait --verbose
Block Commit: [5 %]
Commit failed

# virsh dumpxml vm3 | grep mirror

Scenario 3: disconnect from the gluster server while a blockpull job is running -- FAILED
1. # virsh blockpull vm3 vda --wait --verbose
Block Pull: [ 10 %]

2. # iptables -A OUTPUT -d 10.66.4.164 -j DROP

3. Check the blockpull status after a while
# virsh blockpull vm3 vda --wait --verbose
Block Pull: [10 %]
Pull failed
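
The three scenarios above can be checked mechanically by looking for the failure messages in the captured virsh output. A minimal sketch, assuming the output of the --wait --verbose command has been saved or piped in; reported_failure is a hypothetical helper name, and the patterns are taken from the messages quoted in this report.

```shell
# Hypothetical helper: succeed if the virsh output on stdin shows that
# the failure of the block job was reported to the user.
reported_failure() {
    grep -Eq '(Copy|Commit|Pull) failed|unexpectedly failed'
}
```

For example, output ending in "Now in synchronized phase" or "Pull complete" makes the helper return non-zero, which in these disconnect scenarios indicates the original bug.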

Comment 11 errata-xmlrpc 2015-11-19 06:39:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html