Bug 1022319

Summary: Migration takes a long time when the guest has a glusterfs volume and the destination host cannot access that glusterfs volume
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.6
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Reporter: chhu
Assignee: Jiri Denemark <jdenemar>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ajia, dyuan, juzhang, mzhan, rbalakri, xuzhang, zpeng
Target Milestone: rc
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-02-24 13:28:40 UTC
Bug Blocks: 1024632

Description chhu 2013-10-23 04:17:00 UTC
Description of problem:
Migration takes a long time when the guest has a glusterfs volume and the destination host cannot access that glusterfs volume.

Version-Release number of selected component (if applicable):
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-0.12.1.2-2.414.el6.x86_64
qemu-img-0.12.1.2-2.414.el6.x86_64
glusterfs-3.4.0.34rhs-1.el6.x86_64

How reproducible:
100%

Steps:
1. On glusterfs client A, start a guest with a glusterfs volume:
# virsh dumpxml r6-qcow2|grep disk -A 6
    <disk type='network' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source protocol='gluster' name='gluster-vol1/redhat6-qcow2.img'>
        <host name='10.66.82.251' port='24007'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

# virsh create r6-qcow2.xml
Domain r6-qcow2 created from r6-qcow2.xml

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 27    r6-qcow2                       running

2. Do a live migration to another glusterfs client B:
# virsh migrate --live --verbose r6-qcow2 qemu+ssh://10.66.82.145/system --unsafe --timeout 600
root.82.145's password:

open another terminal:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 27    r6-qcow2                       running

3. Check the domain status on destination host B:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 7     r6-qcow2                       shut off

4. About 10 minutes later, the status on A changed to paused (consistent with --timeout 600 suspending the guest after 600 seconds), but the "virsh migrate --live" command line is still running; a way to inspect the hanging job is sketched below.
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 27    r6-qcow2                       paused
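
From the second terminal, the hanging migration job can also be inspected and, if necessary, cancelled; a minimal sketch using the domain name above:

# virsh domjobinfo r6-qcow2     # show the active migration job and the data transferred/remaining
# virsh domjobabort r6-qcow2    # abort the hanging migration job on the source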


5. Hours later, the virsh command line returns with the error messages below:
# virsh migrate --live --verbose r6-qcow2 qemu+ssh://10.66.82.145/system --unsafe --timeout 600
root.82.145's password: 
2013-10-22 10:34:45.128+0000: 24692: info : libvirt version: 0.10.2, package: 29.el6 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2013-10-09-06:25:35, x86-026.build.eng.bos.redhat.com)
2013-10-22 10:34:45.128+0000: 24692: warning : virDomainMigrateVersion3:4922 : Guest r6-qcow2 probably left in 'paused' state on source
error: End of file while reading data: : Input/output error
error: One or more references were leaked after disconnect from the hypervisor
error: Reconnected to the hypervisor

6. Check the domain status on destination host B:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     r6-qcow2                       shut off
# virsh undefine r6-qcow2
error: Failed to undefine domain r6-qcow2
error: Requested operation is not valid: cannot undefine transient domain
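
A side note on the failed undefine: a domain started with "virsh create" is transient and has no persistent configuration, so undefine is expected to fail. Once libvirt is able to stop such a domain, it is normally removed with destroy instead (a sketch, not verified against the stuck state shown here):

# virsh destroy r6-qcow2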


Actual results:
In step 2, the virsh migrate command runs for a long time.

Expected results:
In step 2, the virsh migrate command should return with a suitable error message when the timeout is reached.

Comment 2 Jiri Denemark 2014-05-07 12:41:27 UTC
I was not able to reproduce this issue. If I block all traffic from the gluster node to host B, the attempt to start a new qemu process for the incoming migration times out in about 2 minutes and the migration is aborted. And libvirtd on the destination correctly removes the domain.
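
For reference, one minimal way to cut host B off from the gluster node for this kind of test, reusing the gluster host address from the XML above (an illustrative sketch, not necessarily what was done in the original report):

# iptables -I INPUT -s 10.66.82.251 -j DROP      # drop packets arriving from the gluster node
# iptables -I OUTPUT -d 10.66.82.251 -j DROP     # drop packets sent to the gluster node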

So it seems like you might have blocked too much traffic (although even in that case I'm still not sure how it could behave the way you describe), which should not make libvirt behave strangely, but it's a slightly different scenario than a lost connection to the storage server.

I'll need more data from you since you did not provide a lot of details when filing this bug. What were the exact steps (i.e., how did you block the traffic, which host did you use to run virsh, etc.)? And please, provide debug logs from libvirtd and /var/log/libvirt/qemu/DOMAIN.log from both source and destination hosts whenever you file a bug which involves migration.
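
For completeness, one way to collect libvirtd debug logs on RHEL 6 is to set the logging options in /etc/libvirt/libvirtd.conf and restart the daemon (a sketch; adjust the log path to taste):

# cat >> /etc/libvirt/libvirtd.conf << 'EOF'
log_level = 1
log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"
EOF
# service libvirtd restart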

Comment 4 chhu 2014-05-30 09:35:50 UTC
I can reproduce this with:
libvirt-0.10.2-37.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64

1. On glusterfs client A, start a guest with a glusterfs volume successfully, with these packages:
libvirt-client-0.10.2-37.el6.x86_64
libvirt-0.10.2-37.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64
qemu-img-rhev-0.12.1.2-2.427.el6.x86_64
glusterfs-api-devel-3.4.0.59rhs-1.el6.x86_64
glusterfs-devel-3.4.0.59rhs-1.el6.x86_64
glusterfs-3.4.0.59rhs-1.el6.x86_64
glusterfs-rdma-3.4.0.59rhs-1.el6.x86_64
glusterfs-libs-3.4.0.59rhs-1.el6.x86_64
glusterfs-fuse-3.4.0.59rhs-1.el6.x86_64
glusterfs-api-3.4.0.59rhs-1.el6.x86_64

2. On server B, only the related packages below are installed, so it can't connect to the gluster volume.

libvirt-0.10.2-37.el6.x86_64
libvirt-client-0.10.2-37.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64
qemu-img-rhev-0.12.1.2-2.427.el6.x86_64
glusterfs-api-3.4.0.36rhs-1.el6.x86_64
glusterfs-libs-3.4.0.36rhs-1.el6.x86_64

3. Do the migration from server A to server B; it lasts for a long time and then ends with the error messages described in step 5. This issue is caused by the misconfiguration on server B, but it leads to a long wait on server A.
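
A quick way to confirm whether server B can reach the volume at all is to query the image over the gluster protocol directly, assuming the qemu-img-rhev build has gluster support (illustrative, reusing the host and image from the disk XML above):

# qemu-img info gluster://10.66.82.251/gluster-vol1/redhat6-qcow2.img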

Comment 7 Jiri Denemark 2015-02-24 13:28:40 UTC
After playing with this for some time, I don't think there's any bug here. By default, virsh does not use the keepalive protocol to detect broken connections and relies completely on TCP timeouts, which are much longer.
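
A client-side workaround consistent with this is to enable the keepalive protocol in virsh itself, so a dead connection is detected after a few unanswered keepalive messages instead of waiting for TCP timeouts; a sketch, assuming a virsh build that supports the global keepalive options:

# virsh -k 5 -K 3 migrate --live --verbose r6-qcow2 qemu+ssh://10.66.82.145/system --unsafe --timeout 600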

The strange shutoff state of a transient domain means the domain is stuck between the shutoff and running states. In other words, we consider a domain shut off until the corresponding qemu-kvm process starts and we can talk to its monitor and start guest CPUs. This is a bit annoying because the domain already has an ID and is considered active when listing domains. I will fix this cosmetic issue upstream but I don't think we should change it in RHEL-6.