| Summary: | Migration takes a long time when the guest uses a glusterfs volume and the destination host cannot access that glusterfs volume | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | chhu |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.6 | CC: | ajia, dyuan, juzhang, mzhan, rbalakri, xuzhang, zpeng |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-02-24 13:28:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1024632 | ||
I was not able to reproduce this issue. If I block all traffic from the gluster node to host B, the attempt to start a new qemu process for the incoming migration times out in about 2 minutes, the migration is aborted, and libvirtd on the destination correctly removes the domain. So it seems you might have blocked too much traffic (although even in that case I am not sure how it could behave the way you describe); that should not make libvirt behave strangely either, but it is a slightly different scenario than a lost connection to the storage server.

I'll need more data from you, since you did not provide many details when filing this bug. What were the exact steps (i.e., how did you block the traffic, which host did you use to run virsh, etc.)? And please, provide debug logs from libvirtd and /var/log/libvirt/qemu/DOMAIN.log from both the source and destination hosts whenever you file a bug which involves migration.

I can reproduce this with:
- libvirt-0.10.2-37.el6.x86_64
- qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64

1. On glusterfs client A, start a guest with a glusterfs volume successfully, with these packages installed:
   - libvirt-client-0.10.2-37.el6.x86_64
   - libvirt-0.10.2-37.el6.x86_64
   - qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64
   - qemu-img-rhev-0.12.1.2-2.427.el6.x86_64
   - glusterfs-api-devel-3.4.0.59rhs-1.el6.x86_64
   - glusterfs-devel-3.4.0.59rhs-1.el6.x86_64
   - glusterfs-3.4.0.59rhs-1.el6.x86_64
   - glusterfs-rdma-3.4.0.59rhs-1.el6.x86_64
   - glusterfs-libs-3.4.0.59rhs-1.el6.x86_64
   - glusterfs-fuse-3.4.0.59rhs-1.el6.x86_64
   - glusterfs-api-3.4.0.59rhs-1.el6.x86_64
2. On server B, only the packages below are installed, so it cannot connect to the gluster volume:
   - libvirt-0.10.2-37.el6.x86_64
   - libvirt-client-0.10.2-37.el6.x86_64
   - qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64
   - qemu-img-rhev-0.12.1.2-2.427.el6.x86_64
   - glusterfs-api-3.4.0.36rhs-1.el6.x86_64
   - glusterfs-libs-3.4.0.36rhs-1.el6.x86_64
3. Migrate from server A to server B; the migration lasts a long time and then ends with the error message described in step 5.
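To tell these two scenarios apart (blocked traffic vs. a destination that simply cannot reach the volume), a quick probe of the gluster management port run on host B shows whether the destination has any connectivity at all. This is an illustrative sketch, not part of the original reproduction steps; the host and port are the ones quoted in this report and must be adjusted for other setups.

```shell
# Pre-flight check (sketch): can this host reach the gluster
# management port? Host/port values are taken from this bug report.
GLUSTER_HOST=10.66.82.251
GLUSTER_PORT=24007
if timeout 5 bash -c "exec 3<>/dev/tcp/$GLUSTER_HOST/$GLUSTER_PORT" 2>/dev/null; then
    echo "gluster management port reachable"
else
    echo "gluster management port NOT reachable"
fi
```

Note that this only checks TCP reachability of the management port; brick ports and the glusterfs client libraries on the destination still need to be in place for migration to work.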
This issue is caused by the misconfiguration on server B, but it leads to a long wait on server A.

After playing with this for some time, I don't think there's any bug here. By default, virsh does not use the keepalive protocol to detect broken connections and relies completely on TCP timeouts, which are much longer.

The strange "shut off" state of the transient domain means the domain is stuck between the shut-off and running states. In other words, we consider a domain shut off until the corresponding qemu-kvm process starts and we can talk to its monitor and start the guest CPUs. This is a bit annoying because the domain already has an ID and is considered active when listing domains. I will fix this cosmetic issue upstream, but I don't think we should change it in RHEL-6.
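The keepalive behaviour mentioned above is tunable on the daemon side. A sketch of the relevant settings in /etc/libvirt/libvirtd.conf (the values below are illustrative, not recommendations for this bug):

```
# /etc/libvirt/libvirtd.conf -- keepalive tuning (illustrative values).
# An unresponsive peer is declared dead after roughly
# keepalive_interval * keepalive_count seconds, instead of waiting
# for the much longer TCP timeout.
keepalive_interval = 5
keepalive_count = 5
```

Newer virsh builds also accept client-side keepalive options (-k/--keepalive-interval and -K/--keepalive-count); whether a given RHEL-6 build supports them should be checked with `virsh --help`.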
Description of problem:
Migration takes a long time when the guest uses a glusterfs volume and the destination host cannot access that glusterfs volume.

Version-Release number of selected component (if applicable):
- libvirt-0.10.2-29.el6.x86_64
- qemu-kvm-0.12.1.2-2.414.el6.x86_64
- qemu-img-0.12.1.2-2.414.el6.x86_64
- glusterfs-3.4.0.34rhs-1.el6.x86_64

How reproducible: 100%

Steps:

1. On glusterfs client A, start a guest with a glusterfs volume:

```
# virsh dumpxml r6-qcow2 | grep disk -A 6
    <disk type='network' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source protocol='gluster' name='gluster-vol1/redhat6-qcow2.img'>
        <host name='10.66.82.251' port='24007'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

# virsh create r6-qcow2.xml
Domain r6-qcow2 created from r6-qcow2.xml

# virsh list --all
 Id    Name        State
----------------------------------------------------
 27    r6-qcow2    running
```

2. Do live migration to another glusterfs client B:

```
# virsh migrate --live --verbose r6-qcow2 qemu+ssh://10.66.82.145/system --unsafe --timeout 600
root@10.66.82.145's password:
```

In another terminal:

```
# virsh list --all
 Id    Name        State
----------------------------------------------------
 27    r6-qcow2    running
```

3. Check the domain status on destination host B:

```
# virsh list --all
 Id    Name        State
----------------------------------------------------
 7     r6-qcow2    shut off
```

4. About 10 minutes later, the status on A changes to paused, but the "virsh migrate --live" command line is still running:

```
# virsh list --all
 Id    Name        State
----------------------------------------------------
 27    r6-qcow2    paused
```

5. Hours later, the virsh command line returns with the error messages below:

```
# virsh migrate --live --verbose r6-qcow2 qemu+ssh://10.66.82.145/system --unsafe --timeout 600
root@10.66.82.145's password:
2013-10-22 10:34:45.128+0000: 24692: info : libvirt version: 0.10.2, package: 29.el6 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2013-10-09-06:25:35, x86-026.build.eng.bos.redhat.com)
2013-10-22 10:34:45.128+0000: 24692: warning : virDomainMigrateVersion3:4922 : Guest r6-qcow2 probably left in 'paused' state on source
error: End of file while reading data: : Input/output error
error: One or more references were leaked after disconnect from the hypervisor
error: Reconnected to the hypervisor
```

6. Check the domain status on destination host B:

```
# virsh list --all
 Id    Name        State
----------------------------------------------------
 3     r6-qcow2    shut off

# virsh undefine r6-qcow2
error: Failed to undefine domain r6-qcow2
error: Requested operation is not valid: cannot undefine transient domain
```

Actual results:
In step 2, the virsh migrate command line runs for a long time.

Expected results:
In step 2, the virsh migrate command line should return with a suitable error message when it times out.
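Note that `--timeout 600` only suspends the guest once the timeout expires; it does not abort the migration, which is consistent with the switch to paused seen in step 4. As a sketch of a client-side workaround (not an endorsed fix), coreutils timeout(1) can impose a hard deadline on the whole virsh invocation; the domain name and URI below are the ones from this report.

```shell
# Abort the whole virsh invocation if migration has not finished in
# 600 seconds; timeout(1) exits with status 124 when it kills the child.
timeout 600 virsh migrate --live --verbose r6-qcow2 \
    qemu+ssh://10.66.82.145/system --unsafe
status=$?
if [ "$status" -eq 124 ]; then
    echo "migration aborted: client-side deadline exceeded"
fi
```

Killing the client does not by itself guarantee cleanup on either host, so the domain state should still be checked on both source and destination afterwards.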