Bug 1022319
Summary: | Migration lasts a long time when the guest has a glusterfs volume and the destination host cannot access that glusterfs volume | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | chhu |
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | |
Version: | 6.6 | CC: | ajia, dyuan, juzhang, mzhan, rbalakri, xuzhang, zpeng |
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-02-24 13:28:40 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1024632 | |
Description
chhu
2013-10-23 04:17:00 UTC
I can reproduce this with:

  * libvirt-0.10.2-37.el6.x86_64
  * qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64

1. On glusterfs client A, start a guest with a glusterfs volume successfully, with these packages installed:

  * libvirt-client-0.10.2-37.el6.x86_64
  * libvirt-0.10.2-37.el6.x86_64
  * qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64
  * qemu-img-rhev-0.12.1.2-2.427.el6.x86_64
  * glusterfs-api-devel-3.4.0.59rhs-1.el6.x86_64
  * glusterfs-devel-3.4.0.59rhs-1.el6.x86_64
  * glusterfs-3.4.0.59rhs-1.el6.x86_64
  * glusterfs-rdma-3.4.0.59rhs-1.el6.x86_64
  * glusterfs-libs-3.4.0.59rhs-1.el6.x86_64
  * glusterfs-fuse-3.4.0.59rhs-1.el6.x86_64
  * glusterfs-api-3.4.0.59rhs-1.el6.x86_64

2. On server B, only the packages below are installed, so it cannot connect to the gluster volume:

  * libvirt-0.10.2-37.el6.x86_64
  * libvirt-client-0.10.2-37.el6.x86_64
  * qemu-kvm-rhev-0.12.1.2-2.427.el6.x86_64
  * qemu-img-rhev-0.12.1.2-2.427.el6.x86_64
  * glusterfs-api-3.4.0.36rhs-1.el6.x86_64
  * glusterfs-libs-3.4.0.36rhs-1.el6.x86_64

3. Migrate the guest from server A to server B; the migration lasts a long time and then ends with the error message described in step 5.

Comment from Jiri Denemark:

I was not able to reproduce this issue. If I block all traffic from the gluster node to host B, the attempt to start a new qemu process for the incoming migration times out in about 2 minutes and the migration is aborted. And libvirtd on the destination correctly removes the domain. So it seems you might have blocked too much traffic (although even in that case I'm not sure how it could behave the way you describe), which should not make libvirt behave strangely, but it is a slightly different scenario from a lost connection to the storage server. I'll need more data from you, since you did not provide many details when filing this bug. What were the exact steps (i.e., how did you block the traffic, which host did you use to run virsh, etc.)? And please provide debug logs from libvirtd and /var/log/libvirt/qemu/DOMAIN.log from both the source and destination hosts whenever you file a bug that involves migration.
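For reference, one common way to simulate the blocked-storage scenario described above (blocking all traffic between the gluster node and destination host B) is a pair of iptables rules on host B. This is only an illustrative sketch; 192.0.2.10 is a placeholder for the gluster server's address, not an address from this bug:

```shell
# On destination host B: drop all traffic to and from the gluster node.
# 192.0.2.10 is a placeholder address.
iptables -A INPUT  -s 192.0.2.10 -j DROP
iptables -A OUTPUT -d 192.0.2.10 -j DROP

# Remove the rules again after the test:
iptables -D INPUT  -s 192.0.2.10 -j DROP
iptables -D OUTPUT -d 192.0.2.10 -j DROP
```

Note that blocking traffic this selectively (storage only, not libvirtd-to-libvirtd traffic) matters here, since blocking "too much" changes the scenario, as pointed out above.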
This issue is caused by the misconfiguration on server B, but it leads to a long wait on server A.

Comment from Jiri Denemark:

After playing with this for some time, I don't think there's any bug here. By default, virsh does not use the keepalive protocol to detect broken connections and relies completely on TCP timeouts, which are much longer. The strange "shut off" state of a transient domain means the domain is stuck between the shut off and running states. In other words, we consider a domain shut off until the corresponding qemu-kvm process starts, we can talk to its monitor, and guest CPUs are started. This is a bit annoying because the domain already has an ID and is considered active when listing domains. I will fix this cosmetic issue upstream, but I don't think we should change it in RHEL-6.
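Since virsh relies on TCP timeouts by default, the long wait can be shortened by enabling the keepalive protocol on the virsh connection. A sketch of such an invocation is below; the `--keepalive-interval`/`--keepalive-count` options were added to virsh in libvirt releases newer than the 0.10.2 build discussed in this bug, and the guest and host names are placeholders:

```shell
# Send a keepalive ping every 5 seconds; give up after 6 unanswered
# pings, so a dead peer is detected in roughly 30 seconds instead of
# waiting for the much longer TCP timeouts.
virsh --keepalive-interval 5 --keepalive-count 6 \
      migrate --live myguest qemu+ssh://hostB/system
```

With keepalive enabled, a migration against an unreachable destination fails after interval × count seconds rather than hanging for the duration of the TCP timeout.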