Bug 725373
Summary: [libvirt] when using domjobabort to abort a stuck migration, the migration command still hangs.

Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.1
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: David Naori <dnaori>
Assignee: Jiri Denemark <jdenemar>
QA Contact: Virtualization Bugs <virt-bugs>
CC: abaron, berrange, dallan, danken, dyuan, hateya, mgoldboi, mzhan, rwu, vbian, veillard, weizhan, wolfram, ykaul, yupzhang
Target Milestone: rc
Target Release: ---
Fixed In Version: libvirt-0.9.9-1.el6
Doc Type: Bug Fix
Doc Text:
When a destination host lost network connectivity while a domain was being migrated to it, the migration process could not be canceled. While the cancellation request itself succeeded, the follow-up steps needed to make both sides aware of the cancellation still blocked, which prevented the migration protocol from finishing. In Red Hat Enterprise Linux 6.3, libvirt implements an internal keepalive protocol that can detect broken connections or blocked libvirt daemons (within 30 seconds with the default configuration). When such a situation is detected during migration, libvirt automatically cancels the migration.
Clones: 799478 (view as bug list)
Last Closed: 2012-06-20 06:29:28 UTC
Bug Depends On: 669581
Bug Blocks: 723198, 773650, 773651, 773677, 773696, 799478
I'm not 100% certain that it's the same behavior since I'm using different migration parameters, but I am able to reproduce this behavior with:

virsh migrate --persistent --undefinesource --tunnelled --p2p --live --desturi 'qemu+ssh://root@hybrid0/system' spice

There are several issues we're hitting in this scenario:

- If virsh domjobabort is called when migration is not in the perform phase, it has no effect.

- If the connectivity between the source and destination machines is lost (using iptables in this case) while the source qemu is still sending data to the destination qemu, even domjobabort may hang because qemu blocks inside the migrate_cancel command. That's because it tries to flush its buffers first, which may block in send() due to full connection buffers. A blocked migrate_cancel command means that the next attempt to send the query-migrate command to qemu times out on the job condition and the perform phase fails. So this situation is only partly bad.

- Finally, if we manage to get out of the perform phase, we call the Finish API on the destination, which hangs because we never see a response to it. The good thing is that it shouldn't hang forever, only until the TCP connections time out. It took about 15 minutes for me to return to a normal state on the source host. The destination host needs more time to realize the connection was lost and that the domain should be automatically destroyed there.

So I think we need three things:

- make domjobabort smarter so that it can ideally cancel migration in any phase (I opened a new bug 728904 for this)
- use the fd: protocol for migration instead of relying on qemu to open a TCP connection to the destination qemu and send data through it (tracked by bug 720269)
- introduce in-protocol keep-alive messages so that a dead peer (host or just libvirtd) can be detected faster

Let's leave this BZ to track the last of the three.

(In reply to comment #4)
> ...
> - use fd: protocol for migration instead of relying on qemu to open a tcp
> connection to destination qemu and send data through it (tracked by bug
> 720269)

This was based on my plans to make libvirtd a proxy between the source and destination qemu, in which case we could cancel the migration no matter what qemu does. The idea turned out not to be as good as I thought. Instead, the migrate_cancel command should be fixed in qemu to cancel the migration without blocking. Bug 669581 tracks the migrate_cancel bug in qemu.

The series adding keepalive messages to the libvirt RPC protocol was sent upstream for review: https://www.redhat.com/archives/libvir-list/2011-September/msg00938.html

Support for keepalive messages is now upstream and will be included in the 0.9.8 release. The series ends with commit v0.9.7-157-g2c4cdb7.

Tested with:
kernel-2.6.32-223.el6.x86_64
qemu-kvm-0.12.1.2-2.213.el6.x86_64
libvirt-0.9.9-1.el6.x86_64

1. Do the migration:
# virsh migrate --live --p2p kvm-rhel6u2-x86_64-new qemu+tls://10.66.83.197/system
At the same time, on another console:
# iptables -A OUTPUT -d 10.66.83.197 -j REJECT
Then:
# virsh domjobabort kvm-rhel6u2-x86_64-new

The migration job does not return immediately, but waits for a while and then returns an error: An error occurred, but the cause is unknown

# virsh list
 Id Name                   State
----------------------------------
 8  kvm-rhel6u2-x86_64-new running

But on the target host, virsh list hangs forever. Is that OK?

Retested on:
kernel-2.6.32-225.el6.x86_64
libvirt-0.9.9-1.el6.x86_64
qemu-kvm-0.12.1.2-2.213.el6.x86_64

I cannot reproduce the phenomenon from comment 14; everything works well, so verification passes.

Interesting... In case you hit it again in the future, please attach debug logs from both the source and destination libvirt daemons. And always be sure you run virsh domjobabort after migration data starts flowing from source to destination. Running it earlier doesn't produce the expected results (see bug 728904).
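For reference, the keepalive behavior added by the series above is tunable on the server side. A minimal sketch of the relevant /etc/libvirt/libvirtd.conf settings (option names as documented for libvirt 0.9.8 and later; the values shown are the assumed defaults, so verify them against your build):

```
# /etc/libvirt/libvirtd.conf -- RPC keepalive tuning (libvirt 0.9.8+)
keepalive_interval = 5   # seconds between keepalive probes sent to clients
keepalive_count = 5      # unanswered probes before the connection is closed
```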
Hi Jiri, I reproduced the phenomenon from comment 14 again on:
kernel-2.6.32-230.el6.x86_64
qemu-kvm-0.12.1.2-2.231.el6.x86_64
libvirt-0.9.10-1.el6.x86_64
I will attach the debug logs, but they are a little big.

Created attachment 564347 [details]
source libvirtd log
Created attachment 564348 [details]
dest libvirtd log
Hi Jiri, do I need to reassign the bug? I can still reproduce the phenomenon from comment 14 on:
qemu-kvm-0.12.1.2-2.232.el6.x86_64
kernel-2.6.32-225.el6.x86_64
libvirt-0.9.10-3.el6.x86_64
so I am reassigning this bug.

Oh, I see what it is. virsh domjobabort correctly cancels the migration, and the source libvirtd tries to call the Finish API on the destination libvirtd to tell it about the abort. Since all packets to the destination libvirtd are discarded, the API waits for some time (~30 seconds) until the broken connection is detected, and then the source libvirtd correctly resumes the domain and reports the migration API as finished. So far, everything worked as expected. However, it seems the error message was eaten somewhere along the way and virsh has nothing useful to report.

Actually, the eaten error message is a minor issue which doesn't need to block this bug. I'm moving this BZ back to ON_QA and I'll create a new bug requesting that the error message be fixed.

But another problem is that on the target machine "virsh list" hangs; do I need to file a new bug for that?

Oh, I somehow missed that... Let me check quickly.

OK, the part handled by this BZ seems to be working just fine. The destination libvirtd correctly detects that the connection to the source is broken and starts to destroy the migrated domain. But libvirtd seems to hang while doing so. Any chance the destination host also lost connection to the shared storage used by the migrated domain? Perhaps because it used NFS from the source host?

(In reply to comment #26)
> OK, the part handled by this BZ seems to be working just fine. Destination
> libvirtd correctly detects the connection to the source is broken and starts
> to destroy the migrated domain. But libvirtd seems to hang while doing so.
> Any chance the destination host also lost connection to the shared storage
> used by the migrated domain? Perhaps because it used nfs from the source
> host?

Yes, you are right. The virsh list hang is because I used NFS from the source host. When using a third host as the NFS server, virsh list on the target machine does not hang. But is that a new problem? Anyway, this bug can be verified since domjobabort really aborts the migration, so I am changing the status to VERIFIED.

(In reply to comment #27)
> Yes, you are right. The virsh list hang is because I use the nfs on source
> host. With using the third host as nfs server, on target machine virsh list
> will not hang. But is that a new problem?

Hmm, AFAIK we had similar bugs in the past, e.g., bug 746666, which is supposed to be fixed already. I think the safest way is to file a new bug for the hang you see.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When a destination host lost network connectivity while a domain was being migrated to it, the migration process could not be canceled. While the cancellation request itself succeeded, the follow-up steps needed to make both sides aware of the cancellation still blocked, which prevented the migration protocol from finishing. In Red Hat Enterprise Linux 6.3, libvirt implements an internal keepalive protocol that can detect broken connections or blocked libvirt daemons (within 30 seconds with the default configuration). When such a situation is detected during migration, libvirt automatically cancels the migration.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0748.html
Created attachment 515009 [details]
libvirtd log

Description of problem:
When using domjobabort to abort a stuck migration (destination host blocked during migration via iptables), the migration command still hangs.

Version-Release number of selected component (if applicable):
libvirt-python-0.9.3-7.el6.x86_64

How reproducible:
100%

Steps:
[root@magenta-vds6 ~]# virsh -r list
 Id Name                 State
----------------------------------
 8  NFS-FOR-TEMP         running

[root@magenta-vds6 ~]# virsh migrate NFS-FOR-TEMP --live --p2p qemu+tls://10.35.116.1/system

On another shell:
[root@magenta-vds6 ~]# iptables -A OUTPUT -d 10.35.116.1 -j REJECT
[root@magenta-vds6 ~]# virsh domjobabort NFS-FOR-TEMP

Actual results:
The migration command still hangs.

Expected results:
The migration should return.

Additional info:
libvirtd logs attached.
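The reproduction steps above can be sketched as a single script. This is a hedged sketch rather than part of the original report: the destination IP and domain name are the values from the steps, the 5-second delay before cutting connectivity is an assumption, and the script needs root plus a working libvirt/qemu setup to do anything (it exits early if virsh is absent):

```shell
#!/bin/sh
# Reproduction sketch for this bug. DEST_IP and DOMAIN come from the
# report's steps; adjust them for your environment.
DEST_IP=10.35.116.1
DOMAIN=NFS-FOR-TEMP

# Guard: without libvirt installed there is nothing to reproduce.
if ! command -v virsh >/dev/null 2>&1; then
    echo "virsh not installed; nothing to do"
    exit 0
fi

# Start a live p2p migration in the background.
virsh migrate "$DOMAIN" --live --p2p "qemu+tls://$DEST_IP/system" &
mig_pid=$!

# Give the migration time to start moving data, then cut connectivity
# to the destination host (assumed delay; tune as needed).
sleep 5
iptables -A OUTPUT -d "$DEST_IP" -j REJECT

# Abort the stuck job; with the keepalive fix this should return within
# roughly 30 seconds instead of hanging.
virsh domjobabort "$DOMAIN"
wait "$mig_pid"

# Clean up the firewall rule so the host regains connectivity.
iptables -D OUTPUT -d "$DEST_IP" -j REJECT
```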