Created attachment 515009 [details]
Description of problem:
When using domjobabort to abort a stuck migration (destination host blocked during migration via iptables), the migration command still hangs.
Version-Release number of selected component (if applicable):
[root@magenta-vds6 ~]# virsh -r list
 Id    Name             State
 8     NFS-FOR-TEMP     running
[root@magenta-vds6 ~]# virsh migrate NFS-FOR-TEMP --live --p2p qemu+tls://10.35.116.1/system
On another shell:
[root@magenta-vds6 ~]# iptables -A OUTPUT -d 10.35.116.1 -j REJECT
[root@magenta-vds6 ~]# virsh domjobabort NFS-FOR-TEMP
The migration command still hangs.
Expected result: the migration command should return.
libvirtd logs attached.
I'm not 100% certain that it's the same behavior since I'm using different migration parameters, but I am able to reproduce this behavior with:
virsh migrate --persistent --undefinesource --tunnelled --p2p --live --desturi 'qemu+ssh://root@hybrid0/system' spice
There are several issues we're hitting in this scenario:
- If virsh domjobabort is called when migration is not in the perform phase,
it has no effect.
- If the connectivity between src and dest machines is lost (using iptables in
this case) while src qemu is still sending data to dest qemu, even
domjobabort may hang because qemu blocks inside the migrate_cancel command.
That's because it tries to flush its buffers first, which may block in
send() when the connection buffers are full. A blocked migrate_cancel command
means that the next attempt to send a query-migrate command to qemu times out
on the job condition and the perform phase fails. So this situation is only
partly bad.
- Finally, if we manage to get out of the perform phase, we call Finish API on
dest, which hangs because we never see a response to it.
The good thing is that it shouldn't hang forever, only until the TCP
connections time out. It took about 15 minutes for me to return to a normal
state on the src host. The dest host needs more time to realize the connection
was lost and that the domain should be automatically destroyed there.
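The ~15 minute figure is consistent with Linux's TCP retransmission backoff. A back-of-the-envelope sketch (the numbers here are assumptions about typical Linux defaults, not values taken from this report: an initial retransmission timeout of 0.2 s, exponential doubling capped at 120 s, and tcp_retries2 = 15 retransmissions):

```shell
#!/bin/sh
# Rough model of how long a Linux TCP connection keeps retransmitting
# before giving up. All constants below are assumed defaults.
rto=2        # initial retransmission timeout in deciseconds (0.2 s)
total=0      # accumulated wait time in deciseconds
i=0
while [ "$i" -le 15 ]; do        # tcp_retries2 = 15 retransmissions
  if [ "$rto" -gt 1200 ]; then   # backoff is capped at 120 s
    rto=1200
  fi
  total=$((total + rto))
  rto=$((rto * 2))               # exponential backoff
  i=$((i + 1))
done
echo "$((total / 10)) seconds"   # roughly 15 minutes
```

This yields a bit over 900 seconds, i.e. roughly the 15 minutes observed before the src host recovered.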
So I think we need three things:
- make domjobabort smarter so that it can ideally cancel migration in any
phase (I opened a new bug 728904 for this)
- use fd: protocol for migration instead of relying on qemu to open a tcp
connection to destination qemu and send data through it (tracked by bug
- introduce in-protocol keep-alive messages so that dead peer (host or just
libvirtd) can be detected faster
Let's leave this BZ to track the last of the three things.
(In reply to comment #4)
> - use fd: protocol for migration instead of relying on qemu to open a tcp
> connection to destination qemu and send data through it (tracked by bug
This was based on my plans to make libvirtd a proxy between source and destination qemu, in which case we could cancel the migration no matter what qemu does. The idea turned out not to be as good as I thought. Instead, the migrate_cancel command should be fixed in qemu to cancel the migration without blocking.
Bug 669581 tracks the migrate_cancel bug in qemu.
Series adding keepalive messages into libvirt RPC protocol was sent upstream for review: https://www.redhat.com/archives/libvir-list/2011-September/msg00938.html
Support for keepalive messages is now upstream and will be included in 0.9.8 release. The series ends with commit v0.9.7-157-g2c4cdb7.
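For reference, the keepalive behavior is tunable in the server configuration; the parameter names below exist in libvirtd.conf, and the values shown are simply the shipped defaults (a sketch for illustration, not taken from this report):

```
# /etc/libvirt/libvirtd.conf
keepalive_interval = 5   # seconds between keepalive probes sent to a peer
keepalive_count = 5      # unanswered probes tolerated before the connection is closed
```

With these defaults, a dead peer (host or just libvirtd) is detected within roughly half a minute instead of waiting for TCP timeouts.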
1. Do migration
#virsh migrate --live --p2p kvm-rhel6u2-x86_64-new qemu+tls://10.66.83.197/system
At the same time, on another console do
#iptables -A OUTPUT -d 10.66.83.197 -j REJECT
#virsh domjobabort kvm-rhel6u2-x86_64-new
The migration job does not return immediately, but waits for a while and then returns:
error: An error occurred, but the cause is unknown
# virsh list
 Id    Name                     State
 8     kvm-rhel6u2-x86_64-new   running
But on the target host, virsh list will hang forever. Is that OK?
Cannot reproduce the phenomenon from comment 14; everything works well, so verification passes.
Interesting... In case you hit it again in the future, please attach debug logs from both source and destination libvirt daemons. And always be sure you run virsh domjobabort after migration data starts flowing from source to destination. Running it earlier doesn't provide expected results (see bug 728904).
I reproduced the phenomenon from comment 14 again on
I will attach the debug log but it is a little big.
Created attachment 564347 [details]
source libvirtd log
Created attachment 564348 [details]
dest libvirtd log
Do I need to re-assign the bug?
I can still reproduce the phenomenon from comment 14 on
so I'm re-assigning this bug.
Oh, I see what it is. virsh domjobabort correctly cancels the migration and the source libvirtd tries to call the Finish API on the destination libvirtd to tell it about the abort. Since all packets to the destination libvirtd are discarded, the API waits for some time (~30 seconds) until the broken connection is detected, and then the source libvirtd correctly resumes the domain and reports the migration API as finished. So far, everything worked as expected. However, it seems the error message was eaten somewhere along the way and virsh has nothing useful to report.
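The ~30 second delay plausibly corresponds to libvirt's keepalive detection time. A quick sketch, assuming the default keepalive_interval=5 and keepalive_count=5 from libvirtd.conf and the documented worst-case formula interval * (count + 1):

```shell
#!/bin/sh
# Assumed libvirtd.conf defaults (not values taken from this report):
keepalive_interval=5   # seconds between keepalive probes
keepalive_count=5      # unanswered probes tolerated
# Worst-case time until a broken connection is detected:
detection=$((keepalive_interval * (keepalive_count + 1)))
echo "broken connection detected within ${detection} seconds"
```

That matches the roughly 30 second wait before the source libvirtd resumes the domain.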
Actually, the eaten error message is a minor issue which doesn't need to block this bug. I'm moving this bz back to ON_QA and I'll create a new bug requesting that the error message be fixed.
But another problem is that "virsh list" hangs on the target machine; do I need to file a new bug for it?
Oh, I somehow missed that... Let me check quickly.
OK, the part handled by this BZ seems to be working just fine. Destination libvirtd correctly detects the connection to the source is broken and starts to destroy the migrated domain. But libvirtd seems to hang while doing so. Any chance the destination host also lost connection to the shared storage used by the migrated domain? Perhaps because it used nfs from the source host?
(In reply to comment #26)
> OK, the part handled by this BZ seems to be working just fine. Destination
> libvirtd correctly detects the connection to the source is broken and starts to
> destroy the migrated domain. But libvirtd seems to hang while doing so. Any
> chance the destination host also lost connection to the shared storage used by
> the migrated domain? Perhaps because it used nfs from the source host?
Yes, you are right. The virsh list hang is because I used the NFS export from the source host. When using a third host as the NFS server, virsh list does not hang on the target machine. But is that a new problem?
Anyway, this bug can be verified since domjobabort really aborts the migration, so I'm changing the status to VERIFIED.
(In reply to comment #27)
> (In reply to comment #26)
> > OK, the part handled by this BZ seems to be working just fine. Destination
> > libvirtd correctly detects the connection to the source is broken and starts to
> > destroy the migrated domain. But libvirtd seems to hang while doing so. Any
> > chance the destination host also lost connection to the shared storage used by
> > the migrated domain? Perhaps because it used nfs from the source host?
> Yes, you are right. The virsh list hang is because I use the nfs on source
> host. With using the third host as nfs server, on target machine virsh list
> will not hang. But is that a new problem?
Hmm, AFAIK we had similar bugs in the past, e.g., bug 746666, which is supposed to be fixed already. I think the safest way is to file a new bug for the hang you see.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
When a destination host lost network connectivity while a domain was being
migrated to it, the migration process could not be canceled. While the
cancellation request itself succeeded, the following steps needed to make both
sides aware of the cancellation still blocked, which prevented the migration
protocol from finishing. In Red Hat Enterprise Linux 6.3, libvirt implements an
internal keepalive protocol, which is able to detect broken connections or
blocked libvirt daemons (within 30 seconds with the default configuration). When
such a situation is detected during migration, libvirt automatically cancels it.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.