Description of problem:
The virsh process hangs for a long time during storage migration and eventually fails with a keepalive timeout error. This problem only happens when network performance is poor.

Version-Release number of selected component (if applicable):
libvirt-2.0.0-4.el7.x86_64
qemu-kvm-rhev-2.6.0-17.el7.x86_64
kernel-3.10.0-327.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start the guest.
2. Run the following command on the source host:
# virsh migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system
3. Check the image on the remote host; the disk size stops increasing after 7.3G:
# qemu-img info jeos-23-64.qcow2
image: jeos-23-64.qcow2
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 7.3G
cluster_size: 65536
Format specific information:
    compat: 0.10
    refcount bits: 16
4. The virsh process hangs on the source host.
5. Check the log on the target host; it shows a keepalive timeout error. See attachment.
6. After a very long time, the virsh client returns an error:
error: Disconnected from qemu:///system due to keepalive timeout
7. To confirm that the failure is caused by the keepalive timeout, retest with keepalive disabled; the migration then succeeds.
Change: 'keepalive_interval = -1' in libvirtd.conf on both hosts.
Execute:
# virsh -K0 -k0 migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system
root@10.73.72.104's password:
Migration: [100 %]

Actual results:
See above.

Expected results:
Migration succeeds.

Additional info:
libvirtd log on target host:
2016-08-17 02:12:10.398+0000: 41879: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x7f1a659f8000 client=0x7f1a659f8990 countToDeath=0 idle=30
2016-08-17 02:12:10.398+0000: 41879: debug : virKeepAliveTimerInternal:141 : No response from client 0x7f1a659f8990 after 5 keepalive messages in 30 seconds
2016-08-17 02:12:10.398+0000: 41879: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout

Qemu log on target host:
2016-08-17 02:11:03.385+0000: 42027: debug : virExec:701 : Setting child security label to system_u:system_r:svirt_t:s0:c154,c841
2016-08-17 02:11:03.385+0000: 42027: debug : virExec:731 : Setting child uid:gid to 107:107 with caps 0
2016-08-17 02:11:03.386+0000: 42027: debug : virCommandHandshakeChild:433 : Notifying parent for handshake start on 30
2016-08-17 02:11:03.386+0000: 42027: debug : virCommandHandshakeChild:441 : Waiting on parent for handshake complete on 31
2016-08-17 02:11:03.396+0000: 42027: debug : virFileClose:102 : Closed fd 30
2016-08-17 02:11:03.396+0000: 42027: debug : virFileClose:102 : Closed fd 31
2016-08-17 02:11:03.396+0000: 42027: debug : virCommandHandshakeChild:461 : Handshake with parent is done
char device redirected to /dev/pts/1 (label charserial0)
2016-08-17T02:12:10.401583Z qemu-kvm: terminating on signal 15 from pid 41879
2016-08-17 02:14:01.372+0000: shutting down
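For completeness, a minimal sketch of the workaround tested in step 7 (a way to confirm the keepalive timeout, not a recommended fix). The config path /etc/libvirt/libvirtd.conf and the restart step are assumptions based on a stock RHEL 7 install; the domain name and target URI are the ones from this report.

# On both hosts: disable server-side keepalive probes
echo 'keepalive_interval = -1' >> /etc/libvirt/libvirtd.conf
systemctl restart libvirtd

# On the source host: disable client-side keepalive as well
# (-k0 and -K0 disable the client keepalive interval and count, as in step 7)
virsh -K0 -k0 migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system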
Another case also hits the keepalive timeout issue. Guest with the setting:

<memoryBacking>
  <hugepages/>
</memoryBacking>

virsh migrate avocado-vt-vm1 --live --verbose --unsafe qemu+ssh://10.19.112.37:22/system
...
Migration: [ 0 %]Migration: [ 13 %]Migration: [ 61 %]Migration: [ 71 %] ... (the percentage keeps fluctuating between roughly 59 % and 98 % without converging) ... Migration: [ 91 %]Migration: [ 90 %]
error: operation failed: migration job: unexpectedly failed

But here the virsh client does not hang. See log.
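Not part of the original report, but for anyone reproducing this: a simple way to watch the migration job while the percentage bounces around, assuming the same domain name (avocado-vt-vm1). A "Data remaining" figure that grows again after shrinking matches the non-converging progress shown above.

# on the source host, in a second terminal
while true; do virsh domjobinfo avocado-vt-vm1; sleep 2; done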
Created attachment 1195116: vm log on source host
Created attachment 1195117: vm log on target host
Created attachment 1195118: libvirtd log on source host
I ran into a similar problem when I added memory stress in the guest and did a live migration. I checked the debug log of the libvirt API on the source and the debug log of libvirtd on the target, and found that the virsh client received a keepalive request from the target libvirtd every 5 seconds and sent a keepalive response, but libvirtd on the target received the first keepalive response with about 6 seconds of latency and never received the following responses, timing out after 30 seconds. I think this is because there was too much network load on the source host, which caused the keepalive response packets to be sent with a long latency (longer than 30 seconds), and thus the keepalive on the target libvirtd timed out.
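A note on where the 30-second window comes from (my reading of the stock libvirtd.conf comments, not new information from these logs): keepalive_interval is how often libvirtd probes an idle connection and keepalive_count is how many unanswered probes it tolerates, so the connection is dropped roughly keepalive_interval * (keepalive_count + 1) seconds after the last message from the peer, i.e. 5 * (5 + 1) = 30 s with the defaults, which matches the idle=30 in the target log. On a slow or congested link one could raise these values instead of disabling keepalive entirely:

# defaults as shipped in /etc/libvirt/libvirtd.conf
#keepalive_interval = 5
#keepalive_count = 5

# a more tolerant (hypothetical) setting for a congested network
keepalive_interval = 5
keepalive_count = 30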
I think I've hit the same thing again myself: a fairly lightly stressed RHEL 6 guest, and I've got a script running on virtlab413 driving migration back and forth between virtlab402 and virtlab403. It happily managed more than 400 migrations with no guest stress, but after I added a simple 'find /' loop in the guest I'm seeing:

error: Disconnected from qemu+ssh://root@vl402/system due to keepalive timeout

yet these machines are close to each other (same switch?) and the hosts are otherwise idle.
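The migration-loop script wasn't attached, so here is a hypothetical sketch of the kind of ping-pong loop described above; the domain name is a placeholder and the URIs are guesses based on the host names in the comment.

#!/bin/bash
DOM=rhel6-guest                          # placeholder domain name
A=qemu+ssh://root@virtlab402/system      # assumed connection URIs
B=qemu+ssh://root@virtlab403/system

for i in $(seq 1 500); do
    echo "round $i: virtlab402 -> virtlab403"
    virsh -c "$A" migrate "$DOM" --live --verbose "$B" || exit 1
    echo "round $i: virtlab403 -> virtlab402"
    virsh -c "$B" migrate "$DOM" --live --verbose "$A" || exit 1
done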
We hit this same issue again in bug 1256213 and were finally able to reproduce it reliably enough to investigate what's going on here. The keepalive packets sent by the source libvirtd get stuck in the kernel somewhere between ssh and the network interface on the source. The kernel eventually sends them out to the network, but by then it may already be too late. I'm closing this as a duplicate of 1256213, where we have a lot more information for further investigation.

*** This bug has been marked as a duplicate of bug 1256213 ***
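Not from the analysis in the other bug, just one way the symptom described above could be observed on the source host while a migration is running: watch the send queue of the ssh connection that carries the tunnel. A Send-Q that stays large for tens of seconds means the keepalive responses are queued in the kernel behind the migration stream.

# on the source host, while the migration is running
watch -n 1 "ss -tnpo state established '( dport = :22 or sport = :22 )'"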
Oops, I pasted a wrong bug number three times :-( The right one is bug 1614182.

*** This bug has been marked as a duplicate of bug 1614182 ***