Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
Description of problem:
The virsh process hangs during storage migration for a long time and eventually returns a keepalive timeout error. This problem only occurs when network performance is poor.
Version-Release number of selected component (if applicable):
libvirt-2.0.0-4.el7.x86_64
qemu-kvm-rhev-2.6.0-17.el7.x86_64
kernel-3.10.0-327.el7.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Start the guest
2. Run command on source host:
# virsh migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system
3. Check the image on the remote host; the disk size stops increasing at 7.3G.
# qemu-img info jeos-23-64.qcow2
image: jeos-23-64.qcow2
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 7.3G
cluster_size: 65536
Format specific information:
compat: 0.10
refcount bits: 16
4. The virsh process hangs on the source host.
5. Check the log on the target host; it shows a timeout error. See attachment.
6. After a very long time, the virsh client returns an error:
error: Disconnected from qemu:///system due to keepalive timeout
7. To confirm that the hang is caused by the keepalive timeout, retest with keepalive disabled; the migration then succeeds.
Change: set 'keepalive_interval = -1' in libvirtd.conf on both hosts.
Execute: virsh -K0 -k0 migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system
root@10.73.72.104's password:
Migration: [100 %]
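For reference, the server-side keepalive settings used in the workaround above live in libvirtd.conf (typically /etc/libvirt/libvirtd.conf); a sketch of the change:

```
# /etc/libvirt/libvirtd.conf (on both hosts)
# A value of -1 stops libvirtd from sending keepalive requests to its
# clients; keepalive_count is then irrelevant.
keepalive_interval = -1
```

The virsh flags shown above disable the client side: -k sets the client keepalive interval (0 turns it off) and -K sets the keepalive count.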
Actual results:
See above.
Expected results:
Migration succeeds.
Additional info:
libvirtd log on target host:
2016-08-17 02:12:10.398+0000: 41879: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x7f1a659f8000 client=0x7f1a659f8990 countToDeath=0 idle=30
2016-08-17 02:12:10.398+0000: 41879: debug : virKeepAliveTimerInternal:141 : No response from client 0x7f1a659f8990 after 5 keepalive messages in 30 seconds
2016-08-17 02:12:10.398+0000: 41879: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout
Qemu log on target host:
2016-08-17 02:11:03.385+0000: 42027: debug : virExec:701 : Setting child security label to system_u:system_r:svirt_t:s0:c154,c841
2016-08-17 02:11:03.385+0000: 42027: debug : virExec:731 : Setting child uid:gid to 107:107 with caps 0
2016-08-17 02:11:03.386+0000: 42027: debug : virCommandHandshakeChild:433 : Notifying parent for handshake start on 30
2016-08-17 02:11:03.386+0000: 42027: debug : virCommandHandshakeChild:441 : Waiting on parent for handshake complete on 31
2016-08-17 02:11:03.396+0000: 42027: debug : virFileClose:102 : Closed fd 30
2016-08-17 02:11:03.396+0000: 42027: debug : virFileClose:102 : Closed fd 31
2016-08-17 02:11:03.396+0000: 42027: debug : virCommandHandshakeChild:461 : Handshake with parent is done
char device redirected to /dev/pts/1 (label charserial0)
2016-08-17T02:12:10.401583Z qemu-kvm: terminating on signal 15 from pid 41879
2016-08-17 02:14:01.372+0000: shutting down
I ran into a similar problem when I added memory stress in the guest and did a live migration. I checked the debug log of the libvirt API on the source and the debug log of libvirtd on the target, and found that the virsh client received a keepalive request from the target libvirtd every 5 s and sent a keepalive response, but libvirtd on the target received the first keepalive response with about 6 s of latency, never received the following responses, and timed out after 30 s.
I think this may be because there was too much network load on the source host, so the keepalive response packets were sent with a long latency (longer than 30 s), and the keepalive timer of the target libvirtd therefore timed out.
Comment 7, Dr. David Alan Gilbert, 2018-03-16 20:13:38 UTC
I think I've hit the same thing again myself: a fairly lightly stressed RHEL 6 guest, with a script running on virtlab413 driving migration between virtlab402 and virtlab403. It happily managed more than 400 migrations with no guest stress, but after I added a simple find / loop in the guest I'm seeing:
error: Disconnected from qemu+ssh://root@vl402/system due to keepalive timeout
yet these machines are close to each other (same switch?) and the hosts are otherwise idle.
We hit this same issue again in bug 1256213 and were finally able to
reproduce it reliably enough to investigate what's going on here. The
keepalive packets sent by the source libvirtd are stuck in the kernel
somewhere between ssh and the network interface on the source. The kernel
eventually sends them to the network, but by then it may already be too
late. I'm closing this as a duplicate of 1256213, where we have a lot more
information for further investigation.
*** This bug has been marked as a duplicate of bug 1256213 ***