Description of problem:
The virsh process hangs for a long time during storage migration and eventually fails with a keepalive timeout error. This problem only happens when network performance is poor.

Version-Release number of selected component (if applicable):
libvirt-2.0.0-4.el7.x86_64
qemu-kvm-rhev-2.6.0-17.el7.x86_64
kernel-3.10.0-327.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start the guest.
2. Run the following command on the source host:
# virsh migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system
3. Check the image on the remote host; the disk size stops increasing after 7.3G:
# qemu-img info jeos-23-64.qcow2
image: jeos-23-64.qcow2
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 7.3G
cluster_size: 65536
Format specific information:
    compat: 0.10
    refcount bits: 16
4. The virsh process hangs on the source host.
5. Check the log on the target host; it shows a keepalive timeout error. See attachment.
6. After a very long time, the virsh client returns an error:
error: Disconnected from qemu:///system due to keepalive timeout
7. To confirm that the failure is caused by the keepalive timeout, retest with keepalive disabled; the migration then succeeds.
Change: 'keepalive_interval = -1' in libvirtd.conf on both hosts.
Execute:
# virsh -K0 -k0 migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system
root@10.73.72.104's password:
Migration: [100 %]

Actual results:
See above.

Expected results:
Migration succeeds.

Additional info:
libvirtd log on target host:
2016-08-17 02:12:10.398+0000: 41879: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x7f1a659f8000 client=0x7f1a659f8990 countToDeath=0 idle=30
2016-08-17 02:12:10.398+0000: 41879: debug : virKeepAliveTimerInternal:141 : No response from client 0x7f1a659f8990 after 5 keepalive messages in 30 seconds
2016-08-17 02:12:10.398+0000: 41879: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout

Qemu log on target host:
2016-08-17 02:11:03.385+0000: 42027: debug : virExec:701 : Setting child security label to system_u:system_r:svirt_t:s0:c154,c841
2016-08-17 02:11:03.385+0000: 42027: debug : virExec:731 : Setting child uid:gid to 107:107 with caps 0
2016-08-17 02:11:03.386+0000: 42027: debug : virCommandHandshakeChild:433 : Notifying parent for handshake start on 30
2016-08-17 02:11:03.386+0000: 42027: debug : virCommandHandshakeChild:441 : Waiting on parent for handshake complete on 31
2016-08-17 02:11:03.396+0000: 42027: debug : virFileClose:102 : Closed fd 30
2016-08-17 02:11:03.396+0000: 42027: debug : virFileClose:102 : Closed fd 31
2016-08-17 02:11:03.396+0000: 42027: debug : virCommandHandshakeChild:461 : Handshake with parent is done
char device redirected to /dev/pts/1 (label charserial0)
2016-08-17T02:12:10.401583Z qemu-kvm: terminating on signal 15 from pid 41879
2016-08-17 02:14:01.372+0000: shutting down
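For completeness, a minimal sketch of the workaround tested in step 7 (a way to confirm the keepalive timeout, not a recommended fix). The config path /etc/libvirt/libvirtd.conf and the restart step are assumptions based on a stock RHEL 7 install; the domain name and target URI are the ones from this report.

# On both hosts: disable server-side keepalive probes
echo 'keepalive_interval = -1' >> /etc/libvirt/libvirtd.conf
systemctl restart libvirtd

# On the source host: disable client-side keepalive as well
# (-k0 and -K0 disable the client keepalive interval and count, as in step 7)
virsh -K0 -k0 migrate avocado-vt-vm1 --live --verbose --copy-storage-all qemu+ssh://10.73.72.104:22/system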
Another case also hits the keepalive timeout issue. Guest with the setting:

<memoryBacking>
  <hugepages/>
</memoryBacking>

virsh migrate avocado-vt-vm1 --live --verbose --unsafe qemu+ssh://10.19.112.37:22/system
...
Migration: [ 0 %]Migration: [ 13 %]Migration: [ 61 %]Migration: [ 71 %] ... (the percentage keeps fluctuating between roughly 59 % and 98 % without converging) ... Migration: [ 91 %]Migration: [ 90 %]
error: operation failed: migration job: unexpectedly failed

But here the virsh client does not hang. See log.
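Not part of the original report, but for anyone reproducing this: a simple way to watch the migration job while the percentage bounces around, assuming the same domain name (avocado-vt-vm1). A "Data remaining" figure that grows again after shrinking matches the non-converging progress shown above.

# on the source host, in a second terminal
while true; do virsh domjobinfo avocado-vt-vm1; sleep 2; done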
Created attachment 1195116: vm log on source host
Created attachment 1195117: vm log on target host
Created attachment 1195118: libvirtd log on source host
I ran into a similar problem when I added memory stress in the guest and did a live migration. I checked the debug log of the libvirt API on the source and the debug log of libvirtd on the target, and found that the virsh client received a keepalive request from the target libvirtd every 5 seconds and sent a keepalive response, but libvirtd on the target received the first keepalive response with about 6 seconds of latency and never received the following responses, timing out after 30 seconds. I think this is because there was too much network load on the source host, which caused the keepalive response packets to be sent with a long latency (longer than 30 seconds), and thus the keepalive on the target libvirtd timed out.
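A note on where the 30-second window comes from (my reading of the stock libvirtd.conf comments, not new information from these logs): keepalive_interval is how often libvirtd probes an idle connection and keepalive_count is how many unanswered probes it tolerates, so the connection is dropped roughly keepalive_interval * (keepalive_count + 1) seconds after the last message from the peer, i.e. 5 * (5 + 1) = 30 s with the defaults, which matches the idle=30 in the target log. On a slow or congested link one could raise these values instead of disabling keepalive entirely:

# defaults as shipped in /etc/libvirt/libvirtd.conf
#keepalive_interval = 5
#keepalive_count = 5

# a more tolerant (hypothetical) setting for a congested network
keepalive_interval = 5
keepalive_count = 30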
I think I've hit the same thing again myself: a fairly lightly stressed RHEL 6 guest, and I've got a script running on virtlab413 driving migration back and forth between virtlab402 and virtlab403. It happily managed more than 400 migrations with no guest stress, but after I added a simple 'find /' loop in the guest I'm seeing:

error: Disconnected from qemu+ssh://root@vl402/system due to keepalive timeout

yet these machines are close to each other (same switch?) and the hosts are otherwise idle.
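The migration-loop script wasn't attached, so here is a hypothetical sketch of the kind of ping-pong loop described above; the domain name is a placeholder and the URIs are guesses based on the host names in the comment.

#!/bin/bash
DOM=rhel6-guest                          # placeholder domain name
A=qemu+ssh://root@virtlab402/system      # assumed connection URIs
B=qemu+ssh://root@virtlab403/system

for i in $(seq 1 500); do
    echo "round $i: virtlab402 -> virtlab403"
    virsh -c "$A" migrate "$DOM" --live --verbose "$B" || exit 1
    echo "round $i: virtlab403 -> virtlab402"
    virsh -c "$B" migrate "$DOM" --live --verbose "$A" || exit 1
done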
We hit this same issue again in bug 1256213 and were finally able to reproduce it reliably enough to investigate what's going on here. The keepalive packets sent by the source libvirtd get stuck in the kernel somewhere between ssh and the network interface on the source. The kernel eventually sends them out to the network, but by then it may already be too late. I'm closing this as a duplicate of 1256213, where we have a lot more information for further investigation.

*** This bug has been marked as a duplicate of bug 1256213 ***
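Not from the analysis in the other bug, just one way the symptom described above could be observed on the source host while a migration is running: watch the send queue of the ssh connection that carries the tunnel. A Send-Q that stays large for tens of seconds means the keepalive responses are queued in the kernel behind the migration stream.

# on the source host, while the migration is running
watch -n 1 "ss -tnpo state established '( dport = :22 or sport = :22 )'"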
Oops, I pasted a wrong bug number three times :-( The right one is bug 1614182.

*** This bug has been marked as a duplicate of bug 1614182 ***