Bug 807907 - Tunnelled migration sometimes reports an error during scalability testing
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.3
Hardware: x86_64 Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Jiri Denemark
Virtualization Bugs
Depends On:
Blocks: 807996
Reported: 2012-03-29 02:10 EDT by weizhang
Modified: 2013-02-21 02:09 EST (History)
10 users

See Also:
Fixed In Version: libvirt-0.9.13-3.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 02:09:21 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
libvirtd.log (8.32 KB, text/plain)
2012-03-29 02:10 EDT, weizhang
the pre 30s log on source (1.27 MB, text/plain)
2012-04-09 04:44 EDT, weizhang
log on target in the same period (1.23 MB, text/plain)
2012-04-09 04:44 EDT, weizhang
source libvirtd log when test on libvirt-0.9.13-3 (620.50 KB, application/octet-stream)
2012-07-27 04:35 EDT, weizhang
target libvirtd log when test on libvirt-0.9.13-3 (1.65 MB, application/octet-stream)
2012-07-27 04:37 EDT, weizhang

Description weizhang 2012-03-29 02:10:28 EDT
Created attachment 573541 [details]
libvirtd.log

Description of problem:

I ran a scalability test migrating 514 running guests (500 rhel6u3 guests plus 14 other kinds of rhel5/rhel4/rhel3/windows guests) one by one over tcp with --p2p --tunnelled, and toward the end it sometimes reports an error:
# time virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
Migration: [ 43 %]error: internal error received hangup / error event on socket

real	0m32.483s
user	0m0.039s
sys	0m0.031s

But when the migration is run again, it may succeed:
# time virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
Migration: [100 %]

real	1m0.520s
user	0m0.067s
sys	0m0.042s

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.262.el6.x86_64
kernel-2.6.32-250.el6.x86_64
libvirt-0.9.10-9.el6.x86_64


How reproducible:
50%

Steps to Reproduce:
1. Start 512 guests (500 rhel6u3 guests plus 14 other kinds of rhel5/rhel4/rhel3/windows guests) and migrate them one by one with
# virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
(the destination address can also be an IPv6 address)
Actual results:
The migration sometimes reports an error:
error: internal error received hangup / error event on socket

Expected results:
no error

Additional info:
Comment 1 Jiri Denemark 2012-04-03 08:22:43 EDT
Is the destination host heavily overloaded when this happens? According to the logs, libvirtd on the destination host hasn't sent any response to the source daemon within the last 30 seconds. The source assumed the connection broke and aborted the migration. In any case, could you provide a longer log file that contains entries starting at least 30 seconds before the ones you attached? And please also attach libvirtd logs from the destination host covering the same period.
Comment 2 weizhang 2012-04-09 04:44:03 EDT
Created attachment 576136 [details]
the pre 30s log on source
Comment 3 weizhang 2012-04-09 04:44:49 EDT
Created attachment 576137 [details]
log on target in the same period
Comment 4 Jiri Denemark 2012-04-25 10:16:33 EDT
This is caused by a bug in the keepalive protocol, which closes the connection to the destination in the middle of a tunnelled migration. There are a few other bugs that make the error message quite unhelpful.

Patches that fix all these issues were sent upstream:
https://www.redhat.com/archives/libvir-list/2012-April/msg01341.html
Comment 5 Jiri Denemark 2012-04-27 10:54:17 EDT
Patches sent for internal review: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-April/msg00855.html
Comment 8 weizhang 2012-05-04 03:43:23 EDT
Verification passed on
libvirt-0.9.10-16.el6.x86_64
kernel-2.6.32-269.el6.x86_64
qemu-kvm-0.12.1.2-2.285.el6.x86_64


Easier steps:
1. Change the /etc/libvirt/qemu.conf with
keepalive_interval = 1
keepalive_count = 1

2. Start 20 guests of different types and migrate them one by one with
# for i in {1..20}; do virsh migrate-setspeed guest-$i 1; virsh migrate --live --tunnelled --p2p guest-$i qemu+tls://10.66.83.39/system; done

Errors like the following are no longer shown:
error: internal error client socket is closed
or
error: internal error received hangup / error event on socket
Comment 9 Daniel Veillard 2012-05-23 09:53:07 EDT
We reverted that patch due to https://bugzilla.redhat.com/show_bug.cgi?id=821468

Daniel
Comment 11 Jiri Denemark 2012-06-13 10:11:04 EDT
This issue and several other issues in the client RPC code should be properly fixed by commits v0.9.12-189-g71689f9 through v0.9.12-201-g5d49060.
Comment 12 weizhang 2012-07-25 04:33:48 EDT
Test on 
libvirt-0.9.13-3.el6.x86_64
qemu-kvm-0.12.1.2-2.298.el6.x86_64
kernel-2.6.32-289.el6.x86_64

The error can still be found:
internal error received hangup / error event on socket

with the following steps:

1. Prepare 10 guests and also occupy CPU resources on the destination host with "cp" loops; configure /etc/libvirt/libvirtd.conf with
keepalive_interval = 1
keepalive_count = 1

2. Do migration with
# for i in {1..10}; do virsh migrate --live --p2p --tunnelled guest$i qemu+tcp://{target ip}/system --verbose --unsafe; done

Some of the migrations may fail with an error like
error: internal error received hangup / error event on socket

So I am re-assigning the bug.
Comment 13 Jiri Denemark 2012-07-25 09:55:44 EDT
This is not a correct test scenario. Setting keepalive_interval = 1 and keepalive_count = 1 means that every second the destination host is supposed to answer a keepalive request, and the source host must process the response in less than a second. If any of these requests is not handled in time, the connection is closed. This can easily happen, especially if the hosts or the network are loaded.
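As an illustration (not part of the original comment), a less aggressive configuration along the lines suggested later in this bug could be a /etc/libvirt/libvirtd.conf fragment like the following; the timeout arithmetic in the comments is my reading of the explanation above, not a quote from the libvirt documentation:

```
# Probe the connection with a keepalive request every second...
keepalive_interval = 1
# ...but only declare the peer dead after 30 consecutive unanswered
# requests, i.e. roughly 30 seconds of silence rather than 1 second.
keepalive_count = 30
```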
Comment 14 weizhang 2012-07-25 22:33:33 EDT
(In reply to comment #13)
> This is not a correct test scenario. Setting keepalive_interval = 1 and
> keepalive_count = 1 means that every second the destination host is supposed
> to answer a keepalive request and the source host must process the response
> in less than a second. If this doesn't happen for any of these requests in
> time, the connection is closed. This can easily happen especially if the
> hosts or network is loaded.

Then how can I verify this bug with fewer guests? Should I change /etc/libvirt/qemu.conf with
keepalive_interval = 1
keepalive_count = 1
as in comment 8, rather than /etc/libvirt/libvirtd.conf? Or is there some other method?
Comment 15 Jiri Denemark 2012-07-26 06:21:39 EDT
Setting keepalive_interval and keepalive_count to 1 is almost always a bad idea. This bug was caused mainly by lost keepalive messages (they were generated and queued, but the RPC layer just discarded them instead of sending them to the remote party). Thus the number of keepalive requests generated on one side should match the number of requests received on the other side, and each generated request should eventually result in a received keepalive response. Prior to the fixes, the counts would differ. You can make dropping keepalive messages more likely by artificially reducing bandwidth on the network connection used for tunnelled migration, setting keepalive_interval = 1 but increasing keepalive_count to, e.g., 30. To check the numbers, you can grep the debug logs or use systemtap, which would work even better (if you know how to use it; I don't).
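To illustrate the grep-based counting, here is a minimal sketch. The RPC_KEEPALIVE_SEND/RPC_KEEPALIVE_RECEIVED markers and the proc=1 (request) / proc=2 (response) fields mirror the debug logs quoted in comment 16; the sample log lines below are fabricated for the illustration.

```shell
# Sketch: compare keepalive requests logged as sent by one daemon with
# requests logged as received by the other. In a real test you would
# run each grep against that host's /var/log/libvirt/libvirtd.log.
log=$(mktemp)
cat > "$log" <<'EOF'
debug : RPC_KEEPALIVE_SEND ka=0x7f8b34002de0 proc=1
debug : RPC_KEEPALIVE_SEND ka=0x7f8b34002de0 proc=2
debug : RPC_KEEPALIVE_RECEIVED ka=0x7f8b34002de0 proc=1
EOF
# proc=1 selects keepalive requests, ignoring proc=2 responses.
sent=$(grep RPC_KEEPALIVE_SEND "$log" | grep -c 'proc=1')
recv=$(grep RPC_KEEPALIVE_RECEIVED "$log" | grep -c 'proc=1')
echo "keepalive requests: sent=$sent received=$recv"
rm -f "$log"
```

If the two counts differ between source and target, keepalive messages were dropped, which is the symptom this bug's fixes address.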
Comment 16 weizhang 2012-07-27 04:32:09 EDT
I tried with the source /etc/libvirt/qemu.conf set to
keepalive_interval = 1
keepalive_count = 30

and the target /etc/libvirt/libvirtd.conf set to
keepalive_interval = 1
keepalive_count = 30

Put a heavy load on the target with several "cp" loops and 15+ guests running; start 1 guest on the source host, also with one "cp" load, then do the migration on the source with

# virsh migrate-setspeed guest1 1; time virsh migrate --live --p2p --tunnelled guest1 qemu+tcp://10.66.4.196/system --unsafe --verbose
Migration: [ 73 %]error: internal error client socket is closed

real	7m15.656s
user	0m0.022s
sys	0m0.014s

I ran the migration twice; one run succeeded and one failed with the above message.

On the 1st migration, it succeeded in 4 minutes without any "no response from client after 30 keepalive messages" message, and the count I get on the source with
# cat /var/log/libvirt/libvirtd.log |grep RPC_KEEPALIVE_SEND |grep "proc=1" | grep "ka=0x7f8b34002de0" |wc -l

is the same as the one on the target:
# cat /var/log/libvirt/libvirtd.log |grep RPC_KEEPALIVE_RECEIVED |grep "proc=1" |wc -l

(proc=1 means request, proc=2 means response)

The count is 216 on both sides.

The second migration failed with
virKeepAliveTimerInternal:143 : No response from client 0x7f676401ba70 after 30 keepalive messages in 76 seconds
in libvirtd.log, and the message counts differ between source and target (the source sent 318, the target received 288).

The logs for the second migration will be attached later.

version
libvirt-0.9.13-3.el6.x86_64
qemu-kvm-0.12.1.2-2.298.el6.x86_64
kernel-2.6.32-289.el6.x86_64
Comment 17 weizhang 2012-07-27 04:35:25 EDT
Created attachment 600718 [details]
source libvirtd log when test on libvirt-0.9.13-3

It is 27 MB, so it is attached as a bz2 package.
Comment 18 weizhang 2012-07-27 04:37:01 EDT
Created attachment 600719 [details]
target libvirtd log when test on libvirt-0.9.13-3

It is 71 MB, so it is attached as a bz2 package.
Comment 20 Jiri Denemark 2012-12-04 09:02:48 EST
It's hard to give any advice here. Putting additional load on the hosts makes long delays more likely to happen, and thus increasing the timeouts may be required. However, the bug should be reproducible even with a single migration and no additional load. Any tunnelled migration taking longer than 30 seconds (the configured timeout) would fail. You can slow the migration down by artificially limiting network bandwidth, if needed.
Comment 21 weizhang 2012-12-06 06:48:31 EST
I tested by limiting the bandwidth with
# tc qdisc add dev eth0 root tbf rate 500kbit burst 1024kbit latency 50ms
on the source and then doing the migration.

in version libvirt-0.9.10-9
# virsh migrate --live --p2p --tunnelled rhel6u3 qemu+ssh://10.66.84.16/system --verbose
Migration: [  0 %]error: internal error received hangup / error event on socket

in version libvirt-0.9.13-3
# virsh migrate --live --p2p --tunnelled rhel6u3 qemu+ssh://10.66.84.16/system --verbose
Migration: [  0 %]error: operation failed: Lost connection to destination host


Does this mean the bug is fixed?
Comment 22 Jiri Denemark 2012-12-10 06:03:58 EST
On which machine did you run the "tc ..." command? On which machine did you run virsh? What are the keepalive settings on all machines involved in the process?
Comment 23 weizhang 2012-12-10 06:45:27 EST
(In reply to comment #22)
> On which machine did you run the "tc ..." command? On which machine did you
> run virsh? What is the keepalive settings on all machines involved in the
> process?

I ran
# tc qdisc add dev eth0 root tbf rate 500kbit burst 1024kbit latency 50ms
on the migration source machine and then ran virsh migrate xxx; there are no special keepalive settings on either machine.
Comment 24 weizhang 2012-12-11 02:49:55 EST
Thanks to Jiri for his help.
Verification passed on libvirt-0.10.2-11.el6.x86_64

I limited the network bandwidth with
# tc qdisc add dev eth0 root tbf rate 50Mbit latency 50ms burst 50000 mpu 64 mtu 150000


# netperf -H 10.66.84.16 -f M
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.84.16 (10.66.84.16) port 0 AF_INET : histogram : interval
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.10       5.79 


Do migration with
#  time virsh migrate --live --p2p --tunnelled rhel6u3 qemu+tcp://10.66.84.16/system --verbose
Migration: [100 %]

real	1m6.611s
user	0m0.025s
sys	0m0.018s


The bug can be reproduced on libvirt-0.9.10-9.el6.x86_64:
# time virsh migrate --live --p2p --tunnelled rhel6u3 qemu+tcp://10.66.84.16/system --verbose
Migration: [ 78 %]error: internal error client socket is closed


real	0m32.085s
user	0m0.022s
sys	0m0.014s
Comment 25 errata-xmlrpc 2013-02-21 02:09:21 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
