Bug 807907 - Tunnelled migration sometimes reports an error during scalability testing
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.3
Hardware: x86_64 Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Jiri Denemark
Virtualization Bugs
Depends On:
Blocks: 807996
Reported: 2012-03-29 02:10 EDT by weizhang
Modified: 2013-02-21 02:09 EST (History)
10 users

See Also:
Fixed In Version: libvirt-0.9.13-3.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 02:09:21 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
libvirtd.log (8.32 KB, text/plain)
2012-03-29 02:10 EDT, weizhang
the pre 30s log on source (1.27 MB, text/plain)
2012-04-09 04:44 EDT, weizhang
log on target in the same period (1.23 MB, text/plain)
2012-04-09 04:44 EDT, weizhang
source libvirtd log when test on libvirt-0.9.13-3 (620.50 KB, application/octet-stream)
2012-07-27 04:35 EDT, weizhang
target libvirtd log when test on libvirt-0.9.13-3 (1.65 MB, application/octet-stream)
2012-07-27 04:37 EDT, weizhang

Description weizhang 2012-03-29 02:10:28 EDT
Created attachment 573541 [details]
libvirtd.log

Description of problem:

I ran a scalability test migrating 514 running guests (500 rhel6u3 guests plus 14 other kinds of rhel5/rhel4/rhel3/windows guests) one by one over tcp with --p2p --tunnelled, and toward the end it sometimes reports an error:
# time virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
Migration: [ 43 %]error: internal error received hangup / error event on socket

real	0m32.483s
user	0m0.039s
sys	0m0.031s

But when the migration is run again, it may succeed:
# time virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
Migration: [100 %]

real	1m0.520s
user	0m0.067s
sys	0m0.042s

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.262.el6.x86_64
kernel-2.6.32-250.el6.x86_64
libvirt-0.9.10-9.el6.x86_64


How reproducible:
50%

Steps to Reproduce:
1. Start 512 guests (500 rhel6u3 guests plus 14 other kinds of rhel5/rhel4/rhel3/windows guests) and migrate them one by one with
# virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
(the destination address can also be an IPv6 address)
Actual results:
The migration sometimes reports an error:
error: internal error received hangup / error event on socket

Expected results:
no error

Additional info:
Comment 1 Jiri Denemark 2012-04-03 08:22:43 EDT
Is the destination host heavily overloaded when this happens? According to the logs, libvirtd on the destination host hasn't sent any response to the source daemon within the last 30 seconds. The source assumed the connection broke and aborted the migration. In any case, could you provide a longer log file that contains entries starting at least 30 seconds before the ones you attached? And please also attach libvirtd logs from the destination host covering the same period.
Comment 2 weizhang 2012-04-09 04:44:03 EDT
Created attachment 576136 [details]
the pre 30s log on source
Comment 3 weizhang 2012-04-09 04:44:49 EDT
Created attachment 576137 [details]
log on target in the same period
Comment 4 Jiri Denemark 2012-04-25 10:16:33 EDT
This is caused by a bug in the keepalive protocol, which closes the connection to the destination in the middle of a tunnelled migration. There are a few other bugs that make the error message quite unhelpful.

Patches that fix all these issues were sent upstream:
https://www.redhat.com/archives/libvir-list/2012-April/msg01341.html
Comment 5 Jiri Denemark 2012-04-27 10:54:17 EDT
Patches sent for internal review: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-April/msg00855.html
Comment 8 weizhang 2012-05-04 03:43:23 EDT
Verification passed on
libvirt-0.9.10-16.el6.x86_64
kernel-2.6.32-269.el6.x86_64
qemu-kvm-0.12.1.2-2.285.el6.x86_64


Easier steps:
1. Change the /etc/libvirt/qemu.conf with
keepalive_interval = 1
keepalive_count = 1

2. Start 20 guests of different types and migrate them one by one with
# for i in {1..20}; do virsh migrate-setspeed guest-$i 1; virsh migrate --live --tunnelled --p2p guest-$i qemu+tls://10.66.83.39/system; done

Errors like the following are no longer shown:
error: internal error client socket is closed
or
error: internal error received hangup / error event on socket
Comment 9 Daniel Veillard 2012-05-23 09:53:07 EDT
We reverted that patch due to https://bugzilla.redhat.com/show_bug.cgi?id=821468

Daniel
Comment 11 Jiri Denemark 2012-06-13 10:11:04 EDT
This issue and several other issues in the client RPC code should be properly fixed by commits v0.9.12-189-g71689f9 through v0.9.12-201-g5d49060.
Comment 12 weizhang 2012-07-25 04:33:48 EDT
Test on 
libvirt-0.9.13-3.el6.x86_64
qemu-kvm-0.12.1.2-2.298.el6.x86_64
kernel-2.6.32-289.el6.x86_64

The error can still be found:
internal error received hangup / error event on socket

with the following steps:

1. Prepare 10 guests and also occupy CPU resources on the destination host with "cp" loops; configure /etc/libvirt/libvirtd.conf with
keepalive_interval = 1
keepalive_count = 1

2. Do migration with
# for i in {1..10}; do virsh migrate --live --p2p --tunnelled guest$i qemu+tcp://{target ip}/system --verbose --unsafe; done

Some of the migrations may fail with an error like
error: internal error received hangup / error event on socket

So I am re-assigning the bug.
Comment 13 Jiri Denemark 2012-07-25 09:55:44 EDT
This is not a correct test scenario. Setting keepalive_interval = 1 and keepalive_count = 1 means that every second the destination host is supposed to answer a keepalive request, and the source host must process the response in less than a second. If any of these requests is not handled in time, the connection is closed. This can easily happen, especially if the hosts or the network are loaded.
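As an illustration (not part of the original comment), a less aggressive configuration along the lines suggested later in this bug could be a /etc/libvirt/libvirtd.conf fragment like the following; the timeout arithmetic in the comments is my reading of the explanation above, not a quote from the libvirt documentation:

```
# Probe the connection with a keepalive request every second...
keepalive_interval = 1
# ...but only declare the peer dead after 30 consecutive unanswered
# requests, i.e. roughly 30 seconds of silence rather than 1 second.
keepalive_count = 30
```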
Comment 14 weizhang 2012-07-25 22:33:33 EDT
(In reply to comment #13)
> This is not a correct test scenario. Setting keepalive_interval = 1 and
> keepalive_count = 1 means that every second the destination host is supposed
> to answer a keepalive request and the source host must process the response
> in less than a second. If this doesn't happen for any of these requests in
> time, the connection is closed. This can easily happen especially if the
> hosts or network is loaded.

Then how can I verify this bug with fewer guests? Should I change /etc/libvirt/qemu.conf with
keepalive_interval = 1
keepalive_count = 1
as in comment 8, rather than /etc/libvirt/libvirtd.conf? Or is there some other method?
Comment 15 Jiri Denemark 2012-07-26 06:21:39 EDT
Setting keepalive_interval and keepalive_count to 1 is almost always a bad idea. This bug was caused mainly by lost keepalive messages (they were generated and queued, but the RPC layer just discarded them instead of sending them to the remote party). Thus the number of keepalive requests generated on one side should match the number of requests received on the other side, and each generated request should eventually result in a received keepalive response. Prior to the fixes, the counts would differ. You can make dropping keepalive messages more likely by artificially reducing bandwidth on the network connection used for tunnelled migration, setting keepalive_interval = 1 but increasing keepalive_count to, e.g., 30. To check the numbers, you can grep the debug logs or use systemtap, which would work even better (if you know how to use it; I don't).
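To illustrate the grep-based counting, here is a minimal sketch. The RPC_KEEPALIVE_SEND/RPC_KEEPALIVE_RECEIVED markers and the proc=1 (request) / proc=2 (response) fields mirror the debug logs quoted in comment 16; the sample log lines below are fabricated for the illustration.

```shell
# Sketch: compare keepalive requests logged as sent by one daemon with
# requests logged as received by the other. In a real test you would
# run each grep against that host's /var/log/libvirt/libvirtd.log.
log=$(mktemp)
cat > "$log" <<'EOF'
debug : RPC_KEEPALIVE_SEND ka=0x7f8b34002de0 proc=1
debug : RPC_KEEPALIVE_SEND ka=0x7f8b34002de0 proc=2
debug : RPC_KEEPALIVE_RECEIVED ka=0x7f8b34002de0 proc=1
EOF
# proc=1 selects keepalive requests, ignoring proc=2 responses.
sent=$(grep RPC_KEEPALIVE_SEND "$log" | grep -c 'proc=1')
recv=$(grep RPC_KEEPALIVE_RECEIVED "$log" | grep -c 'proc=1')
echo "keepalive requests: sent=$sent received=$recv"
rm -f "$log"
```

If the two counts differ between source and target, keepalive messages were dropped, which is the symptom this bug's fixes address.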
Comment 16 weizhang 2012-07-27 04:32:09 EDT
I tried with the source /etc/libvirt/qemu.conf set to
keepalive_interval = 1
keepalive_count = 30

and the target /etc/libvirt/libvirtd.conf set to
keepalive_interval = 1
keepalive_count = 30

Put a heavy load on the target with several "cp" loops and 15+ guests running; start 1 guest on the source host, also with one "cp" load, then do the migration on the source with

# virsh migrate-setspeed guest1 1; time virsh migrate --live --p2p --tunnelled guest1 qemu+tcp://10.66.4.196/system --unsafe --verbose
Migration: [ 73 %]error: internal error client socket is closed

real	7m15.656s
user	0m0.022s
sys	0m0.014s

I ran the migration twice; one run succeeded and one failed with the above message.

On the 1st migration, it succeeded in 4 minutes without any "no response from client after 30 keepalive messages" message, and the count I get on the source with
# cat /var/log/libvirt/libvirtd.log |grep RPC_KEEPALIVE_SEND |grep "proc=1" | grep "ka=0x7f8b34002de0" |wc -l

is the same as the one on the target:
# cat /var/log/libvirt/libvirtd.log |grep RPC_KEEPALIVE_RECEIVED |grep "proc=1" |wc -l

(proc=1 means request, proc=2 means response)

The count is 216 on both sides.

The second migration failed with
virKeepAliveTimerInternal:143 : No response from client 0x7f676401ba70 after 30 keepalive messages in 76 seconds
in libvirtd.log, and the message counts differ between source and target (the source sent 318, the target received 288).

The logs for the second migration will be attached later.

version
libvirt-0.9.13-3.el6.x86_64
qemu-kvm-0.12.1.2-2.298.el6.x86_64
kernel-2.6.32-289.el6.x86_64
Comment 17 weizhang 2012-07-27 04:35:25 EDT
Created attachment 600718 [details]
source libvirtd log when test on libvirt-0.9.13-3

It is 27 MB, so it is attached as a bz2 package.
Comment 18 weizhang 2012-07-27 04:37:01 EDT
Created attachment 600719 [details]
target libvirtd log when test on libvirt-0.9.13-3

It is 71 MB, so it is attached as a bz2 package.
Comment 20 Jiri Denemark 2012-12-04 09:02:48 EST
It's hard to give any advice here. Putting additional load on the hosts makes long delays more likely to happen, and thus increasing the timeouts may be required. However, the bug should be reproducible even with a single migration and no additional load. Any tunnelled migration taking longer than 30 seconds (the configured timeout) would fail. You can slow the migration down by artificially limiting network bandwidth, if needed.
Comment 21 weizhang 2012-12-06 06:48:31 EST
I tested by limiting the bandwidth with
# tc qdisc add dev eth0 root tbf rate 500kbit burst 1024kbit latency 50ms
on the source and then doing the migration.

in version libvirt-0.9.10-9
# virsh migrate --live --p2p --tunnelled rhel6u3 qemu+ssh://10.66.84.16/system --verbose
Migration: [  0 %]error: internal error received hangup / error event on socket

in version libvirt-0.9.13-3
# virsh migrate --live --p2p --tunnelled rhel6u3 qemu+ssh://10.66.84.16/system --verbose
Migration: [  0 %]error: operation failed: Lost connection to destination host


Does this mean the bug is fixed?
Comment 22 Jiri Denemark 2012-12-10 06:03:58 EST
On which machine did you run the "tc ..." command? On which machine did you run virsh? What are the keepalive settings on all machines involved in the process?
Comment 23 weizhang 2012-12-10 06:45:27 EST
(In reply to comment #22)
> On which machine did you run the "tc ..." command? On which machine did you
> run virsh? What is the keepalive settings on all machines involved in the
> process?

I ran
# tc qdisc add dev eth0 root tbf rate 500kbit burst 1024kbit latency 50ms
on the migration source machine and then ran virsh migrate xxx; there are no special keepalive settings on either machine.
Comment 24 weizhang 2012-12-11 02:49:55 EST
Thanks to Jiri for his help.
Verification passed on libvirt-0.10.2-11.el6.x86_64

I limited the network bandwidth with
# tc qdisc add dev eth0 root tbf rate 50Mbit latency 50ms burst 50000 mpu 64 mtu 150000


# netperf -H 10.66.84.16 -f M
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.84.16 (10.66.84.16) port 0 AF_INET : histogram : interval
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    MBytes/sec  

 87380  16384  16384    10.10       5.79 


Do migration with
#  time virsh migrate --live --p2p --tunnelled rhel6u3 qemu+tcp://10.66.84.16/system --verbose
Migration: [100 %]

real	1m6.611s
user	0m0.025s
sys	0m0.018s


The bug can be reproduced on libvirt-0.9.10-9.el6.x86_64:
# time virsh migrate --live --p2p --tunnelled rhel6u3 qemu+tcp://10.66.84.16/system --verbose
Migration: [ 78 %]error: internal error client socket is closed


real	0m32.085s
user	0m0.022s
sys	0m0.014s
Comment 25 errata-xmlrpc 2013-02-21 02:09:21 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
