Bug 807907
Summary: | Tunnelled migration sometimes reports an error during scalability testing
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | weizhang <weizhan>
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | medium | Docs Contact: |
Priority: | medium
Version: | 6.3 | CC: | acathrow, dallan, dyasny, dyuan, gsun, honzhang, mzhan, rwu, syeghiay, veillard
Target Milestone: | rc
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: | libvirt-0.9.13-3.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-02-21 07:09:21 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 807996
Is the destination host heavily overloaded when this happens? According to the logs, libvirtd on the destination host had not sent any response to the source daemon within the last 30 seconds, so the source concluded the connection was broken and aborted the migration. In any case, could you provide a longer log file that contains entries starting at least 30 seconds before the one you attached? And please also attach libvirtd logs from the destination host covering the same period.
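For reference, a minimal sketch of how debug logging can be enabled so the libvirtd logs cover the whole migration period; the log level and output path below are standard libvirtd.conf options and are an assumption on my part, not settings quoted from this report. In /etc/libvirt/libvirtd.conf on both the source and the destination host, set:

    log_level = 1
    log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

and restart the daemon so the settings take effect:

# service libvirtd restart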
Created attachment 576136 [details]
Log from the source host covering the 30 seconds preceding the failure
Created attachment 576137 [details]
Log from the target host covering the same period
This is caused by a bug in the keepalive protocol, which closes the connection to the destination in the middle of a tunnelled migration. There are a few other bugs that make the error message quite unhelpful. Patches that fix all of these issues were sent upstream: https://www.redhat.com/archives/libvir-list/2012-April/msg01341.html

Patches sent for internal review: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-April/msg00855.html

Verification passed on
libvirt-0.9.10-16.el6.x86_64
kernel-2.6.32-269.el6.x86_64
qemu-kvm-0.12.1.2-2.285.el6.x86_64

Easier steps:
1. Edit /etc/libvirt/qemu.conf and set
   keepalive_interval = 1
   keepalive_count = 1
2. Start 20 guests of different types and migrate them one by one with
   # for i in {1..20}; do virsh migrate-setspeed guest-$i 1; virsh migrate --live --tunnelled --p2p guest-$i qemu+tls://10.66.83.39/system; done
   None of the migrations show errors like
   error: internal error client socket is closed
   or
   error: internal error received hangup / error event on socket

We reverted that patch due to https://bugzilla.redhat.com/show_bug.cgi?id=821468
Daniel

This issue and several other issues in client RPC should be properly fixed by commits v0.9.12-189-g71689f9 through v0.9.12-201-g5d49060.

Tested on
libvirt-0.9.13-3.el6.x86_64
qemu-kvm-0.12.1.2-2.298.el6.x86_64
kernel-2.6.32-289.el6.x86_64

The error "internal error received hangup / error event on socket" can still be found with these steps:
1. Prepare 10 guests and also occupy CPU resources on the destination host with "cp"; configure /etc/libvirt/libvirtd.conf with
   keepalive_interval = 1
   keepalive_count = 1
2. Do the migrations with
   # for i in {1..10}; do virsh migrate --live --p2p --tunnelled guest$i qemu+tcp://{target ip}/system --verbose --unsafe; done
   Some of them may fail with an error like
   error: internal error received hangup / error event on socket
so re-assigning the bug.

This is not a correct test scenario. Setting keepalive_interval = 1 and keepalive_count = 1 means that every second the destination host is supposed to answer a keepalive request and the source host must process the response in less than a second. If this doesn't happen for any of these requests in time, the connection is closed. This can easily happen, especially if the hosts or the network are loaded.

(In reply to comment #13)
> This is not a correct test scenario. Setting keepalive_interval = 1 and
> keepalive_count = 1 means that every second the destination host is supposed
> to answer a keepalive request and the source host must process the response
> in less than a second. If this doesn't happen for any of these requests in
> time, the connection is closed. This can easily happen especially if the
> hosts or network is loaded.

Then how can I verify this bug with fewer guests? Should I change /etc/libvirt/qemu.conf with keepalive_interval = 1 and keepalive_count = 1 as in comment 8, but not /etc/libvirt/libvirtd.conf? Or is there some other method?

Setting keepalive_interval and keepalive_count to 1 is almost always a bad idea. This bug was caused mainly by lost keepalive messages: they were generated and queued, but the RPC layer just discarded them instead of sending them to the remote party. Thus the number of keepalive requests generated on one side should match the number of requests received on the other side, and each generated request should eventually result in a keepalive response being received. Prior to the fixes, the counts would differ.

You can make dropping keepalive messages more likely by artificially reducing bandwidth on the network connection used for tunnelled migration, setting keepalive_interval = 1 but increasing keepalive_count to, e.g., 30. To check the numbers, you can grep the debug logs or use systemtap, which would work even better (if you know how to use it; I don't).
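As a sketch of the grep-based counting suggested above (assuming debug logging is enabled on both hosts; the RPC_KEEPALIVE_* event names and the proc=1 request marker are taken from the commands used in the test below), the number of keepalive requests sent by the source can be compared with the number received by the target:

On the source host:
# grep RPC_KEEPALIVE_SEND /var/log/libvirt/libvirtd.log | grep "proc=1" | wc -l

On the target host:
# grep RPC_KEEPALIVE_RECEIVED /var/log/libvirt/libvirtd.log | grep "proc=1" | wc -l

If the two counts match (optionally filtering on a single connection's ka=0x... pointer, as done in the test below), keepalive messages were not dropped; with the pre-fix packages, the counts would differ.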
I tried setting keepalive_interval = 1 and keepalive_count = 30 in /etc/libvirt/qemu.conf on the source and in /etc/libvirt/libvirtd.conf on the target. I added heavy load on the target with several "cp" loops and 15+ guests running, started 1 guest on the source host, also with 1 "cp" load, and then did the migration on the source with
# virsh migrate-setspeed guest1 1; time virsh migrate --live --p2p --tunnelled guest1 qemu+tcp://10.66.4.196/system --unsafe --verbose
Migration: [ 73 %]error: internal error client socket is closed

real    7m15.656s
user    0m0.022s
sys     0m0.014s

I did this twice; one run succeeded and one failed with the above message.

The first migration succeeded in 4 minutes without "no response from client after 30 keepalive messages", and the count I get on the source with
# cat /var/log/libvirt/libvirtd.log | grep RPC_KEEPALIVE_SEND | grep "proc=1" | grep "ka=0x7f8b34002de0" | wc -l
is the same as on the target with
# cat /var/log/libvirt/libvirtd.log | grep RPC_KEEPALIVE_RECEIVED | grep "proc=1" | wc -l
(proc=1 means request, proc=2 means response); the count is 216 on both sides.

The second migration failed with
virKeepAliveTimerInternal:143 : No response from client 0x7f676401ba70 after 30 keepalive messages in 76 seconds
in libvirtd.log, and the message counts are not the same between source and target (source sent 318, target received 288). The log for the second run will be attached later.

Versions:
libvirt-0.9.13-3.el6.x86_64
qemu-kvm-0.12.1.2-2.298.el6.x86_64
kernel-2.6.32-289.el6.x86_64

Created attachment 600718 [details]
Source libvirtd log from the test on libvirt-0.9.13-3; it is 27 MB, so it is attached as a bz2 package.

Created attachment 600719 [details]
Target libvirtd log from the test on libvirt-0.9.13-3; it is 71 MB, so it is attached as a bz2 package.

It's hard to give any advice here. Putting additional load on the hosts makes long delays more likely to happen, and thus increasing the timeouts may be required. However, the bug should be reproducible even with a single migration and no additional load: any tunnelled migration longer than 30 seconds (the configured timeout) would fail. You can slow migration down by artificially limiting network bandwidth, if needed.

I tested by limiting the bandwidth with
# tc qdisc add dev eth0 root tbf rate 500kbit burst 1024kbit latency 50ms
on the source and then doing the migration.

With libvirt-0.9.10-9:
# virsh migrate --live --p2p --tunnelled rhel6u3 qemu+ssh://10.66.84.16/system --verbose
Migration: [ 0 %]error: internal error received hangup / error event on socket

With libvirt-0.9.13-3:
# virsh migrate --live --p2p --tunnelled rhel6u3 qemu+ssh://10.66.84.16/system --verbose
Migration: [ 0 %]error: operation failed: Lost connection to destination host

Does that mean the bug is fixed?

On which machine did you run the "tc ..." command? On which machine did you run virsh? What are the keepalive settings on all machines involved in the process?
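As an aside, a quick way to confirm whether a failed migration was aborted by the keepalive timer is to look for the timeout message quoted above in the libvirtd log on both hosts; this is just a sketch based on that log excerpt, assuming the default log location:

# grep "No response from client" /var/log/libvirt/libvirtd.log

A hit from virKeepAliveTimerInternal indicates the connection was dropped because keepalive responses did not arrive in time, rather than for some other reason.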
(In reply to comment #22)
> On which machine did you run the "tc ..." command? On which machine did you
> run virsh? What is the keepalive settings on all machines involved in the
> process?

I ran
# tc qdisc add dev eth0 root tbf rate 500kbit burst 1024kbit latency 50ms
on the migration source machine and then ran virsh migrate xxx; there are no special keepalive settings on either machine. Thanks for Jiri's help.

Verification passed on libvirt-0.10.2-11.el6.x86_64.

Limited the network with
# tc qdisc add dev eth0 root tbf rate 50Mbit latency 50ms burst 50000 mpu 64 mtu 150000
# netperf -H 10.66.84.16 -f M
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.84.16 (10.66.84.16) port 0 AF_INET : histogram : interval
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  16384    10.10       5.79

Did the migration with
# time virsh migrate --live --p2p --tunnelled rhel6u3 qemu+tcp://10.66.84.16/system --verbose
Migration: [100 %]

real    1m6.611s
user    0m0.025s
sys     0m0.018s

The failure can still be reproduced on libvirt-0.9.10-9.el6.x86_64:
# time virsh migrate --live --p2p --tunnelled rhel6u3 qemu+tcp://10.66.84.16/system --verbose
Migration: [ 78 %]error: internal error client socket is closed

real    0m32.085s
user    0m0.022s
sys     0m0.014s

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
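For anyone re-running this verification, a rough sketch of the sequence used above. The interface name eth0, the rates, and the destination address are taken from the commands in this report; the final qdisc removal is a standard tc cleanup step that I am adding as an assumption, since it is not shown above.

Limit outgoing bandwidth on the source so the tunnelled migration takes well over 30 seconds, and sanity-check the resulting throughput:
# tc qdisc add dev eth0 root tbf rate 50Mbit latency 50ms burst 50000 mpu 64 mtu 150000
# netperf -H 10.66.84.16 -f M

Run the timed tunnelled migration:
# time virsh migrate --live --p2p --tunnelled rhel6u3 qemu+tcp://10.66.84.16/system --verbose

Restore the original network settings afterwards:
# tc qdisc del dev eth0 root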
Created attachment 573541 [details]
libvirtd.log

Description of problem:
I am doing a scalability test, migrating 514 running guests (500 rhel6u3 guests plus 14 other kinds of rhel5/rhel4/rhel3/windows guests) one by one over tcp with --p2p --tunnelled. Towards the end, a migration sometimes reports an error:
# time virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
Migration: [ 43 %]error: internal error received hangup / error event on socket

real    0m32.483s
user    0m0.039s
sys     0m0.031s

But when the migration is done again, it may succeed:
# time virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
Migration: [100 %]

real    1m0.520s
user    0m0.067s
sys     0m0.042s

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.262.el6.x86_64
kernel-2.6.32-250.el6.x86_64
libvirt-0.9.10-9.el6.x86_64

How reproducible:
50%

Steps to Reproduce:
1. Start 512 guests (500 rhel6u3 guests plus 14 other kinds of rhel5/rhel4/rhel3/windows guests) and migrate them one by one with
# virsh migrate --live rhel5u8-i386 qemu+tcp://10.66.104.93/system --verbose --p2p --tunnelled
or change the address to an IPv6 address.

Actual results:
Sometimes a migration reports the error
error: internal error received hangup / error event on socket

Expected results:
No error.

Additional info:
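As an illustration of the one-by-one migration loop described above, a minimal sketch; using virsh list --name to enumerate the running guests is my assumption, since the report does not say how the guests were iterated:

# for g in $(virsh list --name); do time virsh migrate --live --p2p --tunnelled "$g" qemu+tcp://10.66.104.93/system --verbose; done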