Bug 822839
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | virt-manager and virsh hang when the network on the target host is stopped while a migration is in progress | | |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | lei wang <leiwan> |
| Component: | libvirt | Assignee: | Martin Kletzander <mkletzan> |
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Priority: | medium |
| Version: | 6.3 | CC: | acathrow, cwei, dallan, dyuan, gsun, herrold, jdenemar, juzhou, leiwan, mzhan, veillard, weizhan |
| Target Milestone: | rc | Keywords: | Upstream |
| Hardware: | x86_64 | OS: | Linux |
| Doc Type: | Bug Fix | Type: | Bug |
| Clones: | 823467 1073506 (view as bug list) | Bug Blocks: | 823467 |
| Last Closed: | 2014-04-04 22:56:18 UTC | | |
Description (lei wang, 2012-05-18 10:14:34 UTC)
This is because neither virsh nor virt-manager supports sending its own keepalive packets (they only respond to keepalive messages sent by daemons). The fix is a trivial addition: after making a connection, the client just needs to call the virConnectSetKeepAlive() API. Since this cannot be done automatically (each client may want to set different parameters for the keepalive protocol), it needs to be fixed in the client. Let's keep this bug for the required change in virsh and make a clone of it for virt-manager.

I could reproduce the bug sometimes. According to the log, the keepalive protocol works fine and the client can respond to requests from libvirtd, so for this bug it currently seems unnecessary to add an active keepalive mechanism to the client. The client just goes into a loop. The following is the client log in the loop:

    2012-05-23 09:29:58.163+0000: 3534: debug : virNetMessageDecodeLength:149 : Got length, now need 28 total (24 more)
    2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientCallDispatch:926 : RPC_CLIENT_MSG_RX: client=0x7f7e8af7d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
    2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveCheckMessage:408 : ka=0x199e3c0, client=0x7f7e8af7d010, msg=0x7f7e8af7d068
    2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveCheckMessage:424 : RPC_KEEPALIVE_RECEIVED: ka=0x199e3c0 client=0x7f7e8af7d010 prog=1801807216 vers=1 proc=1
    2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveCheckMessage:428 : Got keepalive request from client 0x7f7e8af7d010
    2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveScheduleResponse:143 : Scheduling keepalive response to client 0x7f7e8af7d010
    2012-05-23 09:29:58.163+0000: 3534: debug : virNetMessageNew:48 : msg=0x7f7e7c083090 tracked=0
    2012-05-23 09:29:58.163+0000: 3534: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
    2012-05-23 09:29:58.163+0000: 3534: debug : virEventPollUpdateTimeout:259 : EVENT_POLL_UPDATE_TIMEOUT: timer=1 frequency=0
    2012-05-23 09:29:58.163+0000: 3534: debug : virEventPollInterruptLocked:706 : Interrupting
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollRunOnce:630 : Poll got 1 event(s)
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchTimeouts:415 : Dispatch 2
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchTimeouts:438 : EVENT_POLL_DISPATCH_TIMEOUT: timer=1
    2012-05-23 09:29:58.163+0000: 3533: debug : virKeepAliveResponseTimer:213 : ka=0x199e3c0, client=0x7f7e8af7d010, response=0x7f7e7c083090
    2012-05-23 09:29:58.163+0000: 3533: debug : virKeepAliveSend:119 : Sending keepalive response to client 0x7f7e8af7d010
    2012-05-23 09:29:58.163+0000: 3533: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0x199e3c0 client=0x7f7e8af7d010 prog=1801807216 vers=1 proc=2
    2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientSendInternal:1672 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7e8af7d010 len=28 prog=1801807216 vers=1 proc=2 type=2 status=0 serial=0
    2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1530 : Outgoing message prog=1801807216 version=1 serial=0 proc=2 type=2 length=28 dispatch=0x7f7e7c000970
    2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1547 : Going to sleep 0x7f7e7c000970 0x7f7e84000a50
    2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientIOEventLoop:1373 : Woken up from poll by other thread
    2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientIOEventLoop:1386 : New non-blocking call arrived; repolling
    2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientIOEventLoopRemoveDone:1165 : Waking up sleep 0x7f7e84000a50
    2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1556 : Wokeup from sleep 0x7f7e7c000970 0x7f7e84000a50
    2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1617 : All done with our call 0x7f7e7c000970 0x7f7e84000a50 2
    2012-05-23 09:29:58.163+0000: 3533: debug : virNetMessageFree:75 : msg=0x7f7e7c083090 nfds=0 cb=(nil)
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollUpdateTimeout:259 : EVENT_POLL_UPDATE_TIMEOUT: timer=1 frequency=-1
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollInterruptLocked:702 : Skip interrupt, 1 -1952717056
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:460 : Dispatch 2
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:474 : i=0 w=1
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:488 : EVENT_POLL_DISPATCH_HANDLE: watch=1 events=1
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:474 : i=2 w=3
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupTimeouts:506 : Cleanup 2
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupHandles:554 : Cleanup 3
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventRunDefaultImpl:244 : running default event implementation
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupTimeouts:506 : Cleanup 2
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupHandles:554 : Cleanup 3
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollMakePollFDs:383 : Prepare n=0 w=1, f=4 e=1 d=0
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollMakePollFDs:383 : Prepare n=1 w=2, f=6 e=0 d=0
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollMakePollFDs:383 : Prepare n=2 w=3, f=11 e=1 d=0
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCalculateTimeout:325 : Calculate expiry of 2 timers
    2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCalculateTimeout:351 : Timeout at 0 due in -1 ms

While domainMigratePrepare3 was running, "service network stop" was run on the source host; according to the log, the RPC then returns a wrong message with a garbled header, and the serial number is not right either. libvirtd needs further debugging.
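The keepalive protocol discussed in this bug (virConnectSetKeepAlive() takes an interval and a count) can be sketched as a small model. This is illustrative only: the class and method names below are invented for the sketch and are not libvirt's actual implementation.

```python
class KeepAliveTracker:
    """Illustrative model of libvirt's keepalive protocol: a keepalive
    request is sent every `interval` seconds, and the peer is declared
    dead once more than `count` consecutive requests go unanswered."""

    def __init__(self, interval=5, count=5):
        self.interval = interval
        self.count = count
        self.unanswered = 0

    def tick(self):
        # Called every `interval` seconds, when a keepalive request is sent.
        self.unanswered += 1

    def response_received(self):
        # Any message from the peer counts as proof of life.
        self.unanswered = 0

    def peer_is_dead(self):
        # Dead after roughly interval * count seconds of silence.
        return self.unanswered > self.count


ka = KeepAliveTracker(interval=5, count=5)
ka.tick(); ka.response_received()   # healthy: the daemon answers
for _ in range(6):                  # network cut: six pings, no answers
    ka.tick()
print(ka.peer_is_dead())            # True
```

In this bug, the client only ever played the responder role (it answered the daemon's keepalive requests), so it never ran logic like the above for the connection it cared about.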
    2012-05-23 09:29:08.181+0000: 3534: debug : virNetClientSendInternal:1672 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7e8af7d010 len=396 prog=536903814 vers=1 proc=216 type=0 status=0 serial=10
    2012-05-23 09:29:08.181+0000: 3534: debug : virNetClientIO:1530 : Outgoing message prog=536903814 version=1 serial=10 proc=216 type=0 length=396 dispatch=(nil)
    2012-05-23 09:29:08.181+0000: 3534: debug : virNetClientIO:1588 : We have the buck 0x7f7e7c000970 0x7f7e7c000970
    ...
    2012-05-23 09:29:13.181+0000: 3534: debug : virNetClientCallDispatch:926 : RPC_CLIENT_MSG_RX: client=0x7f7e8af7d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0

The analysis in comment 4 is not right. The client looks like it is hanging, but actually it is not: after the network is cut off, the client has no idea what happened and just waits forever (until the TCP timeout). The keepalive messages still work between the client and the local libvirtd, but the keepalive communication between the client and the target machine no longer works because of the network disconnection. So we get the following error after forcibly stopping the client by hand:

    2012-05-18 06:24:09.019+0000: 25903: warning : virKeepAliveSend:128 : Failed to send keepalive response to client 0x7f23c0002f50

Comment 2 gave the right solution. For non-p2p migration, the keepalive protocol lets the client learn about the lost connection to the target machine; the client should then invoke virDomainAbortJob() to abort the domainMigratePerform3 job running on the source machine.

The blocking qemu-kvm bug is that the qemu monitor command "migrate_cancel" blocks and fails to cancel an ongoing migration job if the network is disconnected. The problem exists on Fedora 16, but the fix has been verified against the newest upstream qemu code, so a patch will follow soon.

Patch sent upstream: https://www.redhat.com/archives/libvir-list/2012-June/msg00419.html

This is a long-standing bug, can you confirm it still happens on 6.5?
I would say there's a high possibility of it being fixed already.

(In reply to Martin Kletzander from comment #14)

> This is a long-standing bug, can you confirm it still happens on 6.5? I would say there's a high possibility of it being fixed already.

Hi Martin Kletzander, I also hit the "virsh command hangs" problem on a 6.5 system, but virt-manager works as expected.

Package versions:

    libvirt-0.10.2-29.el6_5.4.x86_64
    virt-manager-0.9.0-19.el6.x86_64
    python-virtinst-0.600.0-18.el6.noarch
    # uname -r
    2.6.32-431.7.1.el6.x86_64

Set up an NFS server and mounted it on both source and target.

Steps to reproduce:

1. Start a domain:

        # virsh list
         Id    Name                 State
        ----------------------------------------------------
         5     rhel6test            running

2. Start a live migration:

        # virsh migrate rhel6test --live qemu+ssh://target/system --verbose
        root@target's password:

3. Open virt-manager and connect to both hosts to monitor the progress of the migration.

4. While the migration is happening, stop the network on the target host:

        # virsh migrate rhel6test --live qemu+ssh://target/system --verbose
        Migration: [ 92 %]
        # virsh list
         Id    Name                 State
        ----------------------------------------------------
         6     rhel6test            running

Results: (1) the migrate command does not print any message and does not stop automatically when the migration fails; the virsh command hangs. (2) virt-manager works well.

How long did you wait for virsh to finish? Please try that again and let it settle for a while. The ssh connection must time out in order for the whole connection to time out. You can check this by connecting to the machine over ssh before the migration: if that ssh connection times out before the virsh one does, then virsh has a problem, but there is no way for virsh to get stuck if there is no connection. I'm closing this as NOTABUG; feel free to reopen if virsh stays stuck longer than the ssh session you're connecting through.
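The point about ssh is that, without RPC-level keepalive, virsh can only notice a dead peer when the underlying transport times out, whereas keepalive bounds detection at roughly interval * count seconds. A quick illustrative calculation, using an assumed 5-second interval and a count of 6 (these parameter values are assumptions for the sketch, not taken from libvirt source) and the timestamps reported later in this bug:

```python
# Keepalive bounds failure detection at interval * count seconds
# (both values assumed here for illustration).
keepalive_interval = 5      # seconds between keepalive requests (assumed)
keepalive_count = 6         # unanswered requests before giving up (assumed)
keepalive_detection = keepalive_interval * keepalive_count

# Timestamps from the later retest in this bug: network stopped at
# 10:39:07, ssh failed at 10:57:26, virsh errored out at 11:10:01.
network_down = 10 * 3600 + 39 * 60 + 7
ssh_timeout = (10 * 3600 + 57 * 60 + 26) - network_down
virsh_timeout = (11 * 3600 + 10 * 60 + 1) - network_down

print(keepalive_detection)  # 30 seconds
print(ssh_timeout)          # 1099 seconds, about 18 minutes
print(virsh_timeout)        # 1854 seconds, about 31 minutes
```

The gap between tens of seconds and tens of minutes is what made the hang look indefinite to the reporter.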
(In reply to Martin Kletzander from comment #17)

Hi Martin Kletzander, I followed your suggestion and tried again with the following steps:

1. Start a domain on the source host:

        # virsh list
         Id    Name                 State
        ----------------------------------------------------
         5     rhel6test            running

   At the same time, connect to the target host over ssh:

        # ssh root@10.66.85.74 -X
        root@10.66.85.74's password:

2. Start a live migration:

        # virsh migrate rhel6test --live qemu+ssh://target/system --verbose
        root@target's password:

3. While the migration is happening, stop the network on the target host:

        # service network stop
        # date
        Thu Feb 27 10:39:07 CST 2014

   On the source host:

        # virsh migrate --live rhel6test qemu+ssh://10.66.85.74/system --verbose
        root@10.66.85.74's password:
        Migration: [ 89 %]

4. After the ssh connection to the target host timed out, check the result on both hosts.

   On the target host:

        [root@intel-e31225-8-3 ~]# Write failed: Broken pipe
        # date
        Thu Feb 27 10:57:26 CST 2014

   On the source host:

        # virsh migrate --live rhel6test qemu+ssh://10.66.85.74/system --verbose
        root@10.66.85.74's password:
        Migration: [ 89 %]error: operation failed: migration job: unexpectedly failed
        # date
        Thu Feb 27 11:10:01 CST 2014
        # virsh list --all
         Id    Name                 State
        ----------------------------------------------------
         10    rhel6test            running

5. Comparing the timestamps shows that virsh reported its error well after ssh had already timed out.

What do you think about this result? Let me know, thanks.

(In reply to zhoujunqin from comment #18)

I'd say this is a timeout that's not as reasonable as I was expecting. Let's keep it open then and I'll add the support for keepalive in virsh.
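The behavior the fix aims for, in which the client's own keepalive notices the dead link and aborts the migration job instead of blocking until the TCP timeout, can be modeled roughly as follows. This is an illustrative sketch only; the function names are invented and are not virsh internals.

```python
def migrate(progress_steps, peer_alive):
    """Model of the fixed client loop: poll migration progress, and if
    the keepalive check says the peer is gone, abort the job (as
    virDomainAbortJob() would) instead of blocking forever."""
    last_progress = 0
    for step, progress in enumerate(progress_steps):
        if not peer_alive(step):
            # With keepalive enabled the dead connection is detected
            # here, bounded by interval * count seconds.
            return ("aborted", last_progress)
        last_progress = progress
    return ("completed", last_progress)


# Destination network goes down after the job reaches 89 %,
# mirroring the "Migration: [ 89 %]" hang reported in this bug.
steps = [10, 45, 89, 92, 100]
result = migrate(steps, peer_alive=lambda step: step < 3)
print(result)   # ('aborted', 89)
```

Without the keepalive check, the model's loop would simply block on the next progress read, which is the hang the reporter observed.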
Fixed upstream with commit v1.2.2-201-g676cb4f:

    commit 676cb4f4e762b8682a06c6dab1f690fbcd939550
    Author: Martin Kletzander <mkletzan>
    Date:   Thu Mar 6 17:20:11 2014 +0100

        virsh: Add keepalive in new vshConnect function

in series with the optional commit v1.2.2-202-gb0cf7d6:

    commit b0cf7d64614ea1000424534ebbd5738d254c7410
    Author: Martin Kletzander <mkletzan>
    Date:   Fri Mar 7 11:15:39 2014 +0100

        virsh: Prohibit virConnectOpen* functions in virsh

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.