Bug 822839 - virt-manager and virsh command hangs while the migration is happening and stop network on target host.
virt-manager and virsh command hangs while the migration is happening and sto...
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt (Show other bugs)
6.3
x86_64 Linux
medium Severity medium
: rc
: ---
Assigned To: Martin Kletzander
Virtualization Bugs
: Upstream
Depends On:
Blocks: 823467
  Show dependency treegraph
 
Reported: 2012-05-18 06:14 EDT by lei wang
Modified: 2014-04-04 18:56 EDT (History)
12 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 823467 1073506 (view as bug list)
Environment:
Last Closed: 2014-04-04 18:56:18 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description lei wang 2012-05-18 06:14:34 EDT
Description of problem:
virt-manager and virsh command hangs while the migration is happening and stop network on target host.

Version-Release number of selected component (if applicable):
# rpm -qa |grep libvirt
libvirt-debuginfo-0.9.10-14.el6.x86_64
libvirt-0.9.10-20.el6.x86_64
libvirt-devel-0.9.10-20.el6.x86_64
libvirt-client-0.9.10-20.el6.x86_64
libvirt-python-0.9.10-20.el6.x86_64
# rpm -q virt-manager
virt-manager-0.9.0-14.el6.x86_64
# rpm -q python-virtinst
python-virtinst-0.600.0-8.el6.noarch
# uname -r
2.6.32-272.el6.x86_64
#

How reproducible:
100%

procadion 
Configure migration environment with below command on both src and dst host.
#iptables -F
#setenforce 1
#setsebool virt_use_nfs  on
#mount 10.66.90.121:/vol/S3/libvirtauto /var/lib/libvirt/images/ -o vers=3

Configure ssh environment
# ssh-keygen
# ssh-copy-id root@target ip

Steps to Reproduce:
1. Start a domain;
   # virsh start <domain>

2. Start a live migration;
   # virsh migrate --live <domain> qemu+ssh://host2/system

3. Open virt-manager connect two hosts for monitor the process of migration.

4. While the migration is happening, stop network on target host.

  
Actual results:
1.The migration command don't prompt any message and don't stop automatic if migration does not success, virsh command hangs.
2.virt-manager hangs

Expected results:
1.The migration should fail, and could prompt some message about what happened
2.The virt-manager works well

Additional info:
If start network on target host, the below message shown, and virt-manager disconnect from all hosts.

[root@test ~]# virsh migrate --live mig-10 qemu+ssh://leiwan.redhat.com/system
2012-05-18 06:24:09.019+0000: 25903: info : libvirt version: 0.9.10, package: 20.el6 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2012-05-15-21:58:29, x86-007.build.bos.redhat.com)
2012-05-18 06:24:09.019+0000: 25903: warning : virKeepAliveSend:128 : Failed to send keepalive response to client 0x7f23c0002f50
error: operation failed: migration job: unexpectedly failed
error: Reconnected to the hypervisor

If use command '#virsh domjobabort <domain>' to cancel migration job, the virsh command could NOT stop directly.

virt-manager Log:

# virt-manager --debug
2012-05-18 14:49:19,603 (cli:71): virt-manager startup
2012-05-18 14:49:19,603 (virt-manager:292): Launched as: /usr/share/virt-manager/virt-manager.py --debug
2012-05-18 14:49:19,603 (virt-manager:293): GTK version: (2, 18, 9)
2012-05-18 14:49:19,603 (virt-manager:294): virtManager import: <module 'virtManager' from '/usr/share/virt-manager/virtManager/__init__.pyc'>
2012-05-18 14:49:19,684 (engine:555): No inspection thread because libguestfs is too old, not available, or libvirt is not thread safe.
2012-05-18 14:49:19,685 (engine:346): About to connect to uris ['qemu+ssh://root@leiwan.redhat.com/system', 'qemu:///system']
2012-05-18 14:49:19,689 (virt-manager:350): Connected to already running instance.

# cat virt-manager.log
[Fri, 18 May 2012 15:00:40 virt-manager 25231] ERROR (engine:453) Could not refresh connection qemu+ssh://root@leiwan.redhat.com/system.
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/engine.py", line 440, in _tick
    conn.tick()
  File "/usr/share/virt-manager/virtManager/connection.py", line 1418, in tick
    self.hostinfo = self.vmm.getInfo()
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3175, in getInfo
    if ret is None: raise libvirtError ('virNodeGetInfo() failed', conn=self)
libvirtError: End of file while reading data: : Input/output error
[Fri, 18 May 2012 15:00:40 virt-manager 25231] DEBUG (engine:454) Closing connection since libvirtd appears to have stopped.
[Fri, 18 May 2012 15:00:40 virt-manager 25231] ERROR (engine:453) Could not refresh connection qemu:///system.
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/engine.py", line 440, in _tick
    conn.tick()
  File "/usr/share/virt-manager/virtManager/connection.py", line 1418, in tick
    self.hostinfo = self.vmm.getInfo()
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3175, in getInfo
    if ret is None: raise libvirtError ('virNodeGetInfo() failed', conn=self)
libvirtError: Cannot write data: Broken pipe
[Fri, 18 May 2012 15:00:40 virt-manager 25231] DEBUG (engine:454) Closing connection since libvirtd appears to have stopped.
[Fri, 18 May 2012 15:00:40 virt-manager 25231] ERROR (cli:85) Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/manager.py", line 975, in vm_selected
    self.change_run_text(vm.hasSavedImage())
  File "/usr/share/virt-manager/virtManager/domain.py", line 1089, in hasSavedImage
    return self._backend.hasManagedSaveImage(0)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 738, in hasManagedSaveImage
    if ret == -1: raise libvirtError ('virDomainHasManagedSaveImage() failed', dom=self)
libvirtError: Cannot write data: Broken pipe
None
[Fri, 18 May 2012 15:00:40 virt-manager 25231] ERROR (cli:85) Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/manager.py", line 912, in conn_state_changed
    self.vm_selected()
  File "/usr/share/virt-manager/virtManager/manager.py", line 975, in vm_selected
    self.change_run_text(vm.hasSavedImage())
  File "/usr/share/virt-manager/virtManager/domain.py", line 1089, in hasSavedImage
    return self._backend.hasManagedSaveImage(0)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 738, in hasManagedSaveImage
    if ret == -1: raise libvirtError ('virDomainHasManagedSaveImage() failed', dom=self)
libvirtError: Cannot write data: Broken pipe
None
#
Comment 2 Jiri Denemark 2012-05-18 06:39:46 EDT
This is because neither virsh nor virt-manager support sending their own keep alive packets (they only response to keep alive messages sent by daemons). This is trivial addition, after making a connection, the client just needs to call virConnectSetKeepAlive() API. Since this cannot be done automatically (each client may want to set different parameters for the keep alive protocol), it needs to be fixed in the client. Let's keep this bug for the required change in virsh and make a clone of it for virt-manager.
Comment 3 Gunannan Ren 2012-05-23 09:50:45 EDT
I could reproduce the bug sometimes, according to the log, the keepalive protocol works fine, the client could response the request from libvirtd. Currently, it seems not necessary to add active keepalive mechanism in the client part for this bug.

The client just goes in a loop. The following is the client log in the loop:

2012-05-23 09:29:58.163+0000: 3534: debug : virNetMessageDecodeLength:149 : Got length, now need 28 total (24 more)
2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientCallDispatch:926 : RPC_CLIENT_MSG_RX: client=0x7f7e8af7d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveCheckMessage:408 : ka=0x199e3c0, client=0x7f7e8af7d010, msg=0x7f7e8af7d068
2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveCheckMessage:424 : RPC_KEEPALIVE_RECEIVED: ka=0x199e3c0 client=0x7f7e8af7d010 prog=1801807216 vers=1 proc=1
2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveCheckMessage:428 : Got keepalive request from client 0x7f7e8af7d010
2012-05-23 09:29:58.163+0000: 3534: debug : virKeepAliveScheduleResponse:143 : Scheduling keepalive response to client 0x7f7e8af7d010
2012-05-23 09:29:58.163+0000: 3534: debug : virNetMessageNew:48 : msg=0x7f7e7c083090 tracked=0
2012-05-23 09:29:58.163+0000: 3534: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-23 09:29:58.163+0000: 3534: debug : virEventPollUpdateTimeout:259 : EVENT_POLL_UPDATE_TIMEOUT: timer=1 frequency=0
2012-05-23 09:29:58.163+0000: 3534: debug : virEventPollInterruptLocked:706 : Interrupting
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollRunOnce:630 : Poll got 1 event(s)
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchTimeouts:415 : Dispatch 2
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchTimeouts:438 : EVENT_POLL_DISPATCH_TIMEOUT: timer=1
2012-05-23 09:29:58.163+0000: 3533: debug : virKeepAliveResponseTimer:213 : ka=0x199e3c0, client=0x7f7e8af7d010, response=0x7f7e7c083090
2012-05-23 09:29:58.163+0000: 3533: debug : virKeepAliveSend:119 : Sending keepalive response to client 0x7f7e8af7d010
2012-05-23 09:29:58.163+0000: 3533: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0x199e3c0 client=0x7f7e8af7d010 prog=1801807216 vers=1 proc=2
2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientSendInternal:1672 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7e8af7d010 len=28 prog=1801807216 vers=1 proc=2 type=2 status=0 serial=0
2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1530 : Outgoing message prog=1801807216 version=1 serial=0 proc=2 type=2 length=28 dispatch=0x7f7e7c000970
2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1547 : Going to sleep 0x7f7e7c000970 0x7f7e84000a50
2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientIOEventLoop:1373 : Woken up from poll by other thread
2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientIOEventLoop:1386 : New non-blocking call arrived; repolling
2012-05-23 09:29:58.163+0000: 3534: debug : virNetClientIOEventLoopRemoveDone:1165 : Waking up sleep 0x7f7e84000a50
2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1556 : Wokeup from sleep 0x7f7e7c000970 0x7f7e84000a50
2012-05-23 09:29:58.163+0000: 3533: debug : virNetClientIO:1617 : All done with our call 0x7f7e7c000970 0x7f7e84000a50 2
2012-05-23 09:29:58.163+0000: 3533: debug : virNetMessageFree:75 : msg=0x7f7e7c083090 nfds=0 cb=(nil)
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollUpdateTimeout:259 : EVENT_POLL_UPDATE_TIMEOUT: timer=1 frequency=-1
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollInterruptLocked:702 : Skip interrupt, 1 -1952717056
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:460 : Dispatch 2
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:474 : i=0 w=1
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:488 : EVENT_POLL_DISPATCH_HANDLE: watch=1 events=1
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollDispatchHandles:474 : i=2 w=3
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupTimeouts:506 : Cleanup 2
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupHandles:554 : Cleanup 3
2012-05-23 09:29:58.163+0000: 3533: debug : virEventRunDefaultImpl:244 : running default event implementation
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupTimeouts:506 : Cleanup 2
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCleanupHandles:554 : Cleanup 3
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollMakePollFDs:383 : Prepare n=0 w=1, f=4 e=1 d=0
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollMakePollFDs:383 : Prepare n=1 w=2, f=6 e=0 d=0
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollMakePollFDs:383 : Prepare n=2 w=3, f=11 e=1 d=0
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCalculateTimeout:325 : Calculate expiry of 2 timers
2012-05-23 09:29:58.163+0000: 3533: debug : virEventPollCalculateTimeout:351 : Timeout at 0 due in -1 ms
Comment 4 Gunannan Ren 2012-05-24 08:35:39 EDT
During the process of running domainMigratePrepare3, run "service network stop" on source host, according to the log, the rpc will return wrong message with memory-messy header and the serial number is not right as well. 
Need to debug the libvirtd further.


2012-05-23 09:29:08.181+0000: 3534: debug : virNetClientSendInternal:1672 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7e8af7d010 len=396 prog=536903814 vers=1 proc=216 type=0 status=0 serial=10
2012-05-23 09:29:08.181+0000: 3534: debug : virNetClientIO:1530 : Outgoing message prog=536903814 version=1 serial=10 proc=216 type=0 length=396 dispatch=(nil)
2012-05-23 09:29:08.181+0000: 3534: debug : virNetClientIO:1588 : We have the buck 0x7f7e7c000970 0x7f7e7c000970

...

2012-05-23 09:29:13.181+0000: 3534: debug : virNetClientCallDispatch:926 : RPC_CLIENT_MSG_RX: client=0x7f7e8af7d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
Comment 5 Gunannan Ren 2012-05-24 11:50:32 EDT
The comment 4 is not right analysis.
The client looks like hanging up, actually it is not, after the network is cut off, the client has no idea what happened and is just waiting forever(until tcp timeout), the keepalive message is still working between client and local libvirtd. But the keepalive communication between the client and target machine couldn't work anymore because of the network disconnection. 
So we got the following error after stoping the client forcedly by hand.

2012-05-18 06:24:09.019+0000: 25903: warning : virKeepAliveSend:128 : Failed to send keepalive response to client 0x7f23c0002f50

The comment 2 gave the right solution.
Comment 7 Gunannan Ren 2012-06-01 05:34:33 EDT
For the non-p2p migration, with the help of keepalive protocol the client knows the disconnection to target machine, then the client should invoke virshDomainAbortJob() to abort the domainMigratePerform3 job that is running on source machine.
The blocking qemu-kvm bug is that the qemu command "migrate_cancel" blocks and fails to cancel ongoing migration job if the network is disconnected.
The problem exists on Fedora16 but the fix is verified by trying to use the newest qemu stream code. So patch will be soon.
Comment 8 Gunannan Ren 2012-06-18 09:59:55 EDT
patch sent upstream
https://www.redhat.com/archives/libvir-list/2012-June/msg00419.html
Comment 14 Martin Kletzander 2014-02-24 02:28:21 EST
This is a long-standing bug, can you confirm it still happens on 6.5?  I would say there's a high possibility of it being fixed already.
Comment 16 zhoujunqin 2014-02-24 05:36:10 EST
(In reply to Martin Kletzander from comment #14)
> This is a long-standing bug, can you confirm it still happens on 6.5?  I
> would say there's a high possibility of it being fixed already.

Hi Martin Kletzander,
i also hit the problem "virsh comand hangs" on 6.5 system,but "virt-manager"
works well as expected.

Packages version:
libvirt-0.10.2-29.el6_5.4.x86_64
virt-manager-0.9.0-19.el6.x86_64
python-virtinst-0.600.0-18.el6.noarch
# uname -r
2.6.32-431.7.1.el6.x86_64

Setup NFS server and mounted it on source and target.

Steps to Reproduce:
1. Start a domain;
# virsh list
 Id    Name                           State
----------------------------------------------------
 5     rhel6test                      running

2. Start a live migration;
# virsh migrate rhel6test --live qemu+ssh://target/system --verbose
root@target's password:

3. Open virt-manager connect two hosts for monitor the process of migration.

4. While the migration is happening, stop network on target host.
# virsh migrate rhel6test --live qemu+ssh://target/system --verbose

Migration: [ 92 %]
# virsh list
 Id    Name                           State
----------------------------------------------------
 6     rhel6test                      running

(1)The migration command don't prompt any message and don't stop automatic if migration does not success, virsh command hangs.
(2)virt-manager works well.
Comment 17 Martin Kletzander 2014-02-26 05:44:48 EST
How long did you wait for virsh to finish?  Please try that again and let it settle for a while.  The ssh connection must time out in order for the whole connection to time out.  You can check it works by connecting to the machine by ssh before the migration and if that connection times out before the virsh one does than virsh has a problem, but there is no possibility of virsh to get stuck if there is no connection.

I'm closing it as NOTABUG, feel free to reopen if virsh gets stuck for longer than the ssh you're connecting through.
Comment 18 zhoujunqin 2014-02-26 22:36:01 EST
(In reply to Martin Kletzander from comment #17)

Hi Martin Kletzander,
I have followed your suggestion and tried again with following steps:

1. Start a domain on source host:
# virsh list
 Id    Name                           State
----------------------------------------------------
 5     rhel6test                      running
the same time connect to the target host by ssh connection:
# ssh root@10.66.85.74 -X
root@10.66.85.74's password: 

2. Start a live migration;
# virsh migrate rhel6test --live qemu+ssh://target/system --verbose
root@target's password:

3.While the migration is happening, stop network on target host.
target host:
#service network stop
#date 
Thu Feb 27 10:39:07 CST 2014

source host:
# virsh migrate --live rhel6test qemu+ssh://10.66.85.74/system --verbose 
root@10.66.85.74's password: 
Migration: [ 89 %]

4.after ssh connection to target host met timeout,check the result both on source host and target host:
on target host:
[root@intel-e31225-8-3 ~]# Write failed: Broken pipe
#date
Thu Feb 27 10:57:26 CST 2014

on source host:
# virsh migrate --live rhel6test qemu+ssh://10.66.85.74/system --verbose 
root@10.66.85.74's password: 
Migration: [ 89 %]error: operation failed: migration job: unexpectedly failed
#date
Thu Feb 27 11:10:01 CST 2014

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 10    rhel6test                      running


5.compared the date time,it showed that virsh gets the error output longer than ssh.
so what do you think about this result,let me know,thanks.
Comment 19 Martin Kletzander 2014-02-28 02:02:40 EST
(In reply to zhoujunqin from comment #18)
I'd say this is a timeout that's not as reasonable as I was expecting.  Let's keep it open then and I'll add the support for keepalive in virsh.
Comment 22 Martin Kletzander 2014-03-18 03:50:16 EDT
Fixed upstream with commit v1.2.2-201-g676cb4f:

commit 676cb4f4e762b8682a06c6dab1f690fbcd939550
Author: Martin Kletzander <mkletzan@redhat.com>
Date:   Thu Mar 6 17:20:11 2014 +0100

    virsh: Add keepalive in new vshConnect function

in series with optional commit v1.2.2-202-gb0cf7d6:

commit b0cf7d64614ea1000424534ebbd5738d254c7410
Author: Martin Kletzander <mkletzan@redhat.com>
Date:   Fri Mar 7 11:15:39 2014 +0100

    virsh: Prohibit virConnectOpen* functions in virsh
Comment 25 RHEL Product and Program Management 2014-04-04 18:56:18 EDT
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Note You need to log in before you can comment on or make changes to this bug.