Bug 983350

Summary: The running Guest was paused while cancel the migration on the third machine
Product: Red Hat Enterprise Linux 7 Reporter: zhenfeng wang <zhwang>
Component: libvirt Assignee: Peter Krempa <pkrempa>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.0 CC: ajia, dyuan, gsun, mzhan, pkrempa, rbalakri, vivianzhang, ydu, zpeng
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-1.2.7-1.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 983348 Environment:
Last Closed: 2015-03-05 07:20:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 983348    
Bug Blocks:    
Attachments:
Description               Flags
client debug log          none
libvirtd source tar log   none
libvirtd target tar log   none

Description zhenfeng wang 2013-07-11 03:35:31 UTC
+++ This bug was initially created as a clone of Bug #983348 +++


Description of problem:
The running guest was paused when the migration was cancelled from a third machine that connects to the source machine via remote access.

Version-Release number of selected component (if applicable):
kernel-2.6.32-358.2.1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64
libvirt-0.10.2-19.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run setenforce 1 and set the virt_use_nfs SELinux boolean to 1 (on both source and target).

2. Prepare a guest whose image file is on the NFS server, and mount the NFS share on both source and target.
Start the guest on the source machine:
# virsh start rhelguest1
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     rhelguest1                         running
3. Start the migration from the third machine:
# virsh -c qemu+ssh://xx.xx.xx.xx/system migrate rhelguest1 --live qemu+ssh://yy.yy.yy.yy/system --verbose
The authenticity of host 'xx.xx.xx.xx (xx.xx.xx.xx)' can't be established.
RSA key fingerprint is ce:52:b1:64:6c:0c:23:25:1d:9c:22:17:7b:66:0b:68.
Are you sure you want to continue connecting (yes/no)? yes
root.83.194's password:
root.83.191's password:
Migration: [ 31 %]^Cerror: internal error received hangup / error event on socket

4. Check the guest's status on the source host; the guest is in the paused state:
# virsh list
 Id    Name                           State
----------------------------------------------------
 7     rhelguest1                     paused

5. The guest is not paused when the migration is cancelled directly on the source host.

Actual results:
The running guest was paused when the migration was cancelled from a third machine that connects to the source machine via remote access.

Expected results:
The guest should remain in the running state.

Comment 1 zhenfeng wang 2013-07-11 03:39:56 UTC
On RHEL 7 the guest is not always paused; it happens consistently when the migration is cancelled after it has passed 90%, for example:

# virsh -c qemu+ssh://xx.xx.xx.xx/system  migrate --live rhel73 qemu+ssh://yy.yy.yy.yy/system --verbose --unsafe
root.xx.xx's password: 
root.yy.yy's password: 
Migration: [ 96 %]^Cerror: internal error received hangup / error event on socket
error: One or more references were leaked after disconnect from the hypervisor
root.xx.xx's password: 
error: Reconnected to the hypervisor

Comment 4 Peter Krempa 2013-09-03 08:04:58 UTC
Fixed upstream with:

commit b46c4787dde79b015dad67dedda4ccf6ff1a3082
Author: Peter Krempa <pkrempa>
Date:   Thu Aug 29 15:18:20 2013 +0200

    virsh-domain: Avoid killing ssh transport tunnels when cancelling job
    
    The vshWatchJob function registers a SIGINT handler that is used to
    abort the active job and does not terminate virsh. Unfortunately, this
    breaks when using the ssh transport as SIGINT is sent to the foreground
    process group including the ssh transport processes which terminate.
    This breaks the connection and migration is left in a insane state.
    
    With this patch the terminal is modified to ignore key binding that
    sends SIGINT and does the handling manually.
    
    Resoves: https://bugzilla.redhat.com/show_bug.cgi?id=983348
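
As an illustration of the manual handling this commit message describes, here is a minimal C sketch. It is not the code from the patch. It assumes the terminal has already been configured so that the interrupt key arrives as an ordinary byte on stdin instead of raising SIGINT, and that `intr` holds the interrupt character taken from the saved terminal settings (usually Ctrl+C, byte 3). The names `buffer_has_intr` and `wait_for_intr` are hypothetical.

```c
#include <poll.h>
#include <stddef.h>
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if any of the `len` bytes in `buf`
 * equals the interrupt character `intr`, 0 otherwise. */
int buffer_has_intr(cc_t intr, const unsigned char *buf, size_t len)
{
    size_t i;

    if (intr == _POSIX_VDISABLE)       /* no interrupt key was configured */
        return 0;
    for (i = 0; i < len; i++) {
        if (buf[i] == intr)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: wait up to `timeout_ms` for input on `fd` and
 * report whether the interrupt character was typed. */
int wait_for_intr(int fd, cc_t intr, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    unsigned char buf[64];
    ssize_t n;

    if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
        return 0;                      /* timeout, error, or nothing to read */

    n = read(fd, buf, sizeof(buf));
    if (n <= 0)
        return 0;
    return buffer_has_intr(intr, buf, (size_t)n);
}
```

A job-watching loop could call `wait_for_intr(STDIN_FILENO, saved.c_cc[VINTR], 500)` on each iteration (with `saved` being the terminal settings recorded at startup) and abort the job itself when it returns 1. Because no SIGINT is generated, the ssh transport processes in the foreground process group are left alone.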

commit ebef68936396f7eab077e883ac48c4ce0508afa2
Author: Peter Krempa <pkrempa>
Date:   Thu Aug 29 10:36:00 2013 +0200

    virsh: Remember terminal state when starting and add helpers
    
    This patch adds instrumentation to allow modification of config of the
    terminal in virsh and successful reset of the state afterwards.
    
    The added helpers allow to disable receiving of SIGINT when pressing the
    key sequence (Ctrl+C usualy). This normally sends SIGINT to the
    foreground process group which kills ssh processes used for transport of
    the data.
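
The helpers this commit describes could look roughly like the following C sketch. It is an illustration under assumptions, not the code from the patch: the function names are hypothetical, and disabling the interrupt character with `_POSIX_VDISABLE` is simply one standard termios way to stop the key from generating SIGINT.

```c
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: return a copy of `saved` in which the interrupt
 * key no longer raises SIGINT. The input `saved` is left untouched. */
struct termios tty_without_intr_key(struct termios saved)
{
    struct termios t = saved;

    t.c_cc[VINTR] = _POSIX_VDISABLE;   /* interrupt key stops generating SIGINT */
    t.c_lflag &= ~(tcflag_t)ICANON;    /* deliver bytes without waiting for Enter */
    t.c_cc[VMIN] = 1;                  /* read() returns after one byte */
    t.c_cc[VTIME] = 0;                 /* no inter-byte timeout */
    return t;
}

/* Remember the current terminal state; returns -1 if `fd` is not a terminal. */
int tty_save(int fd, struct termios *saved)
{
    return tcgetattr(fd, saved);
}

/* Apply the modified settings derived from the saved state. */
int tty_disable_intr(int fd, const struct termios *saved)
{
    struct termios t = tty_without_intr_key(*saved);

    return tcsetattr(fd, TCSANOW, &t);
}

/* Put the terminal back into the state recorded by tty_save(). */
int tty_restore(int fd, const struct termios *saved)
{
    return tcsetattr(fd, TCSAFLUSH, saved);
}
```

A caller would run `tty_save` once at startup, `tty_disable_intr` before a long-running job, and `tty_restore` afterwards (including on error paths) so that the user's shell gets its original terminal settings back.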

commit 8c725cc10daa666d47ab5a4f2ccc0b196ab608d8
Author: Peter Krempa <pkrempa>
Date:   Mon Aug 26 12:31:51 2013 +0200

    virsh-domain: rename print_job_progress to vshPrintJobProgress

Comment 7 zhengqin 2014-08-26 09:09:21 UTC
Verified this issue with libvirt-1.2.7-1.el7.x86_64:

1. Run setenforce 1 and set the virt_use_nfs SELinux boolean to 1 (on both source and target).

2. Prepare a guest whose image file is on the NFS server, and mount the NFS share on both source and target.

3. Start the guest on the source machine.

4. Start the migration from the third machine, and cancel it at about 96%:

[root@rhel7-c ~]# virsh -c qemu+ssh://10.66.6.xx/system migrate rhel7 --live qemu+ssh://10.66.4.xx/system --verbose
root.6.xx's password: 
root.4.xx's password: 



Migration: [  1 %]
Migration: [ 61 %]
Migration: [ 73 %]
Migration: [ 73 %]^[[A
Migration: [ 74 %]
Migration: [ 76 %]
Migration: [ 81 %]
Migration: [ 95 %]
Migration: [ 96 %]error: operation aborted: migration job: canceled by client

5. The guest is still running on the source side and is not listed on the target side.

Comment 8 vivian zhang 2014-11-10 08:37:08 UTC
Hello Peter,
While regression-testing this bug on RHEL 7.1, I found that after cancelling the migration the reported error is still not accurate, although the guest stays in the running state. Could you please check whether this is a known issue for this bug?

Version-Release number of selected component (if applicable):
libvirt-1.2.8-6.el7.x86_64
qemu-kvm-rhev-2.1.2-6.el7.x86_64
kernel-3.10.0-195.el7.x86_64


How reproducible:
100%

Steps to Reproduce:

1. Run setenforce 1 and set the virt_use_nfs SELinux boolean to 1 (on both source and target).

2. Prepare a guest whose image file is on the NFS server, and mount the NFS share on both source and target.
Start the guest on the source machine:
# virsh list
 Id    Name                           State
----------------------------------------------------
80    vm2                            running

3. Start the migration from the third machine, and press Ctrl+C to cancel it:
# virsh -c qemu+ssh://10.66.7.206/system migrate vm2 --live qemu+ssh://10.66.6.205/system --verbose
root.7.206's password: 
root.6.205's password: 
Migration: [  3 %]^Cerror: internal error: received hangup / error event on socket
root.7.206's password: 
error: Reconnected to the hypervisor

4. Check the guest status again:
# virsh list
 Id    Name                           State
----------------------------------------------------
 80    vm2                            running

As you can see, after cancelling the migration with Ctrl+C the reported error still seems inaccurate, and virsh asks for the source host password again.
I think it would be better to report "error: operation aborted: migration job: canceled by client".

Looking forward to your reply, thanks
vivian zhang

Comment 9 Peter Krempa 2014-11-10 12:56:48 UTC
(In reply to vivian zhang from comment #8)

> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 
> 1. set setenforce 1 && virt_use_nfs 1 (on both source and target)
> 
> 2.prepare a guest which the image file is on the NFS server,and mount the
> nfs server on both source and target
> start the guest on the source machine
> # virsh list
>  Id    Name                           State
> ----------------------------------------------------
> 80    vm2                            running
> 
> 3. start migration on the third machine, and ctrl+c to cancel the migration
> # virsh -c qemu+ssh://10.66.7.206/system migrate vm2 --live

Did you also upgrade libvirt on the machine running this command? As the issue was caused on the client side, it's necessary to specifically upgrade the host running the virsh command.

To make sure, please run "virsh version"

Comment 10 vivian zhang 2014-11-11 01:40:08 UTC
(In reply to Peter Krempa from comment #9)
> (In reply to vivian zhang from comment #8)
> 
> > How reproducible:
> > 100%
> > 
> > Steps to Reproduce:
> > 
> > 1. set setenforce 1 && virt_use_nfs 1 (on both source and target)
> > 
> > 2.prepare a guest which the image file is on the NFS server,and mount the
> > nfs server on both source and target
> > start the guest on the source machine
> > # virsh list
> >  Id    Name                           State
> > ----------------------------------------------------
> > 80    vm2                            running
> > 
> > 3. start migration on the third machine, and ctrl+c to cancel the migration
> > # virsh -c qemu+ssh://10.66.7.206/system migrate vm2 --live
> 
> Did you also upgrade libvirt on the machine running this command? As the
> issue was caused on the client side, it's necessery to specially upgrade the
> host running the virsh command.
> 
> To make sure, please run "virsh version"

Hi Peter,
the libvirt version has been updated, as shown below:

# virsh version
Compiled against library: libvirt 1.2.8
Using library: libvirt 1.2.8
Using API: QEMU 1.2.8
Running hypervisor: QEMU 2.1.2

Comment 11 Peter Krempa 2014-11-12 15:54:00 UTC
(In reply to vivian zhang from comment #10)
> (In reply to Peter Krempa from comment #9)
> > (In reply to vivian zhang from comment #8)

...

> 
> hi, Peter
> the libvirt version has been updated to as below
> 
> # virsh version
> Compiled against library: libvirt 1.2.8
> Using library: libvirt 1.2.8
> Using API: QEMU 1.2.8
> Running hypervisor: QEMU 2.1.2

In that case this should not happen. Can you please provide debug logs from both the client and the daemon that show the issue happening?

Comment 12 vivian zhang 2014-11-13 02:09:28 UTC
(In reply to Peter Krempa from comment #11)
> (In reply to vivian zhang from comment #10)
> > (In reply to Peter Krempa from comment #9)
> > > (In reply to vivian zhang from comment #8)
> 
> ...
> 
> > 
> > hi, Peter
> > the libvirt version has been updated to as below
> > 
> > # virsh version
> > Compiled against library: libvirt 1.2.8
> > Using library: libvirt 1.2.8
> > Using API: QEMU 1.2.8
> > Running hypervisor: QEMU 2.1.2
> 
> In that case this should not happen. Can you please provide debug logs from
> both the client and the daemon that would show the issue happening.


Hi Peter,
I captured three logs:
1. The client log (client1113.log), captured by running the command with debugging enabled on the third machine:
# LIBVIRT_DEBUG=1 virsh -c qemu+ssh://10.66.7.206/system migrate rhel6new --live qemu+ssh://10.66.6.205/system --verbose 

2. The libvirtd.log from the source and the target, with log_level=1 set.

Please take a look first; if anything is unclear, please contact me.

Thanks,
vivianzhang

Comment 13 vivian zhang 2014-11-13 02:11:34 UTC
Created attachment 956908 [details]
client debug log

Comment 14 vivian zhang 2014-11-13 02:18:04 UTC
Created attachment 956909 [details]
libvirtd source tar log

Comment 15 vivian zhang 2014-11-13 02:18:49 UTC
Created attachment 956910 [details]
libvirtd target tar log

Comment 16 vivian zhang 2014-12-23 02:47:36 UTC
I can reproduce this bug on the following build:
libvirt-1.1.1-29.el7.x86_64
qemu-kvm-rhev-1.5.3-60.el7_0.9.x86_64

I could no longer reproduce the issue described in comment 8, so I verified the fix on the latest build:
libvirt-1.2.8-11.el7.x86_64
qemu-kvm-rhev-2.1.2-17.el7.x86_64

Verification steps:

1. Prepare a migration environment with the guest image on an NFS server that is mounted on both the source and target hosts.

2. Run setenforce 1 and turn virt_use_nfs on.

3. On the third machine, start the migration and cancel it at around 90%:
# virsh -c qemu+ssh://xx.xx.xx.xx/system migrate rhel7 --live qemu+ssh://xx.xx.xx.xx/system --verbose
root.xx.xx's password: 
root.xx.xx's password: 
Migration: [ 45 %]
Migration: [ 47 %]
Migration: [ 55 %]
Migration: [ 62 %]
Migration: [ 71 %]
Migration: [ 82 %]
Migration: [ 88 %]
Migration: [ 90 %]
Migration: [ 92 %]
Migration: [ 94 %]
Migration: [ 95 %]^Cerror: operation aborted: migration job: canceled by client

4. Check the guest on the source host; it is still running and works well:
# virsh list
 Id    Name                           State
----------------------------------------------------
 10    rhel7                          running


5. Configure the guest with a SPICE connection, open it using virt-viewer, and repeat steps 3-4; the result is the same:
# virsh -c qemu+ssh://xx.xx.xx.xx/system migrate rhel7 --live qemu+ssh://xx.xx.xx.xx/system --verbose
root.xx.xx's password: 
root.xx.xx's password: 
Migration: [ 95 %]^Cerror: operation aborted: migration job: canceled by client


Moving to verified.

Comment 18 errata-xmlrpc 2015-03-05 07:20:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html