Bug 705405

Summary: Libvirt: libvirt kills VM when SD connectivity is blocked on destination during VM migration
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Reporter: Dafna Ron <dron>
Assignee: Jiri Denemark <jdenemar>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ajia, berrange, dallan, danken, dyuan, gren, mgoldboi, mzhan, ohochman, rwu, syeghiay, weizhan
Fixed In Version: libvirt-0.9.2-1.el6
Doc Type: Bug Fix
Last Closed: 2011-12-06 11:09:21 UTC
Attachments:
  Description       Flags
  logs              none
  logs              none
  logs requested    none

Description Dafna Ron 2011-05-17 15:24:49 UTC
Created attachment 499378
logs

Description of problem:

Starting a VM migration from the SPM host to an HSM host and then blocking storage domain (SD) connectivity on the destination during the migration causes libvirt to kill the VM.

Version-Release number of selected component (if applicable):

ic117
vdsm-4.9-65.el6.x86_64
libvirt-0.8.7-18.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Start a VM migration.
2. Block SD connectivity on the destination host using iptables.
Actual results:

libvirt kills the VM.

Expected results:

VM should not be killed. 

Additional info: logs

Comment 1 Dafna Ron 2011-05-17 15:28:07 UTC
Created attachment 499380
logs

Accidentally attached the wrong logs.
The correct logs are attached now.

Comment 2 Daniel Berrangé 2011-05-17 15:36:00 UTC
The logs provided aren't usable since they contain data from many different operations & guests, and the log level is excluding all QEMU driver info.

Please edit /etc/libvirt/libvirtd.conf and set the following

  log_filters="1:qemu 3:event 1:util 1:security"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"

Do this on both the source and destination hosts used for migration and then

  rm -f /var/log/libvirt/libvirtd.log
  service libvirtd restart

and execute *1* single migration attempt demonstrating the problem, and then.

  service libvirtd stop

and attach the resulting /var/log/libvirt/libvirtd.log from both source & destination to this bug, so that we have a log with only the information for 1 guest and 1 migration attempt.

Also please provide the XML for the guest, and the /var/log/libvirt/qemu/$GUESTNAME.log file from both source and dest hosts.
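
For repeating the libvirtd.conf change on both hosts, a minimal Python sketch follows. It is not part of libvirt: the function name apply_debug_settings is hypothetical, and it assumes simple one-line key=value assignments in the file. It drops any active assignment of the two keys and appends the values requested above, leaving commented lines untouched.

```python
# Settings requested in this comment, copied verbatim.
DEBUG_SETTINGS = {
    "log_filters": '"1:qemu 3:event 1:util 1:security"',
    "log_outputs": '"1:file:/var/log/libvirt/libvirtd.log"',
}


def apply_debug_settings(conf_text: str) -> str:
    """Return conf_text with each debug setting present exactly once.

    Hypothetical helper: assumes one-line `key = value` assignments.
    """
    kept = []
    for line in conf_text.splitlines():
        # The key is whatever precedes the first "="; a commented line
        # such as "#log_filters=..." yields "#log_filters" and is kept.
        key = line.split("=", 1)[0].strip()
        if key in DEBUG_SETTINGS:
            continue  # drop the old active assignment
        kept.append(line)
    kept.extend(f"{key}={value}" for key, value in DEBUG_SETTINGS.items())
    return "\n".join(kept) + "\n"
```

The returned text could then be written back to /etc/libvirt/libvirtd.conf on each host before removing the old log and restarting libvirtd as described above.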

Comment 4 Dafna Ron 2011-05-19 12:02:15 UTC
Created attachment 499818
logs requested

Comment 5 Daniel Berrangé 2011-05-20 08:27:30 UTC
According to those logs everything is working normally.

libvirt.log.source shows migration starting, and finally completing without error:

14:45:55.650: 19216: debug : virJSONValueToString:1062 : result={"execute":"query-migrate"}
14:45:56.441: 19212: debug : virJSONValueFromString:933 : string={"return": {"status": "completed"}}


libvirt.log.dest shows that the target VM started up and accepted the incoming migration, which completed, resulting in a running guest

14:46:53.579: 21160: debug : qemuMonitorJSONIOProcessLine:116 : Line [{"timestamp": {"seconds": 1305805613, "microseconds": 579538}, "event": "RESUME"}]
14:46:53.579: 21160: debug : virJSONValueFromString:933 : string={"timestamp": {"seconds": 1305805613, "microseconds": 579538}, "event": "RESUME"}


Please provide more details of where the actual problem is.

Comment 7 Daniel Berrangé 2011-05-23 09:21:27 UTC
> the problem is that a guest migration from 1 host which has connectivity to
> storage to a 2nd host which loses connectivity to storage (mid migration) will
> cause the guest to shut down instead of the migration to fail. 
>
> 1) we can see in the RHEVM GUI that the guest becomes non-responsive before it
> stops - this usually means that the vdsm lost connectivity to libvirt.
> 2) the libvirt seems to lose connection to the kvm process and kills the vm.
> I spoke to vdsm who said that it looks like a libvirt or kvm issue - but since
> libvirt is the one that kills the guest, I thought we should check with you
> first. 

As outlined in comment #5, the logs of libvirtd you provided show no evidence that the VM on the destination host shut down. The migration completed successfully and the VM is running on the destination. The source VM of course has shut down, as is normal when migration completes.

Please provide updated logs which actually demonstrate the problem you're describing.

Comment 8 Jiri Denemark 2011-05-24 08:48:47 UTC
So the interesting part of the logs (from the target machine) is:

10:12:09.386: 13166: debug : virDomainMigrateFinish2:4136 :
    dconn=0x7f3f94007cf0, dname=omri_xp, cookie=0x7f3f98003bc0,
    cookielen=333, uri=tcp:blond-vdsg.qa.lab.tlv.redhat.com:49157,
    flags=3, retcode=0
10:12:09.936: 13165: error : virStorageFileGetMetadataFromFD:832 :
    cannot read header '/rhev/data-center/91e7a658-5f50-40bc-8ccd-004f8f3de868/6f747221-9351-4fc5-87b6-9294257b7c0b/images/98753b08-b818-4995-94ec-f94197241a7c/a9e00ce5-0658-4c17-a2d8-ae8708479aa2': Input/output error
10:12:09.936: 13165: debug : virDomainFree:2294 :
    dom=0x7f3f8c09e410, (VM: name=omri_xp,
    uuid=723a6e62-772f-4a42-9896-7c31cb7a4976), 
10:12:09.936: 13166: debug : qemuMonitorStartCPUs:954 : mon=0x7f3f8c0da160
10:12:09.937: 13165: warning : virEventUpdateHandleImpl:139 :
    Ignoring invalid update watch -1
10:12:09.937: 13161: debug : virConnectClose:1570 : conn=0x7f3f9010e030
10:12:10.087: 13161: debug : qemuMonitorIO:601 :
    Triggering EOF callback error? 1
10:12:10.087: 13161: debug : qemuHandleMonitorEOF:741 :
    Received EOF on 0x7f3f8c0a08b0 'omri_xp'
10:12:10.087: 13161: debug : qemudShutdownVMDaemon:3460 :
    Shutting down VM 'omri_xp' pid=31721 migrated=0
10:12:10.091: 13166: error : qemuMonitorJSONCommandWithFd:243 :
    cannot send monitor command '{"execute":"cont"}': Connection reset by peer
10:12:10.094: 13161: error : qemudShutdownVMDaemon:3517 :
    Failed to send SIGTERM to omri_xp (31721): No such process

So while the target libvirtd is in the Finish2 phase (which means the domain is
already destroyed on the source) and tries to resume the domain, the qemu
process is no longer there, so the resume fails. This is fixed by the migration
v3 protocol, which ensures that the domain on the source is not killed until
the target has confirmed that the domain was successfully resumed there.
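
The difference in ordering can be illustrated with a toy Python model. This is a sketch of the behaviour described in this comment, not libvirt code: the names Outcome, migrate_v2 and migrate_v3 are hypothetical, and each protocol is reduced to the single decision of when the source domain is destroyed.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    source_alive: bool  # is the domain still present on the source host?
    target_alive: bool  # is the domain running on the target host?


def migrate_v2(target_resume_ok: bool) -> Outcome:
    """Toy model of v2: the source domain is already destroyed when the
    target's Finish2 step tries to resume the guest."""
    source_alive = False              # destroyed before the resume result is known
    target_alive = target_resume_ok   # resume fails if the qemu process is gone
    return Outcome(source_alive, target_alive)


def migrate_v3(target_resume_ok: bool) -> Outcome:
    """Toy model of v3: the source domain is killed only after the target
    has confirmed a successful resume."""
    if target_resume_ok:
        return Outcome(source_alive=False, target_alive=True)
    # Assumption for illustration: on failure the source domain is kept.
    return Outcome(source_alive=True, target_alive=False)


# Storage blocked on the target, so the resume there fails:
print(migrate_v2(target_resume_ok=False))  # guest lost on both hosts
print(migrate_v3(target_resume_ok=False))  # guest survives on the source
```

In this model the failure reported in this bug corresponds to migrate_v2(False), where neither host is left with a running guest.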

The reason why the qemu process vanished can be seen in
/var/log/libvirt/qemu/omri_xp.log:

qemu: could not open disk image
/rhev/data-center/91e7a658-5f50-40bc-8ccd-004f8f3de868/6f747221-9351-4fc5-87b6-9294257b7c0b/images/98753b08-b818-4995-94ec-f94197241a7c/a9e00ce5-0658-4c17-a2d8-ae8708479aa2:
Input/output error
qemu: re-open of
/rhev/data-center/91e7a658-5f50-40bc-8ccd-004f8f3de868/6f747221-9351-4fc5-87b6-9294257b7c0b/images/98753b08-b818-4995-94ec-f94197241a7c/a9e00ce5-0658-4c17-a2d8-ae8708479aa2
failed wth error -1
reopening of drives failed
2011-05-24 10:12:10.088: shutting down
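
To pull the error lines out of a libvirtd debug log like the excerpt at the top of this comment, a small Python sketch may help. The line layout (time, thread id, level, function:line, message) is inferred from that excerpt rather than taken from a libvirt specification, and the names LINE_RE and parse_log are hypothetical. Indented lines are treated as continuations of the previous entry, matching the hand-wrapped excerpt.

```python
import re

# Layout inferred from the excerpt, e.g.
# "10:12:10.087: 13161: debug : qemudShutdownVMDaemon:3460 : message"
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}): "
    r"(?P<thread>\d+): "
    r"(?P<level>\w+) : "
    r"(?P<func>\w+):(?P<lineno>\d+) :"
    r" ?(?P<msg>.*)$"
)


def parse_log(text):
    """Split log text into dicts; fold continuation lines into 'msg'."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match:
            entries.append(match.groupdict())
        elif entries and raw.strip():
            # Not a new entry: append to the message of the previous one.
            previous = entries[-1]
            previous["msg"] = (previous["msg"] + " " + raw.strip()).strip()
    return entries


def errors_only(text):
    """Return only the entries logged at level 'error'."""
    return [entry for entry in parse_log(text) if entry["level"] == "error"]
```

Run against the destination libvirtd.log, errors_only would list entries such as the failed '{"execute":"cont"}' monitor command together with the thread that logged them.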

Comment 9 Omri Hochman 2011-05-24 09:41:01 UTC
*** Bug 707164 has been marked as a duplicate of this bug. ***

Comment 10 Jiri Denemark 2011-05-31 11:11:21 UTC
This will be fixed by rebasing libvirt to 0.9.2 since it contains migration v3 patches. I won't close this bug as a duplicate of migration v3 BZ so that this can be tested and verified as fixed separately.

Comment 11 Daniel Veillard 2011-06-23 03:16:45 UTC
This should be fixed by the libvirt-0.9.2-1.el6 rebase

Comment 12 weizhang 2011-07-04 11:52:59 UTC
Verification passed on:
kernel-2.6.32-156.el6.x86_64
libvirt-0.9.2-1.el6.x86_64
qemu-kvm-0.12.1.2-2.165.el6.x86_64

Steps:
1. Prepare a separate NFS server that is neither the source nor the destination host.
2. Mount the NFS export on both the source and destination hosts.
3. Start a guest whose storage is located on the shared directory.
4. Run "setenforce 1" && "setsebool virt_use_nfs 1" on both hosts.
5. Start the migration on the source:
#  virsh migrate --live vr-rhel6-i386-kvm qemu+ssh://10.66.83.175/system
6. At the same time, on the destination, run:
# iptables -A OUTPUT -d {nfs_ip} -p tcp --dport 2049 -j DROP
7. On the source host, check the guest status.

With libvirt-0.9.1-1.el6.x86_64, the guest is shut off even though the migration has not finished.
With libvirt-0.9.2-1.el6.x86_64, the guest always stays in the running state.
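
Step 7 can be automated by polling the guest state on the source host while the iptables rule is in place on the destination. The Python sketch below shells out to `virsh domstate`; the helper names domstate and watch, the sample count, and the interval are assumptions for illustration, and the `run` and `sleep` parameters exist only so the loop can be exercised without a hypervisor.

```python
import subprocess
import time


def domstate(name, run=subprocess.run):
    """Return the state printed by `virsh domstate NAME`, or "unavailable"."""
    result = run(["virsh", "domstate", name], capture_output=True, text=True)
    if result.returncode != 0:
        return "unavailable"  # e.g. the domain is no longer defined here
    return result.stdout.strip()


def watch(name, samples=10, interval=1.0, run=subprocess.run, sleep=time.sleep):
    """Collect `samples` state readings, `interval` seconds apart."""
    states = []
    for _ in range(samples):
        states.append(domstate(name, run=run))
        sleep(interval)
    return states
```

Calling watch("vr-rhel6-i386-kvm") on the source host during the blocked migration returns the sequence of observed states; a reading other than "running" would match the behaviour seen with the older package.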

Comment 14 Rita Wu 2011-07-06 10:22:12 UTC
Set to VERIFIED per comment 12.

Comment 15 errata-xmlrpc 2011-12-06 11:09:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html