Bug 656845
| Summary: | [libvirt] kill qemu process using 'kill' behaves as user shutdown command and not as 'lost connection with qemu process' | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | abaron, berrange, cpelland, dallan, danken, dyuan, eblake, hateya, iheim, jyang, kxiong, mgoldboi, mjenner, plyons, xen-maint, yeylon, yimwang, ykaul |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.8.6-1.el6 | Doc Type: | Bug Fix |
| Doc Text: | libvirt could not determine whether a domain had crashed or been correctly shut down. This update adds recognition of the SHUTDOWN event sent by qemu when a server is shut down correctly. If this event is not received, the domain is now declared to have crashed. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-19 13:24:29 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 662042 | | |
Description
Haim
2010-11-24 10:08:03 UTC
Does the behavior differ if you send SIGKILL?

This is not as trivial as it sounds. If the domain is killed with virsh destroy, libvirt knows it killed the domain and an appropriate event can be emitted easily. On the other hand, if the qemu process just vanishes, we have to guess from the outside what happened to the domain, and we don't have many ways to detect that:

(1) waitpid(), which is probably what vdsm uses, is the most reliable option, but it is unusable because libvirtd starts qemu processes daemonized, with init as their parent. Even if we changed libvirt to be the parent of all qemu processes it starts, restarting libvirtd would make init their parent and we wouldn't be able to use waitpid anyway.

(2) relying on the powerdown event from qemu: if we don't see that event before the monitor connection is closed, we could treat it as abnormal termination.

Do we have another option?

> (1) waitpid(), which is probably what vdsm uses, is the most reliable option

This is not an option. We explicitly don't want libvirt to be the parent of any VM processes, for stability/reliability reasons.

> (2) relying on the powerdown event from qemu: if we don't see that event before
> the monitor connection is closed, we could treat it as abnormal termination.

This is reasonable. We should always get an event unless QEMU was horribly killed, or crashed. Either way, IMHO this is not a priority, since directly killing QEMU PIDs is unnecessary; libvirt provides an API to do this in a supportable manner.

(In reply to comment #4)
> Either way IMHO this is not a priority, since directly killing QEMU PIDs is
> unnecessary; libvirt provides an API to do this in a supportable manner.

What we fear is a qemu crash. When that happens, RHEV-M-2.2 has a feature that restarts the VM on another host (if the VM is defined as "highly available"). If qemu is shut down politely (from within the guest), this should not happen. We must distinguish orderly shutdowns from crashes.

Dan Kenigsberg, what mechanism do you currently use to distinguish crashes from orderly shutdowns?

(In reply to comment #10)
> Dan Kenigsberg, what mechanism do you currently use to distinguish crashes from
> orderly shutdowns?

In RHEL 5, vdsm listens for qemu's shutdown/powerdown events. If they are received before the qemu process disappears, we consider it an orderly shutdown. Anything else is a crash (unless initiated by vdsm). AFAICT that's Jiří's option (2) in comment 3. (vdsm has the same issue with waitpid as libvirt.)

Fixed upstream by v0.8.6-71-gc778fe9:

commit c778fe967808eb2426ed4851db3ec49a0cdc76ca
Author: Jiri Denemark <jdenemar>
Date:   Thu Dec 9 11:18:32 2010 +0100

    qemu: Distinguish between domain shutdown and crash

    When we get an EOF event on monitor connection, it may be a result of
    either crash or graceful shutdown. QEMU which supports async events
    (i.e., we are talking to it using JSON monitor) emits SHUTDOWN event on
    graceful shutdown. In case we don't get this event by the time monitor
    connection is closed, we assume the associated domain crashed.
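The SHUTDOWN event this fix keys on can be observed directly on a QEMU JSON (QMP) monitor. Below is a minimal, illustrative sketch in Python — not part of the fix or of vdsm — that assumes a standalone qemu was started with `-qmp unix:/tmp/qmp.sock,server,nowait` (a hypothetical socket path; the libvirt-managed monitor socket shown later in this report is held by libvirtd and is not accessible this way). It applies the same heuristic the commit describes: if the monitor connection closes without a SHUTDOWN event having been seen, treat it as a crash.

```python
# Minimal QMP listener sketch (assumption: qemu was started with
# "-qmp unix:/tmp/qmp.sock,server,nowait"; the path is illustrative only).
import json
import socket

QMP_SOCK = "/tmp/qmp.sock"  # hypothetical path, adjust to your setup

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(QMP_SOCK)
reader = sock.makefile("r")

json.loads(reader.readline())                         # QMP greeting
sock.sendall(b'{"execute": "qmp_capabilities"}\r\n')  # enable async events

got_shutdown = False
while True:
    line = reader.readline()
    if not line:            # EOF on the monitor: the qemu process is gone
        break
    msg = json.loads(line)
    if msg.get("event") == "SHUTDOWN":
        got_shutdown = True

# The same decision libvirt now makes when the monitor connection closes:
print("orderly shutdown" if got_shutdown
      else "no SHUTDOWN event seen; assuming the domain crashed")
```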
Verified PASSED on build:

libvirt-0.8.1-29.el6.x86_64
libvirt-client-0.8.1-29.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
qemu-img-0.12.1.2-2.128.el6.x86_64
kernel-2.6.32-93.el6.x86_64

Steps:
1. After killing the qemu process, libvirtd.log contains:

...
10:03:26.051: debug : qemuHandleMonitorEOF:1122 : Received EOF on 0x1901ef0 'vm11'
10:03:26.051: debug : qemuHandleMonitorEOF:1128 : Monitor connection to 'vm11' closed without SHUTDOWN event; assuming the domain crashed
10:03:26.051: debug : qemudShutdownVMDaemon:4289 : Shutting down VM 'vm11' migrated=0
10:03:26.051: debug : qemuMonitorClose:690 : mon=0x1933ca0
10:03:26.051: debug : virEventRemoveHandleImpl:174 : Remove handle w=5
10:03:26.051: debug : virEventRemoveHandleImpl:187 : mark delete 4 16
10:03:26.051: debug : virEventInterruptLocked:664 : Skip interrupt, 1 732268304
10:03:26.052: debug : qemuSecurityDACRestoreSecurityAllLabel:426 : Restoring security label on vm11 migrated=0
10:03:26.052: info : qemuSecurityDACRestoreSecurityFileLabel:80 : Restoring DAC user and group on '/mnt/yimwang/vm11.img'
10:03:26.052: info : qemuSecurityDACSetOwnership:40 : Setting DAC user and group on '/mnt/yimwang/vm11.img' to '0:0'
10:03:26.052: debug : SELinuxRestoreSecurityAllLabel:737 : Restoring security label on vm11
10:03:26.052: info : SELinuxRestoreSecurityFileLabel:367 : Restoring SELinux context on '/mnt/yimwang/vm11.img'
...
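The kill-based check above can also be scripted. The following is a rough sketch under stated assumptions, not the QA team's procedure: the domain is persistent (defined), the qemu PID is supplied on the command line (taken from ps, as in the detailed steps below), and the script only confirms that libvirtd notices the vanished process and marks the domain shut off; whether libvirt classified it as a crash is visible in libvirtd.log or via the lifecycle event.

```python
# Rough verification sketch (assumptions: a persistent domain and its qemu
# PID are passed on the command line, e.g. "http_test 2038").
import os
import signal
import sys
import time

import libvirt

dom_name, qemu_pid = sys.argv[1], int(sys.argv[2])

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName(dom_name)

os.kill(qemu_pid, signal.SIGKILL)   # same effect as "kill -9 <pid>"

# Give libvirtd a moment to see EOF on the monitor and tear the domain down.
for _ in range(30):
    if not dom.isActive():
        break
    time.sleep(1)

state = dom.info()[0]   # virDomainGetInfo: the first field is the domain state
if state == libvirt.VIR_DOMAIN_SHUTOFF:
    print("libvirtd noticed the killed qemu; domain '%s' is now shut off" % dom_name)
else:
    print("unexpected state %d for domain '%s'" % (state, dom_name))
```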
Verified PASSED with libvirt-0.8.6-1.el6.

Detailed steps for comment 16:

Steps:
1. Define and start a domain.
# virsh start http_test
Domain http_test started

2. Edit the "libvirtd.conf" file:
# vi /etc/libvirt/libvirtd.conf
log_level = 1
log_outputs="1:file:/var/lib/libvirt/images/libvirtd.log"

3. # service libvirtd stop

4. # libvirtd

5. Open a second terminal.
# ps -ef | grep qemu
qemu 2038 1 1 04:31 ? 00:00:56 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name http_test -uuid 3d3a297b-8039-7143-033a-7ca7d9feb676 -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/http_test.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/var/lib/libvirt/images/http_test.img,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/var/lib/libvirt/images/test.sio,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=26,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:7e:b1:46,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device AC97,id=sound0,bus=pci.0,addr=0x4 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
root 6793 6774 0 05:36 pts/3 00:00:00 grep qemu

6. # kill -9 2038

7. Get the log from the "/var/lib/libvirt/images/libvirtd.log" file:

...
05:37:36.310: 6718: debug : qemuMonitorIO:580 : Triggering EOF callback error? 0
05:37:36.310: 6718: debug : qemuHandleMonitorEOF:1306 : Received EOF on 0x1bcaec0 'http_test'
05:37:36.310: 6718: debug : qemuHandleMonitorEOF:1313 : Monitor connection to 'http_test' closed without SHUTDOWN event; assuming the domain crashed
05:37:36.310: 6718: debug : qemudShutdownVMDaemon:4748 : Shutting down VM 'http_test' pid=2038 migrated=0
05:37:36.311: 6718: debug : qemuMonitorClose:694 : mon=0x1d43000
05:37:36.311: 6718: debug : virEventRemoveHandleImpl:163 : Remove handle w=5
05:37:36.311: 6718: debug : virEventRemoveHandleImpl:176 : mark delete 4 21
05:37:36.311: 6718: debug : virEventInterruptLocked:655 : Skip interrupt, 1 356493072
05:37:36.311: 6718: debug : qemuSecurityDACRestoreSecurityAllLabel:426 : Restoring security label on http_test migrated=0
05:37:36.311: 6718: info : qemuSecurityDACRestoreSecurityFileLabel:80 : Restoring DAC user and group on '/var/lib/libvirt/images/http_test.img'
05:37:36.311: 6718: info : qemuSecurityDACSetOwnership:40 : Setting DAC user and group on '/var/lib/libvirt/images/http_test.img' to '0:0'
05:37:36.311: 6718: debug : SELinuxRestoreSecurityAllLabel:746 : Restoring security label on http_test
05:37:36.311: 6718: info : SELinuxRestoreSecurityFileLabel:369 : Restoring SELinux context on '/var/lib/libvirt/images/http_test.img'
05:37:36.385: 6718: info : SELinuxSetFilecon:323 : Setting SELinux context on '/var/lib/libvirt/images/http_test.img' to 'system_u:object_r:virt_image_t:s0'
05:37:36.385: 6718: debug : virCgroupNew:555 : New group /libvirt/qemu/http_test
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 0:cpu at /cgroup/cpu in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 1:cpuacct at /cgroup/cpuacct in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 2:cpuset at /cgroup/cpuset in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 3:memory at /cgroup/memory in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 4:devices at /cgroup/devices in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 5:freezer at /cgroup/freezer in
05:37:36.388: 6718: debug : virCgroupMakeGroup:497 : Make group /libvirt/qemu/http_test
...

Verified on:
vdsm-4.9-47.el6.x86_64
libvirt-0.8.7-4.el6.x86_64
qemu-kvm-0.12.1.2-2.129.el6.x86_64

Scenario:
[root@rhev-i32c-01 core]# kill -9 24030
[root@rhev-i32c-01 core]#
libvirtEventLoop::DEBUG::2011-02-02 19:03:10,285::vm::1755::vds.vmlog.e65edc8a-5c7f-4157-86e3-82e692e64adc::(setDownStatus) Changed state to Down: Lost connection with kvm process
Thread-1247::DEBUG::2011-02-02 19:03:12,084::clientIF::46::vds::(wrapper) return getVmStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Down', 'timeOffset': '-1', 'vmId': 'e65edc8a-5c7f-4157-86e3-82e692e64adc', 'exitMessage': 'Lost connection with kvm process', 'exitCode': 1}]}
Thread-1248::WARNING::2011-02-02 19:03:12,560::vm::568::vds.vmlog.e65edc8a-5c7f-4157-86e3-82e692e64adc::(_set_lastStatus) trying to set state to Powering down when already Down
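The vdsm log above shows the management side reacting to the lost qemu process. As a reference point, here is a minimal sketch of how such an agent could consume libvirt's lifecycle event with the Python bindings and tell an orderly shutdown from a lost qemu process. Two caveats are assumptions on my part: the default event-loop helpers used here come from newer libvirt than the 0.8.x builds in this report (where an application has to supply its own event loop), and the exact STOPPED detail code reported for the crash path on these builds is not confirmed here, so both FAILED and CRASHED are handled.

```python
# Minimal lifecycle-event listener sketch (illustrative only; this is not
# vdsm's actual implementation).
import libvirt

def lifecycle_cb(conn, dom, event, detail, opaque):
    if event != libvirt.VIR_DOMAIN_EVENT_STOPPED:
        return
    if detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_SHUTDOWN:
        print("%s: orderly shutdown" % dom.name())
    elif detail in (libvirt.VIR_DOMAIN_EVENT_STOPPED_FAILED,
                    libvirt.VIR_DOMAIN_EVENT_STOPPED_CRASHED):
        # Assumption: the EOF-without-SHUTDOWN path is reported as one of these.
        print("%s: lost connection with qemu process" % dom.name())

libvirt.virEventRegisterDefaultImpl()          # needs newer libvirt-python
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)

while True:
    libvirt.virEventRunDefaultImpl()           # dispatch pending events
```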
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Prior to this update, libvirt was not able to recognize whether a domain crashed or was properly shut down. With this update, a SHUTDOWN event sent by qemu is recognized by libvirt when a domain is properly shut down. If the SHUTDOWN event is not received, the domain is declared to have crashed.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Prior to this update, libvirt was not able to recognize whether a domain crashed or was properly shut down. With this update, a SHUTDOWN event sent by qemu is recognized by libvirt when a domain is properly shut down. If the SHUTDOWN event is not received, the domain is declared to have crashed.
+libvirt could not determine whether a domain had crashed or been correctly shut down. This update adds recognition of the SHUTDOWN event sent by qemu when a server is shut down correctly. If this event is not received, the domain is now declared to have crashed.

An advisory has been issued which should help resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html