Bug 656845
| Summary: | [libvirt] kill qemu process using 'kill' behaves as user shutdown command and not as 'lost connection with qemu process' | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | abaron, berrange, cpelland, dallan, danken, dyuan, eblake, hateya, iheim, jyang, kxiong, mgoldboi, mjenner, plyons, xen-maint, yeylon, yimwang, ykaul |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.8.6-1.el6 | Doc Type: | Bug Fix |
| Doc Text: | libvirt could not determine whether a domain had crashed or been correctly shut down. This update adds recognition of the SHUTDOWN event sent by qemu when a server is shut down correctly. If this event is not received, the domain is now declared to have crashed. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-19 13:24:29 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 662042 | | |
Description
Haim
2010-11-24 10:08:03 UTC
Does the behavior differ if you send SIGKILL?

This is not as trivial as it sounds. If the domain is killed with virsh destroy, libvirt knows it killed the domain and an appropriate event can be emitted easily. On the other hand, if the qemu process just vanishes, we have to guess from the outside what happened to the domain, and we don't have many ways to detect that:

(1) waitpid(), which is probably what vdsm uses, is the most reliable option, but it is unusable because libvirtd starts qemu processes daemonized, with init as their parent. Even if we changed libvirt to be the parent of all qemu processes it starts, restarting libvirtd would make init their parent and we wouldn't be able to use waitpid anyway.

(2) relying on the powerdown event from qemu: if we don't see that event before the monitor connection is closed, we could treat it as abnormal termination.

Do we have another option?

> (1) waitpid(), which is probably what vdsm uses, is the most reliable option

This is not an option. We explicitly don't want libvirt to be the parent of any VM processes, for stability/reliability reasons.

> (2) relying on the powerdown event from qemu: if we don't see that event before
> the monitor connection is closed, we could treat it as abnormal termination.

This is reasonable. We should always get an event unless QEMU was horribly killed, or crashed. Either way, IMHO this is not a priority, since directly killing QEMU PIDs is unnecessary; libvirt provides an API to do this in a supportable manner.

(In reply to comment #4)
> Either way IMHO this is not a priority, since directly killing QEMU PIDs is
> unnecessary; libvirt provides an API to do this in a supportable manner.

What we fear is a qemu crash. When that happens, RHEV-M-2.2 has a feature that restarts the VM on another host (if the VM is defined as "highly available"). If qemu is shut down politely (from within the guest), this should not happen. We must distinguish orderly shutdowns from crashes.

Dan Kenigsberg, what mechanism do you currently use to distinguish crashes from orderly shutdowns?

(In reply to comment #10)
> Dan Kenigsberg, what mechanism do you currently use to distinguish crashes from
> orderly shutdowns?

In RHEL 5, vdsm listens for qemu's shutdown/powerdown events. If they are received before the qemu process disappears, we consider it an orderly shutdown. Anything else is a crash (unless initiated by vdsm). AFAICT that's Jiří's option (2) in comment 3. (vdsm has the same issue with waitpid as libvirt.)

Fixed upstream by v0.8.6-71-gc778fe9:

commit c778fe967808eb2426ed4851db3ec49a0cdc76ca
Author: Jiri Denemark <jdenemar>
Date:   Thu Dec 9 11:18:32 2010 +0100

    qemu: Distinguish between domain shutdown and crash

    When we get an EOF event on monitor connection, it may be a result of
    either crash or graceful shutdown. QEMU which supports async events
    (i.e., we are talking to it using JSON monitor) emits SHUTDOWN event on
    graceful shutdown. In case we don't get this event by the time monitor
    connection is closed, we assume the associated domain crashed.
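The SHUTDOWN event this fix keys on can be observed directly on a QEMU JSON (QMP) monitor. Below is a minimal, illustrative sketch in Python — not part of the fix or of vdsm — that assumes a standalone qemu was started with `-qmp unix:/tmp/qmp.sock,server,nowait` (a hypothetical socket path; the libvirt-managed monitor socket shown later in this report is held by libvirtd and is not accessible this way). It applies the same heuristic the commit describes: if the monitor connection closes without a SHUTDOWN event having been seen, treat it as a crash.

```python
# Minimal QMP listener sketch (assumption: qemu was started with
# "-qmp unix:/tmp/qmp.sock,server,nowait"; the path is illustrative only).
import json
import socket

QMP_SOCK = "/tmp/qmp.sock"  # hypothetical path, adjust to your setup

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(QMP_SOCK)
reader = sock.makefile("r")

json.loads(reader.readline())                         # QMP greeting
sock.sendall(b'{"execute": "qmp_capabilities"}\r\n')  # enable async events

got_shutdown = False
while True:
    line = reader.readline()
    if not line:            # EOF on the monitor: the qemu process is gone
        break
    msg = json.loads(line)
    if msg.get("event") == "SHUTDOWN":
        got_shutdown = True

# The same decision libvirt now makes when the monitor connection closes:
print("orderly shutdown" if got_shutdown
      else "no SHUTDOWN event seen; assuming the domain crashed")
```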
Verified PASSED on build:

libvirt-0.8.1-29.el6.x86_64
libvirt-client-0.8.1-29.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
qemu-img-0.12.1.2-2.128.el6.x86_64
kernel-2.6.32-93.el6.x86_64

Steps:
1. After killing the qemu process, libvirtd.log contains:

...
10:03:26.051: debug : qemuHandleMonitorEOF:1122 : Received EOF on 0x1901ef0 'vm11'
10:03:26.051: debug : qemuHandleMonitorEOF:1128 : Monitor connection to 'vm11' closed without SHUTDOWN event; assuming the domain crashed
10:03:26.051: debug : qemudShutdownVMDaemon:4289 : Shutting down VM 'vm11' migrated=0
10:03:26.051: debug : qemuMonitorClose:690 : mon=0x1933ca0
10:03:26.051: debug : virEventRemoveHandleImpl:174 : Remove handle w=5
10:03:26.051: debug : virEventRemoveHandleImpl:187 : mark delete 4 16
10:03:26.051: debug : virEventInterruptLocked:664 : Skip interrupt, 1 732268304
10:03:26.052: debug : qemuSecurityDACRestoreSecurityAllLabel:426 : Restoring security label on vm11 migrated=0
10:03:26.052: info : qemuSecurityDACRestoreSecurityFileLabel:80 : Restoring DAC user and group on '/mnt/yimwang/vm11.img'
10:03:26.052: info : qemuSecurityDACSetOwnership:40 : Setting DAC user and group on '/mnt/yimwang/vm11.img' to '0:0'
10:03:26.052: debug : SELinuxRestoreSecurityAllLabel:737 : Restoring security label on vm11
10:03:26.052: info : SELinuxRestoreSecurityFileLabel:367 : Restoring SELinux context on '/mnt/yimwang/vm11.img'
...
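The kill-based check above can also be scripted. The following is a rough sketch under stated assumptions, not the QA team's procedure: the domain is persistent (defined), the qemu PID is supplied on the command line (taken from ps, as in the detailed steps below), and the script only confirms that libvirtd notices the vanished process and marks the domain shut off; whether libvirt classified it as a crash is visible in libvirtd.log or via the lifecycle event.

```python
# Rough verification sketch (assumptions: a persistent domain and its qemu
# PID are passed on the command line, e.g. "http_test 2038").
import os
import signal
import sys
import time

import libvirt

dom_name, qemu_pid = sys.argv[1], int(sys.argv[2])

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName(dom_name)

os.kill(qemu_pid, signal.SIGKILL)   # same effect as "kill -9 <pid>"

# Give libvirtd a moment to see EOF on the monitor and tear the domain down.
for _ in range(30):
    if not dom.isActive():
        break
    time.sleep(1)

state = dom.info()[0]   # virDomainGetInfo: the first field is the domain state
if state == libvirt.VIR_DOMAIN_SHUTOFF:
    print("libvirtd noticed the killed qemu; domain '%s' is now shut off" % dom_name)
else:
    print("unexpected state %d for domain '%s'" % (state, dom_name))
```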
Verified PASSED with libvirt-0.8.6-1.el6.

Detailed steps for comment 16:

Steps:
1. Define and start a domain.
# virsh start http_test
Domain http_test started

2. Edit the "libvirtd.conf" file:
# vi /etc/libvirt/libvirtd.conf
log_level = 1
log_outputs="1:file:/var/lib/libvirt/images/libvirtd.log"

3. # service libvirtd stop

4. # libvirtd

5. Open a second terminal.
# ps -ef | grep qemu
qemu 2038 1 1 04:31 ? 00:00:56 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name http_test -uuid 3d3a297b-8039-7143-033a-7ca7d9feb676 -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/http_test.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/var/lib/libvirt/images/http_test.img,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/var/lib/libvirt/images/test.sio,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=26,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:7e:b1:46,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device AC97,id=sound0,bus=pci.0,addr=0x4 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
root 6793 6774 0 05:36 pts/3 00:00:00 grep qemu

6. # kill -9 2038

7. Get the log from the "/var/lib/libvirt/images/libvirtd.log" file:

...
05:37:36.310: 6718: debug : qemuMonitorIO:580 : Triggering EOF callback error? 0
05:37:36.310: 6718: debug : qemuHandleMonitorEOF:1306 : Received EOF on 0x1bcaec0 'http_test'
05:37:36.310: 6718: debug : qemuHandleMonitorEOF:1313 : Monitor connection to 'http_test' closed without SHUTDOWN event; assuming the domain crashed
05:37:36.310: 6718: debug : qemudShutdownVMDaemon:4748 : Shutting down VM 'http_test' pid=2038 migrated=0
05:37:36.311: 6718: debug : qemuMonitorClose:694 : mon=0x1d43000
05:37:36.311: 6718: debug : virEventRemoveHandleImpl:163 : Remove handle w=5
05:37:36.311: 6718: debug : virEventRemoveHandleImpl:176 : mark delete 4 21
05:37:36.311: 6718: debug : virEventInterruptLocked:655 : Skip interrupt, 1 356493072
05:37:36.311: 6718: debug : qemuSecurityDACRestoreSecurityAllLabel:426 : Restoring security label on http_test migrated=0
05:37:36.311: 6718: info : qemuSecurityDACRestoreSecurityFileLabel:80 : Restoring DAC user and group on '/var/lib/libvirt/images/http_test.img'
05:37:36.311: 6718: info : qemuSecurityDACSetOwnership:40 : Setting DAC user and group on '/var/lib/libvirt/images/http_test.img' to '0:0'
05:37:36.311: 6718: debug : SELinuxRestoreSecurityAllLabel:746 : Restoring security label on http_test
05:37:36.311: 6718: info : SELinuxRestoreSecurityFileLabel:369 : Restoring SELinux context on '/var/lib/libvirt/images/http_test.img'
05:37:36.385: 6718: info : SELinuxSetFilecon:323 : Setting SELinux context on '/var/lib/libvirt/images/http_test.img' to 'system_u:object_r:virt_image_t:s0'
05:37:36.385: 6718: debug : virCgroupNew:555 : New group /libvirt/qemu/http_test
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 0:cpu at /cgroup/cpu in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 1:cpuacct at /cgroup/cpuacct in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 2:cpuset at /cgroup/cpuset in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 3:memory at /cgroup/memory in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 4:devices at /cgroup/devices in
05:37:36.388: 6718: debug : virCgroupDetect:245 : Detected mount/mapping 5:freezer at /cgroup/freezer in
05:37:36.388: 6718: debug : virCgroupMakeGroup:497 : Make group /libvirt/qemu/http_test
...

Verified on:
vdsm-4.9-47.el6.x86_64
libvirt-0.8.7-4.el6.x86_64
qemu-kvm-0.12.1.2-2.129.el6.x86_64

Scenario:
[root@rhev-i32c-01 core]# kill -9 24030
[root@rhev-i32c-01 core]#
libvirtEventLoop::DEBUG::2011-02-02 19:03:10,285::vm::1755::vds.vmlog.e65edc8a-5c7f-4157-86e3-82e692e64adc::(setDownStatus) Changed state to Down: Lost connection with kvm process
Thread-1247::DEBUG::2011-02-02 19:03:12,084::clientIF::46::vds::(wrapper) return getVmStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Down', 'timeOffset': '-1', 'vmId': 'e65edc8a-5c7f-4157-86e3-82e692e64adc', 'exitMessage': 'Lost connection with kvm process', 'exitCode': 1}]}
Thread-1248::WARNING::2011-02-02 19:03:12,560::vm::568::vds.vmlog.e65edc8a-5c7f-4157-86e3-82e692e64adc::(_set_lastStatus) trying to set state to Powering down when already Down
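The vdsm log above shows the management side reacting to the lost qemu process. As a reference point, here is a minimal sketch of how such an agent could consume libvirt's lifecycle event with the Python bindings and tell an orderly shutdown from a lost qemu process. Two caveats are assumptions on my part: the default event-loop helpers used here come from newer libvirt than the 0.8.x builds in this report (where an application has to supply its own event loop), and the exact STOPPED detail code reported for the crash path on these builds is not confirmed here, so both FAILED and CRASHED are handled.

```python
# Minimal lifecycle-event listener sketch (illustrative only; this is not
# vdsm's actual implementation).
import libvirt

def lifecycle_cb(conn, dom, event, detail, opaque):
    if event != libvirt.VIR_DOMAIN_EVENT_STOPPED:
        return
    if detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_SHUTDOWN:
        print("%s: orderly shutdown" % dom.name())
    elif detail in (libvirt.VIR_DOMAIN_EVENT_STOPPED_FAILED,
                    libvirt.VIR_DOMAIN_EVENT_STOPPED_CRASHED):
        # Assumption: the EOF-without-SHUTDOWN path is reported as one of these.
        print("%s: lost connection with qemu process" % dom.name())

libvirt.virEventRegisterDefaultImpl()          # needs newer libvirt-python
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)

while True:
    libvirt.virEventRunDefaultImpl()           # dispatch pending events
```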
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Prior to this update, libvirt was not able to recognize whether a domain crashed or was properly shut down. With this update, a SHUTDOWN event sent by qemu is recognized by libvirt when a domain is properly shut down. If the SHUTDOWN event is not received, the domain is declared to have crashed.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Prior to this update, libvirt was not able to recognize whether a domain crashed or was properly shut down. With this update, a SHUTDOWN event sent by qemu is recognized by libvirt when a domain is properly shut down. If the SHUTDOWN event is not received, the domain is declared to have crashed.
+libvirt could not determine whether a domain had crashed or been correctly shut down. This update adds recognition of the SHUTDOWN event sent by qemu when a server is shut down correctly. If this event is not received, the domain is now declared to have crashed.

An advisory has been issued which should help resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html