Bug 531983
| Field | Value |
|---|---|
| Summary | libvirtd hangs forever when qemu hangs under the KVM hypervisor |
| Product | Red Hat Enterprise Linux 5 |
| Component | libvirt |
| Version | 5.4 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | low |
| Reporter | wmg <wezhang> |
| Assignee | Daniel Veillard <veillard> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | berrange, djuran, dwu, dyuan, herrold, jdenemar, jtluka, markmc, mjenner, tao, virt-maint, xen-maint, ydu |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Fixed In Version | libvirt-0.8.2-1.el5 |
| Doc Type | Bug Fix |
| Clones | 647844 (view as bug list) |
| Last Closed | 2011-01-13 22:52:09 UTC |
Description

wmg, 2009-10-30 05:45:40 UTC

As with the versions I mentioned above, this also affects the libvirt package in Fedora 11.

---

FYI, the latest upstream libvirt will now time out after 30 seconds of inactivity. This change was hugely invasive, though, so I don't think there is any chance of backporting it to RHEL-5 libvirt. It will be included in RHEL-6. This is the upstream patch series that addressed it:

http://www.redhat.com/archives/libvir-list/2009-November/msg00083.html

---

(In reply to comment #2)

> FYI, latest upstream libvirt will now timeout after 30 seconds of inactivity.
> This change was hugely invasive code though, so I don't think there is any
> chance of backporting this to RHEL-5 libvirt. It will be included in RHEL-6
> though.

Maybe 30 seconds is too long?

---

30 seconds is appropriate for this. If it were any lower, it would cause unnecessary timeouts when the host is under high load and QEMU is slow to respond. Having a completely dead QEMU is an unusual occurrence and not something we want to optimize for. Similarly, we do not want to optimize for the scenario of someone deliberately sending SIGSTOP to the process.

---

I'm afraid that patchset is really too heavy for full backporting to RHEL-5, and trying to split things out would create a lot of untested code, so I think it's reasonable to close this for RHEL-5 and make it a RHEL-6-only fix. If qemu-kvm dies in RHEL-5, libvirtd won't recover; nasty, but hard to avoid at this point.

Daniel

---

The customer hopes it can be fixed on RHEL 5.

---

Fixed in libvirt-0.8.2-1.el5.

---

On rhel5.6-server-x86_64, the bug still exists.

    # rpm -q libvirt
    libvirt-0.8.2-8.el5

Steps to Reproduce:

1. Set up a virt guest with qemu-kvm and start it.
2. Stop the qemu process with the following:

       kill -STOP `ps aux | grep qemu | grep -v grep | awk '{print $2}'`

3.
Run the following command:

    # virsh list
     Id Name                 State
    ----------------------------------

Actual results: virsh list hangs there.

---

Please install libvirt-debuginfo, attach gdb to the libvirtd process, run 'thread apply all backtrace', and attach the log to this bug.

---

Created attachment 456439 [details]
log info
This GDB log doesn't show any evidence of libvirtd itself hanging; just one guest has hung. So I think you're misunderstanding what problem this bug is attempting to solve.

If you kill -STOP a particular QEMU process, then libvirt will no longer get any response to monitor commands from that process. Any libvirt API that is invoked against that QEMU guest will hang if it requires a monitor command. The problem was that a hang on this one guest would also cause a hang on all other libvirt guests which are still running normally.

What we solved in this bug is that you can continue to request information about *other* guests, even when this one guest has hung. We also added a special case to virDomainDumpXML and virDomainGetInfo, so that they skip running any monitor command if there is one that has hung. So the first 'virsh list' you run will *still* hang. You should still be able to run other virsh/libvirt APIs against *different* guests, or against things like storage, networking, etc.

---

But isn't the correct response as simple as setting a 30-second timeout on each select() read block? From:

    man 2 select

    int select(int nfds, fd_set *readfds, fd_set *writefds,
               fd_set *exceptfds, struct timeval *timeout);

This does not seem invasive.

---

No, that would break many commands which are expected to take longer than 30 seconds to complete.

---

Ehh? Your comment 5 says it is appropriate for the 'virsh list' command set, which is the one hanging. Adding the timeout and error notification to virsh list is what is proposed, not some general roll-in of a one-size-fits-all timeout.

-- Russ Herrold

---

The timeout described in comment #5 relates to waiting on a lock in another part of the code, not to timing out currently executing monitor commands. Regardless, this isn't the place for design and implementation discussions.

---

Verified this bug PASSED with libvirt-0.8.2-10.el5 and removed the needinfo flag.
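The fixed behaviour described above can be exercised with a short script. A sketch, assuming coreutils' timeout(1) is available and at least one qemu-kvm guest is running; the PID selection and the 10-second bound are illustrative choices, not part of the fix:

```shell
#!/bin/sh
# Sketch: freeze one guest's emulator and show that a libvirt call
# touching that guest blocks, while a bounded wait keeps the shell alive.
# The PID selection simply takes the first qemu-kvm process found.
pid=$(pgrep -f qemu-kvm | head -n 1) || exit 1

kill -STOP "$pid"    # monitor commands to this guest now hang

# The first 'virsh list' still hangs by design; bound the wait so the
# script is not stuck forever.
timeout 10 virsh list || echo "virsh list blocked (expected for the frozen guest)"

# Queries that avoid the frozen guest's monitor should still answer.
timeout 10 virsh net-list --all

kill -CONT "$pid"    # unfreeze the guest
```

The script mirrors the distinction drawn in the comment above: the hung guest's own APIs still block, but libvirtd as a whole stays responsive.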
    # ps aux | grep qemu | grep -v grep | awk '{print $2}'
    4703
    4742

    # virsh list --all
     Id Name                 State
    ----------------------------------
      3 rhel55               running
      4 rhel55-1             running

    # kill -STOP 4742

    # virsh list --all
     Id Name                 State
    ----------------------------------
      3 rhel55               running
    ^C

According to comment 17, the hang is only for guest rhel55-1, not for libvirtd. No hang for the following virsh commands:

    # virsh list --inactive
     Id Name                 State
    ----------------------------------

    # virsh net-list --all
    Name                 State      Autostart
    -----------------------------------------
    default              active     yes

---

*** Bug 625428 has been marked as a duplicate of this bug. ***

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0060.html

---

*** Bug 647845 has been marked as a duplicate of this bug. ***