Bug 531983

Summary: libvirtd hangs forever when a qemu process hangs under the KVM hypervisor
Product: Red Hat Enterprise Linux 5
Reporter: wmg <wezhang>
Component: libvirt
Assignee: Daniel Veillard <veillard>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: low
Version: 5.4
CC: berrange, djuran, dwu, dyuan, herrold, jdenemar, jtluka, markmc, mjenner, tao, virt-maint, xen-maint, ydu
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: libvirt-0.8.2-1.el5
Doc Type: Bug Fix
Clone Of:
: 647844 (view as bug list) Environment:
Last Closed: 2011-01-13 22:52:09 UTC
Attachments:
Description Flags
log info none

Description wmg 2009-10-30 05:45:40 UTC
Description of problem:



libvirtd hangs forever when the qemu process hangs.

Version-Release number of selected component (if applicable):

[root@intel-sunriseridge-02 das]# rpm -q --qf '%{n}-%{v}-%{r}.%{arch}\n' libvirt 
libvirt-0.6.3-20.el5.i386
libvirt-0.6.3-20.el5.x86_64


[root@ht ~]# rpm -q --qf '%{n}-%{v}-%{r}.%{arch}\n'  libvirt
libvirt-0.6.2-18.fc11.x86_64



How reproducible:

100%


Steps to Reproduce:
1. Set up a virtual guest with qemu-kvm and start it.
2. Stop the qemu process with the following command:
 kill -STOP  `ps aux | grep qemu | grep -v grep | awk '{print $2}'`
3. Run the following command:
virsh list
  
Actual results:

virsh list hangs forever.


Expected results:

virsh list should return the list of running VMs, showing an unknown status for the guest whose process is stopped.


Additional info:

In some situations where qemu hangs or is stopped, libvirt appears to block on read operations while querying the qemu status.
This is not the right behavior for a virtualization management program.
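
A bounded variant of the reproduction steps, as a minimal sketch. It assumes GNU coreutils `timeout` and procps `pgrep` are installed (they may be missing on older releases). The helper names `signal_matching` and `run_bounded`, the `qemu` pattern, and the 10-second limit are illustrative assumptions, not part of libvirt or virsh.

```shell
#!/bin/sh
# Sketch only: helper names, pattern, and limits are illustrative assumptions.

# Send a signal to every process whose name matches PATTERN.
# pgrep replaces `ps | grep | awk`, so no grep process has to be filtered out.
signal_matching() {
    sig=$1
    pattern=$2
    for pid in $(pgrep "$pattern"); do
        kill "-$sig" "$pid"
    done
}

# Run a command, giving up after LIMIT seconds.
# GNU timeout exits with status 124 when the limit is reached.
run_bounded() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Example session (commented out so the block has no side effects):
#   signal_matching STOP qemu
#   run_bounded 10 virsh list || echo "virsh list gave up, status $?"
#   signal_matching CONT qemu
```

Bounding the probe keeps the test shell usable when `virsh list` blocks, and sending SIGCONT afterwards lets the guest resume instead of leaving it stopped.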

Comment 1 wmg 2009-10-30 05:47:11 UTC
As the versions listed above show, this also affects the libvirt package in Fedora 11.

Comment 2 Daniel Berrangé 2009-11-11 12:33:58 UTC
FYI, the latest upstream libvirt will now time out after 30 seconds of inactivity. The change was hugely invasive, though, so I don't think there is any chance of backporting it to RHEL-5 libvirt. It will be included in RHEL-6, though.

Comment 3 Daniel Berrangé 2009-11-11 12:34:39 UTC
This is the upstream patch series that addressed it

http://www.redhat.com/archives/libvir-list/2009-November/msg00083.html

Comment 4 wmg 2009-11-11 12:53:40 UTC
(In reply to comment #2)
> FYI, latest upstream libvirt will now timeout after 30 seconds of inactivity.
> This change was hugely invasive code though, so I don't think there is any
> chance of backporting this to RHEL-5 libvirt.  It will be included in RHEL-6
> though.  

Maybe 30 seconds is too long?

Comment 5 Daniel Berrangé 2009-11-11 14:24:24 UTC
30 seconds is appropriate for this. If it were any lower, it would cause unnecessary timeouts when the host is under high load and QEMU is slow to respond. Having a completely dead QEMU is an unusual occurrence and not something we want to optimize for. Similarly, we do not want to optimize for the scenario of someone deliberately sending SIGSTOP to the process.

Comment 6 Daniel Veillard 2009-12-16 15:32:49 UTC
I'm afraid that patchset is really too heavy for full backporting to RHEL-5,
and trying to split things out would produce a lot of untested code, so
I think it's reasonable to kill this for RHEL-5 and make it a RHEL-6-only
fix. If qemu-kvm dies in RHEL-5, libvirtd won't recover, which is nasty but
hard to avoid at this point.

Daniel

Comment 8 Mark Wu 2010-07-28 02:58:12 UTC
The customer hopes this can be fixed in RHEL 5.

Comment 11 Jiri Denemark 2010-09-02 11:57:31 UTC
Fixed in libvirt-0.8.2-1.el5

Comment 14 yanbing du 2010-10-26 10:09:36 UTC
On rhel5.6-server-x86_64, the bug still exists.
# rpm -q libvirt
libvirt-0.8.2-8.el5
Steps to Reproduce:
1. Set up a virtual guest with qemu-kvm and start it.
2. Stop the qemu process with the following command:
 kill -STOP  `ps aux | grep qemu | grep -v grep | awk '{print $2}'`
3. Run the following command:
virsh list
# virsh list
 Id Name                 State
----------------------------------


Actual results:

virsh list hangs at this point.

Comment 15 Daniel Berrangé 2010-10-29 09:56:30 UTC
Please install libvirt-debuginfo, attach 'gdb' to the libvirtd process and run 'thread apply all backtrace' and attach the log to this bug.

Comment 16 weizhang 2010-10-29 11:19:49 UTC
Created attachment 456439 [details]
log info

Comment 17 Daniel Berrangé 2010-10-29 11:37:35 UTC
This GDB log doesn't show any evidence of libvirtd itself hanging; just one guest has hung. So I think you're misunderstanding what problem this bug is attempting to solve.

If you kill -STOP a particular QEMU process, then libvirt will no longer get any response to monitor commands from that process. Any libvirt API that is invoked against that QEMU guest will hang if it requires a monitor command.

The problem was that a hang on this one guest would also cause a hang on all other libvirt guests which are still running normally.

What we solved in this bug, is that you can continue to request information about *other* guests, even when this one guest has hung.

We also added a special case to virDomainDumpXML and virDomainGetInfo, so that they skip running any monitor command if there is one that has already hung.

So the first 'virsh list' you run will *still* hang.

You should still be able to run other virsh/libvirt APIs against *different* guests, or against things like storage, networking, etc.
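
One way to check this behavior from a shell, as a minimal sketch. It assumes GNU coreutils `timeout` is available. The helper name `probe_cmd`, the 5-second limit, and the guest name `other-guest` are illustrative assumptions, not part of libvirt or virsh.

```shell
#!/bin/sh
# Sketch only: probe_cmd, the limit, and the guest name are assumptions.

# Run one command with a time limit and report whether it answered.
# Prints "ok", "hung" (limit reached; GNU timeout returns 124), or "failed".
probe_cmd() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0)   echo "ok" ;;
        124) echo "hung" ;;
        *)   echo "failed" ;;
    esac
}

# Example, with one guest's QEMU process stopped (commented out):
#   probe_cmd 5 virsh list                 # still expected to hang, as noted above
#   probe_cmd 5 virsh net-list --all       # networking query
#   probe_cmd 5 virsh dominfo other-guest  # hypothetical guest that is still running
```

Each probe runs separately with its own limit, so a hang in one command does not prevent the remaining checks from running.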

Comment 18 R P Herrold 2010-10-29 13:10:09 UTC
But isn't the correct response as simple as setting a 30-second timeout on each select() read block?

from: man 2 select

int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

This does not seem invasive.

Comment 19 Daniel Berrangé 2010-10-29 13:13:57 UTC
No, that would break many commands which are expected to take longer than 30 seconds to complete.

Comment 20 R P Herrold 2010-10-29 15:49:06 UTC
ehh?  your comment 5 says it is appropriate for the 'virsh list' command-set which is the one hanging

adding the timeout and error notification to 
   virsh list
is what is proposed, not some general roll-in of a 'one size fits all' timeout

-- Russ herrold

Comment 21 Daniel Berrangé 2010-10-29 15:59:18 UTC
The timeout described in comment #5 relates to a timeout while waiting on a lock in another part of the code, not to timing out monitor commands that are currently executing. Regardless, this isn't the place for design & implementation discussions.

Comment 22 dyuan 2010-10-30 15:05:33 UTC
Verified this bug PASSED with libvirt-0.8.2-10.el5 and removed the needinfo flag.

# ps aux | grep qemu | grep -v grep | awk '{print $2}'
4703
4742

# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running
  4 rhel55-1             running

# kill -STOP 4742

# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running

^C

According to comment 17, the hang affects only guest rhel55-1, not libvirtd.

No hang for the following virsh commands.

# virsh list --inactive
 Id Name                 State
----------------------------------

# virsh net-list --all
Name                 State      Autostart
-----------------------------------------
default              active     yes

Comment 24 David Juran 2010-12-29 10:11:14 UTC
*** Bug 625428 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2011-01-13 22:52:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0060.html

Comment 27 Dave Allan 2011-02-01 02:18:11 UTC
*** Bug 647845 has been marked as a duplicate of this bug. ***