Bug 531983 - libvirtd hangs forever when qemu hangs with the kvm hypervisor
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: libvirt
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Daniel Veillard
QA Contact: Virtualization Bugs
Duplicates: 625428 647845 (view as bug list)
Depends On:
Blocks:
Reported: 2009-10-30 01:45 EDT by wmg
Modified: 2011-01-31 21:18 EST (History)
13 users

See Also:
Fixed In Version: libvirt-0.8.2-1.el5
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 647844 (view as bug list)
Environment:
Last Closed: 2011-01-13 17:52:09 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
Attachments
log info (10.47 KB, text/plain)
2010-10-29 07:19 EDT, weizhang

Description wmg 2009-10-30 01:45:40 EDT
Description of problem:

libvirtd hangs forever when qemu hangs.

Version-Release number of selected component (if applicable):

[root@intel-sunriseridge-02 das]# rpm -q --qf '%{n}-%{v}-%{r}.%{arch}\n' libvirt 
libvirt-0.6.3-20.el5.i386
libvirt-0.6.3-20.el5.x86_64


[root@ht ~]# rpm -q --qf '%{n}-%{v}-%{r}.%{arch}\n'  libvirt
libvirt-0.6.2-18.fc11.x86_64



How reproducible:

100%


Steps to Reproduce:
1. Set up a virt guest with qemu-kvm and start it.
2. Stop the qemu process with the following:
 kill -STOP `ps aux | grep qemu | grep -v grep | awk '{print $2}'`
3. Run the following command:
virsh list
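The steps above can be combined into a small reproduction sketch (hypothetical; it assumes a single running qemu-kvm guest, and uses `pgrep` plus `timeout` instead of the `ps | grep` pipeline so the expected hang does not block the shell):

```shell
#!/bin/bash
# Sketch of the reproduction steps; assumes exactly one qemu-kvm guest.

# Find the guest's QEMU process (single-guest assumption).
qemu_pid=$(pgrep -f qemu-kvm | head -n 1)

# SIGSTOP freezes QEMU without killing it: its sockets stay open,
# but it stops answering libvirt's monitor commands.
kill -STOP "$qemu_pid"

# On affected libvirt versions this call never returns, so bound it.
timeout 30 virsh list || echo "virsh list hung (timed out)"

# Resume the guest when done.
kill -CONT "$qemu_pid"
```

`timeout` exits with status 124 when it kills the command, which is how the hang shows up here.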
  
Actual results:

virsh list hangs forever


Expected results:

virsh list should return the list of running VMs, with the stopped guest shown in an unknown state


Additional info:

In some situations where qemu hangs or is stopped, libvirt seems to be blocked on reads of the qemu status.
This shouldn't be the behavior of a virt management program.
Comment 1 wmg 2009-10-30 01:47:11 EDT
As with the versions I mentioned above, this also affects the libvirt package in Fedora 11.
Comment 2 Daniel Berrange 2009-11-11 07:33:58 EST
FYI, the latest upstream libvirt will now time out after 30 seconds of inactivity. This change was hugely invasive, though, so I don't think there is any chance of backporting it to RHEL-5 libvirt. It will be included in RHEL-6.
Comment 3 Daniel Berrange 2009-11-11 07:34:39 EST
This is the upstream patch series that addressed it

http://www.redhat.com/archives/libvir-list/2009-November/msg00083.html
Comment 4 wmg 2009-11-11 07:53:40 EST
(In reply to comment #2)
> FYI, latest upstream libvirt will now timeout after 30 seconds of inactivity.
> This change was hugely invasive code though, so I don't think there is any
> chance of backporting this to RHEL-5 libvirt.  It will be included in RHEL-6
> though.  

Maybe 30 seconds is too long?
Comment 5 Daniel Berrange 2009-11-11 09:24:24 EST
30 seconds is appropriate for this. If it were any lower, it would cause unnecessary timeouts when the host is under high load and QEMU is slow to respond. Having a completely dead QEMU is an unusual occurrence and not something we want to optimize for. Similarly, we do not want to optimize for the scenario of someone deliberately sending SIGSTOP to the process.
Comment 6 Daniel Veillard 2009-12-16 10:32:49 EST
I'm afraid that patchset is really too heavy for full backporting to RHEL-5,
and trying to split things out would leave a lot of untested code, so
I think it's reasonable to drop this for RHEL-5 and make it a RHEL-6-only
fix. If qemu-kvm dies in RHEL-5, libvirtd won't recover; nasty, but
hard to avoid at this point.

Daniel
Comment 8 Mark Wu 2010-07-27 22:58:12 EDT
The customer hopes this can be fixed on RHEL 5.
Comment 11 Jiri Denemark 2010-09-02 07:57:31 EDT
Fixed in libvirt-0.8.2-1.el5
Comment 14 yanbing du 2010-10-26 06:09:36 EDT
On rhel5.6-server-x86_64, the bug still exists.
# rpm -q libvirt
libvirt-0.8.2-8.el5
Steps to Reproduce:
1. Set up a virt guest with qemu-kvm and start it.
2. Stop the qemu process with the following:
 kill -STOP `ps aux | grep qemu | grep -v grep | awk '{print $2}'`
3. Run the following command:
virsh list
# virsh list
 Id Name                 State
----------------------------------


Actual results:

virsh list hangs
Comment 15 Daniel Berrange 2010-10-29 05:56:30 EDT
Please install libvirt-debuginfo, attach 'gdb' to the libvirtd process and run 'thread apply all backtrace' and attach the log to this bug.
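The request above can be run non-interactively; a sketch (the debuginfo package name and its availability from the configured repositories are assumptions):

```shell
# Install debug symbols so the backtrace shows function names.
yum install -y libvirt-debuginfo

# Attach gdb in batch mode to the running libvirtd, dump the
# backtrace of every thread, and detach without stopping the daemon.
gdb --batch -p "$(pidof -s libvirtd)" \
    -ex 'thread apply all backtrace' > libvirtd-backtrace.log 2>&1
```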
Comment 16 weizhang 2010-10-29 07:19:49 EDT
Created attachment 456439 [details]
log info
Comment 17 Daniel Berrange 2010-10-29 07:37:35 EDT
This GDB log doesn't show any evidence of libvirtd itself hanging; just one guest has hung. So I think you're misunderstanding what problem this bug is attempting to solve.

If you kill -STOP a particular QEMU process, then libvirt will no longer get any response to monitor commands from that process. Any libvirt API that is invoked against that QEMU guest will hang if it requires a monitor command.

The problem was that a hang on this one guest would also cause a hang on all other libvirt guests that were still running normally.

What we solved in this bug is that you can continue to request information about *other* guests, even when this one guest has hung.

We also added a special case to virDomainDumpXML and virDomainGetInfo, so that it skips running any monitor command if there is one that has hung.

So the first 'virsh list' you run will *still* hang.

You should still be able to run other virsh/libvirt APIs against *different* guests, or against things like storage, networking, etc.
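For example, with the guest names from comment 22 (a hypothetical session; `timeout` is added so the one expected hang does not block the shell):

```shell
# The first 'virsh list' still hangs, because it issues a monitor
# command to the frozen guest; bound it so the shell gets control back.
timeout 30 virsh list || echo "first list hung, as described above"

# Queries against other guests and other subsystems still respond:
virsh dominfo rhel55
virsh net-list --all
virsh pool-list --all
```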
Comment 18 R P Herrold 2010-10-29 09:10:09 EDT
But isn't the correct response as simple as setting a 30-second timeout on each select() read block?

from: man 2 select

int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

This does not seem invasive.
Comment 19 Daniel Berrange 2010-10-29 09:13:57 EDT
No, that would break many commands which are expected to take longer than 30 seconds to complete.
Comment 20 R P Herrold 2010-10-29 11:49:06 EDT
Ehh? Your comment 5 says it is appropriate for the 'virsh list' command set, which is the one hanging.

adding the timeout and error notification to 
   virsh list
is what is proposed, not some general roll-in of a 'one size fits all' timeout

-- Russ herrold
Comment 21 Daniel Berrange 2010-10-29 11:59:18 EDT
The timeout described in comment #5 relates to a timeout while waiting on a lock in another part of the code, not to timing out currently executing monitor commands. Regardless, this isn't the place for design & implementation discussions.
Comment 22 dyuan 2010-10-30 11:05:33 EDT
Verified this bug PASSED with libvirt-0.8.2-10.el5 and removed the needinfo flag.

# ps aux | grep qemu | grep -v grep | awk '{print $2}'
4703
4742

# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running
  4 rhel55-1             running

# kill -STOP 4742

# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running

^C

According to comment 17, a hang is only for guest rhel55-1 but not for libvirtd.

No hang for the following virsh commands.

# virsh list --inactive
 Id Name                 State
----------------------------------

# virsh net-list --all
Name                 State      Autostart
-----------------------------------------
default              active     yes
Comment 24 David Juran 2010-12-29 05:11:14 EST
*** Bug 625428 has been marked as a duplicate of this bug. ***
Comment 26 errata-xmlrpc 2011-01-13 17:52:09 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0060.html
Comment 27 Dave Allan 2011-01-31 21:18:11 EST
*** Bug 647845 has been marked as a duplicate of this bug. ***
