Bug 531983

Summary:

libvirtd hangs forever when a qemu process hangs under the KVM hypervisor

Product:

Red Hat Enterprise Linux 5

Reporter:

wmg <wezhang>

Component:

libvirt

Assignee:

Daniel Veillard <veillard>

Status:

CLOSED ERRATA

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

high

Docs Contact:

Priority:

low

Version:

5.4

CC:

berrange, djuran, dwu, dyuan, herrold, jdenemar, jtluka, markmc, mjenner, tao, virt-maint, xen-maint, ydu

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

libvirt-0.8.2-1.el5

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

647844

Environment:

Last Closed:

2011-01-13 22:52:09 UTC

Attachments:

Description	Flags
log info	none

Description wmg 2009-10-30 05:45:40 UTC

Description of problem:



libvirtd hangs forever when a qemu process hangs.

Version-Release number of selected component (if applicable):

[root@intel-sunriseridge-02 das]# rpm -q --qf '%{n}-%{v}-%{r}.%{arch}\n' libvirt 
libvirt-0.6.3-20.el5.i386
libvirt-0.6.3-20.el5.x86_64


[root@ht ~]# rpm -q --qf '%{n}-%{v}-%{r}.%{arch}\n'  libvirt
libvirt-0.6.2-18.fc11.x86_64



How reproducible:

100%


Steps to Reproduce:
1. Set up a virtual guest with qemu-kvm and start it.
2. Stop the qemu process with the following:
 kill -STOP  `ps aux | grep qemu | grep -v grep | awk '{print $2}'`
3. Run the following command:
virsh list
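
A minimal sketch of the reproduction above with a bounded wait, so the check itself cannot hang forever. It assumes GNU coreutils `timeout` is installed. The helper name `check_with_stopped` and its arguments are hypothetical and are not part of libvirt or virsh.

```shell
#!/bin/sh
# Sketch only: suspend a process, run a command under a time limit,
# then resume the process. Helper name and arguments are illustrative.

# check_with_stopped PID LIMIT CMD [ARGS...]
#   Prints "hung" if CMD was still running after LIMIT seconds while PID
#   was stopped (timeout(1) exits with status 124), otherwise "returned".
check_with_stopped() {
    pid=$1
    limit=$2
    shift 2
    kill -STOP "$pid" || return 1
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    # Always resume the stopped process before reporting.
    kill -CONT "$pid"
    if [ "$status" -eq 124 ]; then
        echo "hung"
    else
        echo "returned"
    fi
}
```

With a single qemu guest running, a call such as `check_with_stopped "$(pgrep -o -f qemu)" 60 virsh list` would report whether `virsh list` came back within 60 seconds. The 60-second limit is an arbitrary choice for illustration.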
  
Actual results:

virsh list hangs forever.


Expected results:

virsh list should return the list of running VMs, reporting an unknown status for the guest whose process was stopped.


Additional info:

In some situations where qemu hangs or is stopped, libvirt appears to block on reads of the qemu status.
This is not the right behavior for a virtualization management program.

Comment 1 wmg 2009-10-30 05:47:11 UTC

As the versions listed above show, this also affects the libvirt package in Fedora 11.

Comment 2 Daniel Berrangé 2009-11-11 12:33:58 UTC

FYI, the latest upstream libvirt will now time out after 30 seconds of inactivity. This was a hugely invasive code change though, so I don't think there is any chance of backporting it to RHEL-5 libvirt.  It will be included in RHEL-6 though.

Comment 3 Daniel Berrangé 2009-11-11 12:34:39 UTC

This is the upstream patch series that addressed it

http://www.redhat.com/archives/libvir-list/2009-November/msg00083.html

Comment 4 wmg 2009-11-11 12:53:40 UTC

(In reply to comment #2)
> FYI, latest upstream libvirt will now timeout after 30 seconds of inactivity.
> This change was hugely invasive code though, so I don't think there is any
> chance of backporting this to RHEL-5 libvirt.  It will be included in RHEL-6
> though.  

Maybe 30 seconds is too long?

Comment 5 Daniel Berrangé 2009-11-11 14:24:24 UTC

30 seconds is appropriate for this. If it were any lower then it would cause unnecessary timeouts when the host is under high load and QEMU is slow to respond.  Having a completely dead QEMU is an unusual occurrence and not something we want to optimize for. Similarly, we do not want to optimize for the scenario of someone deliberately sending SIGSTOP to the process.

Comment 6 Daniel Veillard 2009-12-16 15:32:49 UTC

I'm afraid that patchset is really too heavy for full backporting to RHEL-5,
and trying to split things out would produce a lot of untested code, so
I think it's reasonable to kill this for RHEL-5 and make it a RHEL-6-only
fix. If qemu-kvm dies in RHEL-5, libvirtd won't recover; nasty, but
hard to avoid at this point.

Daniel

Comment 8 Mark Wu 2010-07-28 02:58:12 UTC

The customer hopes this can be fixed in RHEL5.

Comment 11 Jiri Denemark 2010-09-02 11:57:31 UTC

Fixed in libvirt-0.8.2-1.el5

Comment 14 yanbing du 2010-10-26 10:09:36 UTC

On rhel5.6-server-x86_64, the bug still exists.
# rpm -q libvirt
libvirt-0.8.2-8.el5
Steps to Reproduce:
1. Set up a virtual guest with qemu-kvm and start it.
2. Stop the qemu process with the following:
 kill -STOP  `ps aux | grep qemu | grep -v grep | awk '{print $2}'`
3. Run the following command:
virsh list
# virsh list
 Id Name                 State
----------------------------------


Actual results:

virsh list hangs at this point.

Comment 15 Daniel Berrangé 2010-10-29 09:56:30 UTC

Please install libvirt-debuginfo, attach 'gdb' to the libvirtd process and run 'thread apply all backtrace' and attach the log to this bug.
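
A minimal sketch of collecting the requested backtrace non-interactively, assuming `gdb` is installed and the caller has permission to attach to the process. The helper name `dump_threads` and the output file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: attach gdb to a running process in batch mode and save
# a backtrace of every thread to a file.

# dump_threads PID OUTFILE
dump_threads() {
    pid=$1
    out=$2
    if [ -z "$pid" ] || [ -z "$out" ]; then
        echo "usage: dump_threads PID OUTFILE" >&2
        return 2
    fi
    # -batch exits after running the -ex command, detaching from the process.
    gdb -p "$pid" -batch -ex 'thread apply all backtrace' >"$out" 2>&1
}
```

For example, `dump_threads "$(pidof libvirtd)" libvirtd-backtrace.log` would write the log that can then be attached to the bug. Symbol names in the output are only readable when the matching debuginfo package is installed.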

Comment 16 weizhang 2010-10-29 11:19:49 UTC

Created attachment 456439 [details]
log info

Comment 17 Daniel Berrangé 2010-10-29 11:37:35 UTC

This GDB log doesn't show any evidence of libvirtd itself hanging, just that one guest has hung. So I think you're misunderstanding what problem this bug is attempting to solve.

If you kill -STOP a particular QEMU process, then libvirt will no longer get any response to monitor commands from that process. Any libvirt API that is invoked against that QEMU guest will hang if it requires a monitor command.

The problem was that a hang on this one guest would also cause a hang on all other libvirt guests which are still running normally.

What we solved in this bug is that you can continue to request information about *other* guests, even when this one guest has hung.

We also added a special case to virDomainDumpXML and virDomainGetInfo, so that they skip running any monitor command if there is one that has hung.

So the first 'virsh list' you run will *still* hang.

You should still be able to run other virsh/libvirt APIs against *different* guests, or against things like storage, networking, etc.
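
The behavior described above can be checked with a rough sketch like the following, which probes each guest separately under a time limit so that one unresponsive guest cannot block the check of the others. It assumes GNU coreutils `timeout` is available. The helper name `probe_each` is hypothetical, and `virsh dominfo` in the usage note is only an example command; whether a particular command blocks depends on whether it needs a monitor command.

```shell
#!/bin/sh
# Sketch only: run the same command against several items, each under
# its own time limit, and report the outcome per item.

# probe_each LIMIT "CMD" ITEM...
probe_each() {
    limit=$1
    cmd=$2
    shift 2
    for item in "$@"; do
        status=0
        # $cmd is left unquoted on purpose so "virsh dominfo" splits into words.
        timeout "$limit" $cmd "$item" >/dev/null 2>&1 || status=$?
        case $status in
            0)   echo "$item: ok" ;;
            124) echo "$item: timed out" ;;
            *)   echo "$item: failed ($status)" ;;
        esac
    done
}
```

A call such as `probe_each 10 "virsh dominfo" rhel55 rhel55-1` would print one line per guest. The guest names and the 10-second limit are placeholders.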

Comment 18 R P Herrold 2010-10-29 13:10:09 UTC

But isn't the correct response as simple as setting a 30 second timeout on each select() read block?

from: man 2 select

int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

This does not seem invasive.

Comment 19 Daniel Berrangé 2010-10-29 13:13:57 UTC

No, that would break many commands which are expected to take longer than 30 seconds to complete.

Comment 20 R P Herrold 2010-10-29 15:49:06 UTC

ehh?  your comment 5 says it is appropriate for the 'virsh list' command-set which is the one hanging

adding the timeout and error notification to 
   virsh list
is what is proposed, not some general roll-in of a 'one size fits all' timeout

-- Russ herrold

Comment 21 Daniel Berrangé 2010-10-29 15:59:18 UTC

The timeout described in comment #5 relates to a timeout while waiting on a lock in another part of the code, not to timing out currently executing monitor commands. Regardless, this isn't the place for design & implementation discussions.

Comment 22 dyuan 2010-10-30 15:05:33 UTC

Verified this bug PASSED with libvirt-0.8.2-10.el5 and removed the needinfo flag.

# ps aux | grep qemu | grep -v grep | awk '{print $2}'
4703
4742

# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running
  4 rhel55-1             running

# kill -STOP 4742

# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running

^C

According to comment 17, the hang affects only guest rhel55-1, not libvirtd itself.

No hang for the following virsh commands.

# virsh list --inactive
 Id Name                 State
----------------------------------

# virsh net-list --all
Name                 State      Autostart
-----------------------------------------
default              active     yes

Comment 24 David Juran 2010-12-29 10:11:14 UTC

*** Bug 625428 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2011-01-13 22:52:09 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0060.html

Comment 27 Dave Allan 2011-02-01 02:18:11 UTC

*** Bug 647845 has been marked as a duplicate of this bug. ***