Bug 692663 - [Libvirt] Libvirtd hangs when qemu processes are unresponsive
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
6.1
x86_64 Linux
Priority: high  Severity: high
: rc
: ---
Assigned To: Michal Privoznik
Virtualization Bugs
:
Duplicates: 634069 665979 669777
Depends On:
Blocks:
 
Reported: 2011-03-31 15:27 EDT by David Naori
Modified: 2011-12-06 06:03 EST (History)
19 users

See Also:
Fixed In Version: libvirt-0.9.4-10.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 06:03:50 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
gdb (213.47 KB, text/plain)
2011-03-31 15:33 EDT, David Naori
Overview of the way libvirt dispatches RPC & issues involved with QEMU monitor blocking (9.13 KB, text/plain)
2011-08-11 06:49 EDT, Daniel Berrange
libvirtd crash log (64.12 KB, text/plain)
2011-09-07 05:22 EDT, weizhang

Description David Naori 2011-03-31 15:27:01 EDT
Description of problem:
When running ~180 VMs using vdsm and sending SIGSTOP to 5 qemu processes, libvirtd hangs forever.

Version-Release number of selected component (if applicable):
libvirt-0.8.7-15
vdsm-4.9-57

How reproducible:
100%

Steps to Reproduce:
1. Run 180 VMs
2. Send SIGSTOP (kill -19) to 5 of the qemu processes

Attached: gdb `t a a bt full` (thread apply all backtrace full) output from libvirtd.
Comment 1 David Naori 2011-03-31 15:33:40 EDT
Created attachment 489208 [details]
gdb
Comment 2 RHEL Product and Program Management 2011-04-03 22:05:28 EDT
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 3 Jiri Denemark 2011-05-09 09:09:23 EDT
This BZ is queued behind others I'm currently working on so I still don't have any update to put here.
Comment 4 Jiri Denemark 2011-05-16 04:17:11 EDT
The problem is a combination of several factors:

- all existing workers are waiting for a reply from qemu
- we only start new workers when accepting new client connections, not when
  a new request arrives through an existing connection
- by starting new workers for new requests instead of new connections we would
  only increase the number from 5 to max_workers (20 by default)

So I think we should do something clever to make libvirt robust enough to be
able to survive any number of such misbehaving guests so that we can at least
ask libvirtd to kill them (once this functionality is in).

I was thinking about tracking how long workers are occupied with processing
their current request and automatically creating a new worker for an incoming
request if all workers have been occupied for more than some limit.
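The idea above can be sketched as follows. This is a toy model in Python, not libvirt's actual (C) thread pool, and all names are made up:

```python
import queue
import threading
import time

class GrowingPool:
    """Toy model: remember when each worker started its current job and
    add a worker once every worker has been busy longer than a limit.
    Purely illustrative; libvirt's real worker pool is C code."""

    def __init__(self, workers=2, stuck_limit=0.5):
        self.jobs = queue.Queue()
        self.stuck_limit = stuck_limit
        self.lock = threading.Lock()
        self.busy_since = {}   # worker name -> monotonic start time of its job
        self.workers = 0
        for _ in range(workers):
            self._spawn()

    def _spawn(self):
        name = "worker-%d" % self.workers
        self.workers += 1
        threading.Thread(target=self._run, args=(name,), daemon=True).start()

    def _run(self, name):
        while True:
            job = self.jobs.get()
            with self.lock:
                self.busy_since[name] = time.monotonic()
            job()
            with self.lock:
                del self.busy_since[name]

    def submit(self, job):
        self.jobs.put(job)

    def watchdog_tick(self):
        """Called periodically: grow the pool by one worker if every
        worker has been stuck past the limit (e.g. on a stopped qemu)."""
        now = time.monotonic()
        with self.lock:
            all_stuck = (len(self.busy_since) == self.workers and
                         all(now - t > self.stuck_limit
                             for t in self.busy_since.values()))
        if all_stuck:
            self._spawn()
```

As comment 6 notes, the fix actually merged upstream took a somewhat different approach, but the sketch shows why growing the pool only on new connections is not enough.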
Comment 5 Dave Allan 2011-06-09 21:58:44 EDT
*** Bug 665979 has been marked as a duplicate of this bug. ***
Comment 6 Michal Privoznik 2011-06-16 10:37:02 EDT
During implementation it turned out we need a slightly different approach:

https://www.redhat.com/archives/libvir-list/2011-June/msg00788.html

Waiting for somebody to review and ack.
Comment 7 Dave Allan 2011-06-17 14:29:51 EDT
*** Bug 634069 has been marked as a duplicate of this bug. ***
Comment 8 Dave Allan 2011-06-20 23:35:29 EDT
*** Bug 669777 has been marked as a duplicate of this bug. ***
Comment 10 weizhang 2011-07-17 23:36:41 EDT
Reproduction steps on libvirt-0.8.7-18.el6.x86_64:

1. Change /etc/libvirt/libvirtd.conf:
max_clients = max_workers + 1 (at least)

2. Start 20 guests
 #for i in {1..20}; do virsh start guest$i ; done

3. Send SIGSTOP to all the qemu processes
 #for i in `ps aux | grep qemu | grep -v grep | awk '{print $2}'`; do kill -19 $i;done

4. Run a virsh command
# virsh list

It will hang.
Comment 11 Daniel Berrange 2011-08-11 06:49:33 EDT
Created attachment 517772 [details]
Overview of the way libvirt dispatches RPC & issues involved with QEMU monitor blocking
Comment 12 Michal Privoznik 2011-08-16 12:43:41 EDT
Implementation of the ideas Daniel suggested has been sent upstream (awaiting review):

https://www.redhat.com/archives/libvir-list/2011-August/msg00710.html
Comment 16 weizhang 2011-09-07 05:21:20 EDT
(In reply to comment #10)
> reproduce steps:
> on libvirt-0.8.7-18.el6.x86_64
> 
> 1. change on /etc/libvirt/libvirtd.conf
> max_clients = max_workers + 1 (at least)
> 
> 2.start 20 guest
>  #for i in {1..20}; do virsh start guest$i ; done
> 
> 3. Send SIGSTOP to all the qemu processes
>  #for i in `ps aux | grep qemu | grep -v grep | awk '{print $2}'`; do kill -19
> $i;done
> 
> 4. do virsh command
> # virsh list
> 
> it will hang

Sorry for omitting a step before step 4; it should be:
4. sh memstat.sh

cat memstat.sh
#!/bin/bash

for i in {1..20}
do 
  virsh dommemstat foo_$i &
done

5. Run virsh list
It will hang.

Can be reproduced on
libvirt-0.8.7-18.el6.x86_64
qemu-kvm-0.12.1.2-2.185.el6.x86_64
kernel-2.6.32-193.el6.x86_64

When testing on
libvirt-0.9.4-9.el6.x86_64
qemu-kvm-0.12.1.2-2.185.el6.x86_64
kernel-2.6.32-193.el6.x86_64

when doing step 4 (sh memstat.sh), it reports errors like:
error: Failed to get memory statistics for domain foo_2
error: End of file while reading data: Input/output error

Then libvirtd crashes:
# service libvirtd status
libvirtd dead but pid file exists
Comment 17 weizhang 2011-09-07 05:22:32 EDT
Created attachment 521825 [details]
libvirtd crash log
Comment 18 Michal Privoznik 2011-09-07 08:30:34 EDT
Thanks for catching that. Fixed and moving to POST:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-September/msg00261.html
Comment 19 weizhang 2011-09-07 22:20:17 EDT
Thanks for resolving this so quickly.
Verification passed on
qemu-kvm-0.12.1.2-2.185.el6.x86_64
kernel-2.6.32-193.el6.x86_64
libvirt-0.9.4-10.el6.x86_64

The steps are as comment 16 shows. After step 5, virsh list works fine without hanging.
Comment 20 errata-xmlrpc 2011-12-06 06:03:50 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html
