Bug 671564

Summary: libvirtd crash when exceeding 30 VMs
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.1
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Eric Blake <eblake>
Assignee: Eric Blake <eblake>
QA Contact: Virtualization Bugs <virt-bugs>
CC: dyuan, eblake, gren, gsun, mzhan, veillard, xen-maint
Target Milestone: rc
Target Release: 6.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-0.8.7-4.el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 13:26:04 UTC

Description Eric Blake 2011-01-21 21:12:56 UTC
Description of problem:
Upstream commit e6b68d7 introduced a regression in which the libvirtd daemon could free event handles that were still active (file descriptors being polled, or pending timeouts); if such an event subsequently fires, the result is arbitrary behavior, including crashes.
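
For illustration only, and not libvirt's actual event code, here is a minimal C sketch of the hazard: if a handle is freed while its fd is still in the poll set, the next event on that fd makes the dispatch loop dereference freed memory.

/* Illustrative sketch only (not libvirt's actual event code): why freeing
 * a handle that is still registered with poll() is dangerous. */
#include <poll.h>
#include <stddef.h>

struct handle {
    int fd;
    void (*cb)(int fd, void *opaque);
    void *opaque;
};

static void dispatch_once(struct handle **handles, struct pollfd *fds,
                          size_t n)
{
    poll(fds, n, -1);
    for (size_t i = 0; i < n; i++) {
        if (!(fds[i].revents & (POLLIN | POLLHUP)))
            continue;
        /* If handles[i] was freed while fds[i] stayed in the poll set,
         * this dereferences freed memory and may jump through a stale
         * callback pointer -- arbitrary behavior, including a crash. */
        handles[i]->cb(handles[i]->fd, handles[i]->opaque);
    }
}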

Version-Release number of selected component (if applicable):
libvirt-0.8.7-3.el6

How reproducible:
100%

Steps to Reproduce:
1. start 60 VMs: for i in `seq 60`; do virsh start vm$i; done
2. stop 60 VMs: for i in `seq 60`; do virsh destroy vm$i; done
3. list all known VMs: virsh list --all
  
Actual results:
libvirtd crashed immediately after stopping the 60th VM

Expected results:
libvirt should never crash

Additional info:
This may be the root cause of bug 670848, although the stack trace from the above test does not match that bug report, so this is opened as a separate BZ until we are sure that there aren't any other problems.

Initial upstream patch to fix this and one other bug:
https://www.redhat.com/archives/libvir-list/2011-January/msg00921.html
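
For context, the usual way to close this class of race is deferred removal: removal only marks a handle as deleted, and the free happens after the dispatch loop has finished, so no in-flight event can touch freed memory. The sketch below uses invented names and is not the code from the patch above.

/* Minimal sketch of the deferred-removal pattern; names are illustrative. */
#include <poll.h>
#include <string.h>
#include <stddef.h>

struct handle {
    int fd;
    int deleted;                        /* set on removal, honored below */
    void (*cb)(int fd, void *opaque);
    void *opaque;
};

static struct handle *handles;
static size_t nhandles;

static void remove_handle(size_t i)
{
    handles[i].deleted = 1;             /* defer the actual free */
}

static void dispatch(struct pollfd *fds, size_t nfds)
{
    for (size_t i = 0; i < nfds; i++) {
        if (handles[i].deleted || !(fds[i].revents & POLLIN))
            continue;                   /* never call into a dead handle */
        handles[i].cb(handles[i].fd, handles[i].opaque);
    }

    /* Only now, with no callback in flight, compact the array. */
    for (size_t i = 0; i < nhandles; ) {
        if (handles[i].deleted) {
            memmove(&handles[i], &handles[i + 1],
                    (nhandles - i - 1) * sizeof(*handles));
            nhandles--;
        } else {
            i++;
        }
    }
}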

Comment 2 Daniel Veillard 2011-01-24 02:23:51 UTC
ACK, we really need to get this fixed !

thanks for chasing this, hopefully it will also solve 670848 !

Daniel

Comment 5 Min Zhan 2011-01-30 06:11:53 UTC
Verified as PASSED in the environment below:
# uname -a
Linux intel-5405-32-4.englab.nay.redhat.com 2.6.32-107.el6.x86_64 #1 SMP Thu Jan 27 23:11:23 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

libvirt-0.8.7-4.el6.x86_64
kernel-2.6.32-107.el6.x86_64
qemu-kvm-0.12.1.2-2.132.el6.x86_64

Steps:
1. start 60 VMs: 
# for i in `seq 60`; do virsh start a$i; done
Domain a1 started

Domain a2 started

Domain a3 started
...

2. stop 60 VMs: 
# for i in `seq 60`; do virsh destroy a$i; done
Domain a1 destroyed

Domain a2 destroyed

Domain a3 destroyed
....

3. list all VMs:
# virsh list --all
 Id Name                 State
----------------------------------
  - a1                   shut off
  - a10                  shut off
  - a11                  shut off
  - a12                  shut off
  - a13                  shut off
  - a14                  shut off
  - a15                  shut off
  - a16                  shut off
  - a17                  shut off
  - a18                  shut off
  - a19                  shut off
  - a2                   shut off
  - a20                  shut off
.....
--------------
I have reproduced this bug with libvirt-0.8.7-3.el6. It also applies with 60 guests.
# for i in `seq 40`; do virsh start a$i; done

# for i in `seq 40`; do virsh destroy a$i; done
...
Domain a38 destroyed

Domain a39 destroyed

error: cannot recv data: : Connection reset by peer
error: failed to connect to the hypervisor


# virsh list --all
error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Connection refused
error: failed to connect to the hypervisor

# service libvirtd status
libvirtd dead but pid file exists

Comment 8 errata-xmlrpc 2011-05-19 13:26:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html