Bug 671564

Summary: libvirtd crash when exceeding 30 VMs
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.1
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Eric Blake <eblake>
Assignee: Eric Blake <eblake>
QA Contact: Virtualization Bugs <virt-bugs>
CC: dyuan, eblake, gren, gsun, mzhan, veillard, xen-maint
Target Milestone: rc
Target Release: 6.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-0.8.7-4.el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 13:26:04 UTC

Description Eric Blake 2011-01-21 21:12:56 UTC
Description of problem:
Upstream commit e6b68d7 introduced a regression in which the libvirtd daemon could free event handles that were still active (file descriptors being polled, or pending timeouts); if such an event subsequently fires, the result is arbitrary behavior, including crashes.
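
For illustration only, and not libvirt's actual event code, here is a minimal C sketch of the hazard: if a handle is freed while its fd is still in the poll set, the next event on that fd makes the dispatch loop dereference freed memory.

/* Illustrative sketch only (not libvirt's actual event code): why freeing
 * a handle that is still registered with poll() is dangerous. */
#include <poll.h>
#include <stddef.h>

struct handle {
    int fd;
    void (*cb)(int fd, void *opaque);
    void *opaque;
};

static void dispatch_once(struct handle **handles, struct pollfd *fds,
                          size_t n)
{
    poll(fds, n, -1);
    for (size_t i = 0; i < n; i++) {
        if (!(fds[i].revents & (POLLIN | POLLHUP)))
            continue;
        /* If handles[i] was freed while fds[i] stayed in the poll set,
         * this dereferences freed memory and may jump through a stale
         * callback pointer -- arbitrary behavior, including a crash. */
        handles[i]->cb(handles[i]->fd, handles[i]->opaque);
    }
}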

Version-Release number of selected component (if applicable):
libvirt-0.8.7-3.el6

How reproducible:
100%

Steps to Reproduce:
1. start 60 VMs: for i in `seq 60`; do virsh start vm$i; done
2. stop 60 VMs: for i in `seq 60`; do virsh destroy vm$i; done
3. list all known VMs: virsh list --all
  
Actual results:
libvirtd crashed immediately after stopping the 60th VM

Expected results:
libvirt should never crash

Additional info:
This may be the root cause of bug 670848, although the stack trace from the above test does not match that bug report, so this is opened as a separate BZ until we are sure that there aren't any other problems.

Initial upstream patch to fix this and one other bug:
https://www.redhat.com/archives/libvir-list/2011-January/msg00921.html
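
For context, the usual way to close this class of race is deferred removal: removal only marks a handle as deleted, and the free happens after the dispatch loop has finished, so no in-flight event can touch freed memory. The sketch below uses invented names and is not the code from the patch above.

/* Minimal sketch of the deferred-removal pattern; names are illustrative. */
#include <poll.h>
#include <string.h>
#include <stddef.h>

struct handle {
    int fd;
    int deleted;                        /* set on removal, honored below */
    void (*cb)(int fd, void *opaque);
    void *opaque;
};

static struct handle *handles;
static size_t nhandles;

static void remove_handle(size_t i)
{
    handles[i].deleted = 1;             /* defer the actual free */
}

static void dispatch(struct pollfd *fds, size_t nfds)
{
    for (size_t i = 0; i < nfds; i++) {
        if (handles[i].deleted || !(fds[i].revents & POLLIN))
            continue;                   /* never call into a dead handle */
        handles[i].cb(handles[i].fd, handles[i].opaque);
    }

    /* Only now, with no callback in flight, compact the array. */
    for (size_t i = 0; i < nhandles; ) {
        if (handles[i].deleted) {
            memmove(&handles[i], &handles[i + 1],
                    (nhandles - i - 1) * sizeof(*handles));
            nhandles--;
        } else {
            i++;
        }
    }
}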

Comment 2 Daniel Veillard 2011-01-24 02:23:51 UTC
ACK, we really need to get this fixed !

thanks for chasing this, hopefully it will also solve 670848 !

Daniel

Comment 5 Min Zhan 2011-01-30 06:11:53 UTC
Verified as PASSED in the environment below:
# uname -a
Linux intel-5405-32-4.englab.nay.redhat.com 2.6.32-107.el6.x86_64 #1 SMP Thu Jan 27 23:11:23 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

libvirt-0.8.7-4.el6.x86_64
kernel-2.6.32-107.el6.x86_64
qemu-kvm-0.12.1.2-2.132.el6.x86_64

Steps:
1. start 60 VMs: 
# for i in `seq 60`; do virsh start a$i; done
Domain a1 started

Domain a2 started

Domain a3 started
...

2. stop 60 VMs: 
# for i in `seq 60`; do virsh destroy a$i; done
Domain a1 destroyed

Domain a2 destroyed

Domain a3 destroyed
....

3. list all VMs:
# virsh list --all
 Id Name                 State
----------------------------------
  - a1                   shut off
  - a10                  shut off
  - a11                  shut off
  - a12                  shut off
  - a13                  shut off
  - a14                  shut off
  - a15                  shut off
  - a16                  shut off
  - a17                  shut off
  - a18                  shut off
  - a19                  shut off
  - a2                   shut off
  - a20                  shut off
.....
--------------
I have reproduced this bug with libvirt-0.8.7-3.el6. It also applies with 60 guests.
# for i in `seq 40`; do virsh start a$i; done

# for i in `seq 40`; do virsh destroy a$i; done
...
Domain a38 destroyed

Domain a39 destroyed

error: cannot recv data: : Connection reset by peer
error: failed to connect to the hypervisor


# virsh list --all
error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Connection refused
error: failed to connect to the hypervisor

# service libvirtd status
libvirtd dead but pid file exists

Comment 8 errata-xmlrpc 2011-05-19 13:26:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html