Bug 903238

Summary: Concurrency/locking causes segfault
Product: Red Hat Enterprise Linux 6
Reporter: Dave Allan <dallan>
Component: libvirt
Assignee: Michal Privoznik <mprivozn>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Docs Contact:
Priority: high
Version: 6.4
CC: acathrow, cpelland, cwei, dallan, dyuan, jdenemar, mprivozn, mzhan, ssullivan, ydu, zhwang
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: libvirt-0.10.2-19.el6
Doc Type: Bug Fix
Doc Text:
Occasionally, when users ran multiple virsh create/destroy loops, a race condition could occur and libvirtd terminated unexpectedly with a segmentation fault. The race could also cause false error messages reporting to the caller that the domain had already been destroyed. With this update, the described script runs to completion without libvirtd crashing.
Story Points: ---
Clone Of: 892901
Environment:
Last Closed: 2013-11-21 08:41:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 892649, 892901    
Bug Blocks: 915353    
Attachments:
Description Flags
I modified the script to match my environment none

Description Dave Allan 2013-01-23 14:44:02 UTC
+++ This bug was initially created as a clone of Bug #892901 +++

+++ This bug was initially created as a clone of Bug #892649 +++

Description of problem:

When running multiple virsh create/destroy loops, sometimes (if the timing is right) a segfault will occur, causing libvirtd to crash. 

Version-Release number of selected component (if applicable):

This problem was introduced in v0.9.12; I cannot reproduce it under v0.9.11.x or older. I can also reproduce it with the latest code from master.

How reproducible:

This posting has the steps to reproduce the problem:

http://www.redhat.com/archives/libvir-list/2012-December/msg01365.html

Steps to Reproduce:
1. Go to the above link and follow the steps outlined there; a rough C sketch of the create/destroy loop follows below.
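
The reproducer in that posting is a shell script driving virsh create/destroy in a loop. Purely as an illustration, here is a minimal C equivalent of that loop against the libvirt API; the inline domain XML and the iteration count are placeholder assumptions, not taken from the original script.

/* Hypothetical stress loop approximating the reproducer: repeatedly
   create and immediately destroy a transient domain, so the destroy
   path races with QEMU monitor EOF handling inside libvirtd.
   Build sketch (assumption): gcc stress.c $(pkg-config --cflags --libs libvirt) */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Placeholder domain definition; the real script used LVM-backed guests. */
static const char *dom_xml =
    "<domain type='qemu'>"
    "  <name>racetest</name>"
    "  <memory>65536</memory>"
    "  <os><type arch='x86_64'>hvm</type></os>"
    "</domain>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to libvirtd\n");
        return EXIT_FAILURE;
    }

    for (int i = 0; i < 1000; i++) {
        virDomainPtr dom = virDomainCreateXML(conn, dom_xml, 0);
        if (!dom) {
            fprintf(stderr, "create failed on iteration %d\n", i);
            break;
        }
        /* Destroy immediately to maximize the chance of hitting the
           window where the monitor sees EOF mid-destroy. */
        if (virDomainDestroy(dom) < 0)
            fprintf(stderr, "destroy failed on iteration %d\n", i);
        virDomainFree(dom);
    }

    virConnectClose(conn);
    return EXIT_SUCCESS;
}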
  
Actual results:

While the script is running its operations against libvirtd, libvirtd will segfault within 10 to 20 minutes.

Expected results:

The outlined script runs to completion without libvirtd crashing.

Additional info:

All additional info is in the mailing list thread, including GDB output from multiple crashes I reproduced. In addition, there was a patch by Michal Privoznik (http://www.redhat.com/archives/libvir-list/2012-December/msg01372.html) that attempted to fix this problem; however, the issue still occurs after applying that patch on top of v1.0.0 or v1.0.1.

Here is Michal's response after I told him his patch wasn't working for me:

http://www.redhat.com/archives/libvir-list/2012-December/msg01378.html

--- Additional comment from Scott Sullivan on 2013-01-22 12:33:46 EST ---

As the original reporter of this bug, I can say that, for me at least, this issue was fixed by this commit:

http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=81621f3e6e45e8681cc18ae49404736a0e772a11

Comment 1 Michal Privoznik 2013-01-23 15:34:26 UTC
Moving to POST:

commit 81621f3e6e45e8681cc18ae49404736a0e772a11
Author:     Daniel P. Berrange <berrange>
AuthorDate: Fri Jan 18 14:33:51 2013 +0000
Commit:     Daniel P. Berrange <berrange>
CommitDate: Fri Jan 18 15:45:38 2013 +0000

    Fix race condition when destroying guests
    
    When running virDomainDestroy, we need to make sure that no other
    background thread cleans up the domain while we're doing our work.
    This can happen if we release the domain object while in the
    middle of work, because the monitor might detect EOF in this window.
    For this reason we have a 'beingDestroyed' flag to stop the monitor
    from doing its normal cleanup. Unfortunately this flag was only
    being used to protect qemuDomainBeginJob, and not qemuProcessKill
    
    This left open a race condition where either libvirtd could crash,
    or alternatively report bogus error messages about the domain already
    having been destroyed to the caller
    
    Signed-off-by: Daniel P. Berrange <berrange>

v1.0.1-349-g81621f3
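
In other words, the monitor's EOF handler and the destroy path can both try to clean up the same domain, and the existing 'beingDestroyed' flag was only consulted around qemuDomainBeginJob, not around qemuProcessKill. The following is a simplified, hypothetical sketch of the pattern the fix establishes; it is not the actual libvirt source, and everything except the beingDestroyed flag name is invented for illustration.

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical reduction of the per-domain state involved in the race. */
typedef struct {
    pthread_mutex_t lock;
    bool beingDestroyed;   /* set while virDomainDestroy is in progress */
    bool active;
} Domain;

/* Monitor thread: runs when QEMU's monitor socket reports EOF. */
void monitor_eof_cleanup(Domain *dom)
{
    pthread_mutex_lock(&dom->lock);
    if (dom->beingDestroyed) {
        /* The destroy path owns cleanup; backing off here is exactly
           the check that was missing around qemuProcessKill. */
        pthread_mutex_unlock(&dom->lock);
        return;
    }
    dom->active = false;
    /* ... tear down per-domain resources ... */
    pthread_mutex_unlock(&dom->lock);
}

/* Destroy path: set the flag *before* killing the process, so the EOF
   that the kill provokes cannot trigger concurrent cleanup above. */
void domain_destroy(Domain *dom)
{
    pthread_mutex_lock(&dom->lock);
    dom->beingDestroyed = true;   /* previously this only guarded job begin */
    pthread_mutex_unlock(&dom->lock);

    /* kill(2) the QEMU process here; its monitor socket now sees EOF,
       but monitor_eof_cleanup() observes beingDestroyed and returns. */

    pthread_mutex_lock(&dom->lock);
    dom->active = false;
    /* ... destroy-side cleanup, then clear the flag ... */
    dom->beingDestroyed = false;
    pthread_mutex_unlock(&dom->lock);
}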

Comment 4 zhenfeng wang 2013-07-19 08:39:25 UTC
Hi Dave,
I need to verify this bug on the latest libvirt version; however, I couldn't reproduce it on my machine, even after running the script many times and for a long time (a whole night and half a day). Can you help me check whether I'm doing something wrong in my reproduction? Thanks.

1. My environment info:
kernel-2.6.32-358.el6.x86_64
libvirt-0.10.2-18.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64

2. Run the script from the above link:
http://www.redhat.com/archives/libvir-list/2012-December/msg01365.html

3. I always hit the following error after running the script for a long time:

####
Shutting down vg_ssd...
	virsh destroy tnrekhko
	virsh list
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Cannot recv data: Connection reset by peer

Starting vg_ssd...
	virsh create /tmp/tnrekhko.cfg
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Cannot recv data: Connection reset by peer

	virsh list
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Cannot recv data: Connection reset by peer

Removing vg_ssd...
	virsh destroy tnrekhko
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Cannot recv data: Connection reset by peer

	virsh list
	lvremove -f /dev/vg_ssd/tnrekhko
  Logical volume "sxxdnpbj" successfully removed
####
# service libvirtd status
libvirtd (pid  4613) is running...

Comment 5 zhenfeng wang 2013-07-19 08:41:20 UTC
Created attachment 775681 [details]
I modified the script to match my environment

Comment 6 Michal Privoznik 2013-07-19 09:05:05 UTC
The problem is that your libvirt connection gets closed for some reason. But since we have the same patch in 6.4.z and we've verified it there successfully, I don't expect this one to be different.

Comment 7 zhenfeng wang 2013-07-24 01:52:59 UTC
I can reproduce this bug with libvirt-0.10.2-18.el6.x86_64.
Following the steps from
http://www.redhat.com/archives/libvir-list/2012-December/msg01365.html
After the script had been running for about 20 minutes, libvirtd crashed:

# virsh list
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

# service libvirtd status
libvirtd dead but pid file exists


Then I retested the above steps with libvirt-0.10.2-19.el6 and ran the script for about 1 hour; libvirtd kept running the whole time, so this bug can be marked VERIFIED.

Comment 9 errata-xmlrpc 2013-11-21 08:41:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html