Bug 918959

Summary: [abrt] libvirt-0.10.2-18.el6: _int_free: Process /usr/sbin/libvirtd was killed by signal 11 (SIGSEGV)
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.4
Hardware: x86_64
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Keywords: Regression
Target Milestone: rc
Target Release: ---
Reporter: David Jaša <djasa>
Assignee: John Ferlan <jferlan>
QA Contact: Virtualization Bugs <virt-bugs>
Docs Contact:
CC: acathrow, dyasny, dyuan, eblake, jentrena, mzhan, pkrempa, pzhukov, rdassen, rwu, tdosek, whuang, ydu
Whiteboard: abrt_hash:2fc968e737a27deb64b13469804ac233fbd92448
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-05-08 17:53:35 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 835616, 928309, 960054
Attachments (no flags set on any):
  maps
  var_log_messages
  environ
  dso_list
  limits
  sosreport.tar.xz
  backtrace
  build_ids
  cgroup

Description David Jaša 2013-03-07 10:07:24 UTC
Description of problem:
This crash occurred during back-and-forth migration of a VM (with another instance of the libvirt).

Version-Release number of selected component:
libvirt-0.10.2-18.el6

Additional info:
libreport version: 2.0.9
abrt_version:   2.0.8
backtrace_rating: 4
cmdline:        libvirtd --daemon --listen
crash_function: _int_free
kernel:         2.6.32-358.el6.x86_64

truncated backtrace:
:Thread no. 1 (7 frames)
: #0 _int_free at malloc.c
: #1 virFree at util/memory.c
: #2 virObjectUnref at util/virobject.c
: #3 virEventPollCleanupHandles at util/event_poll.c
: #4 virEventPollRunOnce at util/event_poll.c
: #5 virEventRunDefaultImpl at util/event.c
: #6 virNetServerRun at rpc/virnetserver.c

Comment 1 David Jaša 2013-03-07 10:07:28 UTC
Created attachment 706483 [details]
File: maps

Comment 2 David Jaša 2013-03-07 10:07:30 UTC
Created attachment 706484 [details]
File: var_log_messages

Comment 3 David Jaša 2013-03-07 10:07:32 UTC
Created attachment 706485 [details]
File: environ

Comment 4 David Jaša 2013-03-07 10:07:35 UTC
Created attachment 706486 [details]
File: dso_list

Comment 5 David Jaša 2013-03-07 10:07:37 UTC
Created attachment 706487 [details]
File: limits

Comment 6 David Jaša 2013-03-07 10:07:46 UTC
Created attachment 706488 [details]
File: sosreport.tar.xz

Comment 7 David Jaša 2013-03-07 10:07:49 UTC
Created attachment 706489 [details]
File: backtrace

Comment 8 David Jaša 2013-03-07 10:07:51 UTC
Created attachment 706490 [details]
File: build_ids

Comment 9 David Jaša 2013-03-07 10:07:54 UTC
Created attachment 706491 [details]
File: cgroup

Comment 14 Pavel Zhukov 2013-03-25 14:05:24 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=924756 a duplicate of this bug?

Comment 15 John Ferlan 2013-03-25 22:49:17 UTC
Just so you know - I am digging into this. It's a bit slow going, as I am new to digging into RH/libvirtd problems. I'm working under the assumption that it's an error-path issue right now.

I know the case indicates the error occurred with back-and-forth migration; however, I'm curious whether there was anything else being attempted. I note in the messages output from the sos tarball that there's a series of "Listening on interface #xx" and "Deleting interface #xx" messages right around the crash (where xx = 10, 11, 12, 13, 14, 15, & 16).

Around the time interfaces 13, 15, & 16 go through their iterations, there are other segfaults listed in the output involving qemu-kvm and libspice-server.so.

The reason I note this is I have to "wonder" if this type of migration was working without error until only recently.  What caught my eye was the yum.log output indicating a recent change/update to spice-server and I'm wondering if there's a relationship between the two. I'm not pointing fingers, but just trying to glean some more data.  In particular if this was working well previously and libvirt didn't change, then what other environmental factor could have caused a failure.

Comment 16 Pavel Zhukov 2013-03-26 07:52:34 UTC
(In reply to comment #15)

> The reason I note this is I have to "wonder" if this type of migration was
> working without error until only recently.  What caught my eye was the
> yum.log output indicating a recent change/update to spice-server and I'm
> wondering if there's a relationship between the two. I'm not pointing
> fingers, but just trying to glean some more data.  In particular if this was
> working well previously and libvirt didn't change, then what other
> environmental factor could have caused a failure.

John, it's a RHEV-H system. You could not find any changes because there is no yum there, and we cannot install custom RPMs without hacks...
FYI, the problem occurred with "Red Hat Enterprise Virtualization Hypervisor release 6.4 (20130306.2.el6_4)" and its bundled libvirt-0.10.2-18.el6.

Comment 17 Eric Blake 2013-03-26 20:19:07 UTC
I wonder if this upstream patch has any relation:
https://www.redhat.com/archives/libvir-list/2013-March/msg01489.html

Comment 18 Eric Blake 2013-03-26 20:28:28 UTC
Another one worth looking at (still needs upstream review as I type this comment):
https://www.redhat.com/archives/libvir-list/2013-March/msg01469.html

Comment 21 Eric Blake 2013-04-02 21:31:16 UTC
(In reply to comment #0)
> Description of problem:
> This crash occurred during back-and-forth migration of a VM (with another
> instance of the libvirt).

Was this using peer-to-peer migration? If so, then I'm pretty sure this patch series explains the problem:

https://www.redhat.com/archives/libvir-list/2013-March/msg01682.html

> 
> truncated backtrace:
> :Thread no. 1 (7 frames)
> : #0 _int_free at malloc.c
> : #1 virFree at util/memory.c
> : #2 virObjectUnref at util/virobject.c
> : #3 virEventPollCleanupHandles at util/event_poll.c

At any rate, this portion of the stack trace is consistent with trying to free through a pointer deleted in another thread.

Comment 22 Eric Blake 2013-04-08 22:19:03 UTC
Peter's patches to fix the close callback race solve a problem introduced in upstream 0.10.0, and therefore present in RHEL 6.4 (based on upstream 0.10.2) but not 6.3 (based on upstream 0.9.10):
https://www.redhat.com/archives/libvir-list/2013-April/msg00672.html
As such, I'm adding the regression flag.

Comment 23 Peter Krempa 2013-04-09 10:20:44 UTC
A scratch build containing patches that are believed to fix this problem is available at:

https://brewweb.devel.redhat.com/taskinfo?taskID=5610687

Comment 25 David Jaša 2013-04-09 19:07:04 UTC
(In reply to comment #21)
> (In reply to comment #0)
> > Description of problem:
> > This crash occurred during back-and-forth migration of a VM (with another
> > instance of the libvirt).
> 
> Was this using peer-to-peer migration?

If peer-to-peer migration is the result of commands like this:
virsh -c qemu+tcp://source_host/system migrate --live VM_NAME qemu+tcp://dest_host/system

then yes, it was peer-to-peer migration.

I hit the bug just once, though, so I'm not able to say decisively whether the bug is fixed for me.

Comment 26 Eric Blake 2013-04-09 19:22:25 UTC
Peer-to-peer migration involves the --p2p flag of 'virsh migrate'. But the command line you used omitted --p2p, so it was a direct migration. http://libvirt.org/migration.html shows the difference: in direct migration, libvirt.so is the client to two different libvirtd processes; in peer-to-peer migration, libvirt.so is the client to only one libvirtd process, and that libvirtd is in turn a client to another libvirtd.
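For illustration, the two invocations differ only by the --p2p flag (host and VM names here are placeholders, following the form David used in comment 25):

```shell
# Direct migration: virsh (libvirt.so) connects to BOTH the source and
# the destination libvirtd itself.
virsh -c qemu+tcp://source_host/system migrate --live VM_NAME \
      qemu+tcp://dest_host/system

# Peer-to-peer migration: virsh connects only to the source libvirtd,
# which then acts as a client to the destination libvirtd.
virsh -c qemu+tcp://source_host/system migrate --live --p2p VM_NAME \
      qemu+tcp://dest_host/system
```

In the p2p form, the source libvirtd is itself an RPC client, which is why a client-side close-callback crash could take down the source daemon.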

Peter's patches (comment 23) had to do with a crash in the client; that would explain the source libvirtd crashing on a peer-to-peer migration (since the source is a client to the destination), but would not explain you seeing a crash in libvirtd with direct migration (there, you would expect virsh to die as the client to either source or destination, but not for libvirtd to die).  See also bug 911609.

I'm still looking for other potential races, where the race would affect the server rather than the client, to match with your report of libvirtd crashing on a direct migration.

Comment 27 Eric Blake 2013-04-10 02:04:30 UTC
Bug 915353 describes a crash on shutdown; it was fixed in libvirt-0.10.2-18.el6_4.1. I'm starting to think that this particular fix is the one that solves the problem at hand.

Comment 28 Eric Blake 2013-04-10 02:21:54 UTC
Another possible cause is a crash on auto-destroy, bug 950286. Migration uses auto-destroy on the destination until the source is far enough along in the migration process, so a bug there could crash libvirtd.

Comment 29 John Ferlan 2013-05-08 17:53:35 UTC
Since this problem was not easily reproduced, and a patch is available that resolves similarly described problems, I was asked to close this bug as INSUFFICIENT_DATA with a reference to the available patch.

If after installing the updates described here:

http://rhn.redhat.com/errata/RHBA-2013-0756.html

the problem still occurs, feel free to reopen this bug or file a new one.