Bug 1033061

Summary: [abrt] libvirt-daemon-1.1.1-12.el7: xsltFreeExtDef: Process /usr/sbin/libvirtd was killed by signal 11 (SIGSEGV)
Product: Red Hat Enterprise Linux 7 Reporter: Tomas Dolezal <todoleza>
Component: libvirtAssignee: Michal Privoznik <mprivozn>
Status: CLOSED CURRENTRELEASE QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.0CC: acathrow, dyuan, mprivozn, mzhan, todoleza, ydu, zhwang
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: abrt_hash:b60cc3ca4585400bb07667cd9b86aa38e74d8d0c
Fixed In Version: libvirt-1.1.1-14.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-13 12:15:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
File: backtrace
none
File: cgroup
none
File: core_backtrace
none
File: dso_list
none
File: environ
none
File: limits
none
File: maps
none
File: open_fds
none
File: proc_pid_status
none
File: var_log_messages none

Description Tomas Dolezal 2013-11-21 13:50:44 UTC
Description of problem:
restarted libvirtd.service because of  inability to communicate with vm over network

Version-Release number of selected component:
libvirt-daemon-1.1.1-12.el7

Additional info:
reporter:       libreport-2.1.7
backtrace_rating: 4
cmdline:        /usr/sbin/libvirtd
crash_function: xsltFreeExtDef
executable:     /usr/sbin/libvirtd
kernel:         3.10.0-50.el7.x86_64
runlevel:       N 5
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (10 frames)
 #1 xsltFreeExtDef at extensions.c:137
 #2 xsltFreeExtDefList at extensions.c:157
 #3 xsltFreeExts at extensions.c:482
 #4 xsltFreeStylesheet at xslt.c:964
 #5 drv_close at drv_redhat.c:383
 #6 ncf_close at netcf.c:101
 #7 virObjectUnref at util/virobject.c:262
 #8 netcfInterfaceClose at interface/interface_backend_netcf.c:265
 #9 virConnectDispose at datatypes.c:149
 #10 virObjectUnref at util/virobject.c:262

Comment 1 Tomas Dolezal 2013-11-21 13:50:51 UTC
Created attachment 827186 [details]
File: backtrace

Comment 2 Tomas Dolezal 2013-11-21 13:50:56 UTC
Created attachment 827187 [details]
File: cgroup

Comment 3 Tomas Dolezal 2013-11-21 13:51:02 UTC
Created attachment 827188 [details]
File: core_backtrace

Comment 4 Tomas Dolezal 2013-11-21 13:51:07 UTC
Created attachment 827189 [details]
File: dso_list

Comment 5 Tomas Dolezal 2013-11-21 13:51:13 UTC
Created attachment 827190 [details]
File: environ

Comment 6 Tomas Dolezal 2013-11-21 13:51:17 UTC
Created attachment 827191 [details]
File: limits

Comment 7 Tomas Dolezal 2013-11-21 13:51:30 UTC
Created attachment 827194 [details]
File: maps

Comment 8 Tomas Dolezal 2013-11-21 13:51:36 UTC
Created attachment 827197 [details]
File: open_fds

Comment 9 Tomas Dolezal 2013-11-21 13:51:43 UTC
Created attachment 827199 [details]
File: proc_pid_status

Comment 10 Tomas Dolezal 2013-11-21 13:51:49 UTC
Created attachment 827201 [details]
File: var_log_messages

Comment 11 Tomas Dolezal 2013-11-21 13:52:00 UTC
Created attachment 827205 [details]
File: sosreport.tar.xz

Comment 13 Michal Privoznik 2013-11-26 17:50:48 UTC
Just for the record; it seems like we're hitting a gcc bug here. From the stack trace we can clearly see virObjectUnref() is called from two separate threads over the very same object (obj = 0x7f4968018260). Both threads got reference counter equal to zero and hence started executing destroy callback.

The code snippet that is responsible for atomic dec & test:

#  define virAtomicIntDecAndTest(atomic)                                \
    (__extension__ ({                                                   \
            (void)verify_true(sizeof(*(atomic)) == sizeof(int));        \
            (void)(0 ? *(atomic) ^ *(atomic) : 0);                      \
            __sync_fetch_and_sub((atomic), 1) == 1;                     \
        }))

This function is supposed to return true if the callee held the last reference to an object. Otherwise, false is expected. However, it seems like __sync_fetch_and_sub() is not that atomic as one might think as it returned 1 in both threads. I'm investigating this further and I'll keep the BZ updated.

Comment 14 Michal Privoznik 2013-11-28 12:19:18 UTC
Tomas, can you please attach debug logs http://wiki.libvirt.org/page/DebugLogs  and give me steps to reproduce? I'm unable to reproduce this myslef.

Comment 15 Michal Privoznik 2013-11-29 11:06:36 UTC
I *think* this patch may help me to understand what's going on. I've built a scratch build for you to test (if you have some spare time):

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=6664683

Please make sure you've got debug logs turned on when reproducing the bug:

http://wiki.libvirt.org/page/DebugLogs

Comment 16 Tomas Dolezal 2013-11-29 15:09:25 UTC
Hi Michal,
unfortunately this abrt-caught bug's reproducer is unknown to me. It happened just after restarting libvirtd.service. I did it because networking between host&guest was not working at all. Neither the restart helped as it usually does. All went running again after reboot.

Comment 17 Michal Privoznik 2013-12-02 17:21:02 UTC
Nevermind. With couple of tries (and intentionally inserted sleep()-s into the right places) I was able to reproduce it. And I think I know what's going on. Unfortunatelly, I don't have a design how to fix it. Neiter has the upstream:

http://www.redhat.com/archives/libvir-list/2013-October/msg00754.html

Comment 18 Michal Privoznik 2013-12-03 10:39:55 UTC
I believe this patch solves the issue:

https://www.redhat.com/archives/libvir-list/2013-December/msg00149.html

Comment 21 yanbing du 2013-12-13 07:09:48 UTC
Thanks for @Michal's mail.
Now, for libvirt-1.1.1-14.el7.x86_64, the logs look like good, it this OK to move this bug to VERIFIED?
...
2013-12-13 07:03:51.031+0000: 11933: debug : virObjectNew:199 : OBJECT_NEW: obj=0x7f0f340b8d50 classname=virNetcfDriverState
2013-12-13 07:03:51.456+0000: 11933: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.457+0000: 11933: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.522+0000: 11933: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.522+0000: 11933: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.522+0000: 11933: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.523+0000: 11933: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.523+0000: 11933: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.524+0000: 11933: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f0f340b8d50
...

Comment 22 Michal Privoznik 2013-12-13 08:05:03 UTC
I rather see whole log before confirming for sure. The key point here is that an object is not used *after* it has been disposed. That is:

2013-12-13 07:03:51.031+0000: 11933: debug : virObjectNew:199 : OBJECT_NEW: obj=0x7f0f340b8d50 classname=virNetcfDriverState

This creates a new object. Then we see a couple of refs and unrefs:

2013-12-13 07:03:51.456+0000: 11933: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f0f340b8d50
2013-12-13 07:03:51.457+0000: 11933: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f0f340b8d50

So far so good. But what's missing is the last part, where the object is disposed, something like

2013-12-13 07:03:51.457+0000: 11933: debug : virObjectUnref:256 : OBJECT_DISPOSE: obj=0x7f0f340b8d50

and then we must not see any REF on obj=0x7f0f340b8d50.

Comment 23 yanbing du 2013-12-13 08:29:24 UTC
(In reply to Michal Privoznik from comment #22)
> I rather see whole log before confirming for sure. The key point here is
> that an object is not used *after* it has been disposed. That is:
> 
> 2013-12-13 07:03:51.031+0000: 11933: debug : virObjectNew:199 : OBJECT_NEW:
> obj=0x7f0f340b8d50 classname=virNetcfDriverState
> 
> This creates a new object. Then we see a couple of refs and unrefs:
> 
> 2013-12-13 07:03:51.456+0000: 11933: debug : virObjectRef:293 : OBJECT_REF:
> obj=0x7f0f340b8d50
> 2013-12-13 07:03:51.457+0000: 11933: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f0f340b8d50
> 
> So far so good. But what's missing is the last part, where the object is
> disposed, something like

Yes, i didn't restart ibvirtd before collect the logs, so missed the last part.
Now the logs look like:
2013-12-13 08:23:52.830+0000: 12158: debug : virObjectNew:199 : OBJECT_NEW: obj=0x7f22780b8990 classname=virNetcfDriverState
2013-12-13 08:23:53.271+0000: 12158: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f22780b8990
2013-12-13 08:23:53.272+0000: 12158: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f22780b8990
2013-12-13 08:23:53.330+0000: 12158: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f22780b8990
2013-12-13 08:23:53.331+0000: 12158: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f22780b8990
2013-12-13 08:23:53.332+0000: 12158: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f22780b8990
2013-12-13 08:23:53.332+0000: 12158: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f22780b8990
2013-12-13 08:23:53.333+0000: 12158: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f22780b8990
2013-12-13 08:23:53.333+0000: 12158: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f22780b8990
2013-12-13 08:24:25.768+0000: 12147: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f22780b8990
2013-12-13 08:24:25.768+0000: 12147: debug : virObjectUnref:258 : OBJECT_DISPOSE: obj=0x7f22780b8990


After OBJECT_DISPOSE, there no ref to the object.


> 
> 2013-12-13 07:03:51.457+0000: 11933: debug : virObjectUnref:256 :
> OBJECT_DISPOSE: obj=0x7f0f340b8d50
> 
> and then we must not see any REF on obj=0x7f0f340b8d50.

Comment 24 Michal Privoznik 2013-12-13 08:35:33 UTC
(In reply to yanbing du from comment #23)

> 
> After OBJECT_DISPOSE, there no ref to the object.
> 

Brilliant! The bug is fixed and can be moved to VERIFIED.

Comment 25 yanbing du 2013-12-16 02:41:12 UTC
(In reply to Michal Privoznik from comment #24)
> (In reply to yanbing du from comment #23)
> 
> > 
> > After OBJECT_DISPOSE, there no ref to the object.
> > 
> 
> Brilliant! The bug is fixed and can be moved to VERIFIED.

Thanks, move bug to VERIFIED.

Comment 26 Ludek Smid 2014-06-13 12:15:11 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

Comment 28 Pavel Hrdina 2014-11-05 12:06:34 UTC
*** Bug 966449 has been marked as a duplicate of this bug. ***