Bug 929412

Summary: libvirt 1.0.3 appears to deadlock
Product: [Community] Virtualization Tools
Reporter: Matthew Garrett <mjg59>
Component: libvirt
Assignee: Daniel Berrangé <berrange>
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Version: unspecified
CC: acathrow, bdpayne, berrange, dallan, davanum, hzwangpan, laine, mjg59, veillard
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-02-04 10:38:48 UTC
Bug Blocks: 1034806, 1034807
Attachments: Backtrace from libvirt

Description Matthew Garrett 2013-03-30 12:23:10 UTC
Created attachment 718240 [details]
Backtrace from libvirt

We're seeing deadlocks under 1.0.3. I'll attach a traceback, but it looks like virNWFilterDomainFWUpdateCB is trying to take a lock on an object while holding updateMutex (and blocking), and virNWFilterInstantiateFilter is trying to take updateMutex. We didn't see this in 1.0.2. 37abd471656957c76eac687ce2ef94d79c8e2731 seems like a plausible candidate?

Comment 1 Daniel Veillard 2013-03-30 16:46:08 UTC
Hmm, I didn't see an obvious patch for such an issue in the git commits since
v1.0.3, but if you have time, give 1.0.4-rc2 a try; it is available
at ftp://libvirt.org/libvirt/

Thanks for the backtrace; I see a thread in qemuNetworkIfaceConnect too.
Do you have a specific scenario to reproduce this? That libvirtd is quite
busy!

Comment 2 Matthew Garrett 2013-03-30 18:07:29 UTC
I'll see if I can get a full description of the reproduction case set up and give 1.0.4 a go - it'll be some time next week.

Comment 3 Matthew Garrett 2013-11-25 18:40:46 UTC
Still seeing this with 1.1.4, in exactly the same circumstances. This is while we're doing load testing, so there's a large number of instances being created and destroyed at around the same time. I don't have a trivial reproduction case.

Comment 4 Dave Allan 2013-11-25 18:56:14 UTC
Roughly how often are you seeing this and are you willing to install test builds to try to identify the source?

Comment 5 Matthew Garrett 2013-11-25 19:00:18 UTC
2 or 3 days under heavy load is enough to trigger it. This is a test environment, so I can test patches. The cause seems to be that the virDomainCreateWithFlags()→_virNWFilterInstantiateFilter() path calls virObjectLock() and then virNWFilterLockFilterUpdates(), while the remoteDispatchNWFilterUndefine()→virNWFilterDomainFWUpdateCB() path calls virNWFilterLockFilterUpdates() and then virObjectLock().
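
The two call paths above take the same pair of locks in opposite order, i.e. a classic ABBA inversion. A minimal, self-contained sketch of that pattern (illustrative only, not libvirt code; the lock names and thread bodies are stand-ins for the paths named above):

    /* Illustrative only -- not libvirt code.  Thread A mimics the
     * virDomainCreateWithFlags() path (object lock, then update lock);
     * thread B mimics the nwfilterUndefine() path (update lock, then
     * object lock).  Run long enough, the two threads deadlock. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t objectLock = PTHREAD_MUTEX_INITIALIZER;  /* per-domain object lock */
    static pthread_mutex_t updateMutex = PTHREAD_MUTEX_INITIALIZER; /* global nwfilter update lock */

    static void *domain_start(void *arg)        /* ~ _virNWFilterInstantiateFilter() path */
    {
        (void)arg;
        pthread_mutex_lock(&objectLock);        /* virObjectLock(vm) */
        pthread_mutex_lock(&updateMutex);       /* virNWFilterLockFilterUpdates() */
        puts("domain start: instantiating filters");
        pthread_mutex_unlock(&updateMutex);
        pthread_mutex_unlock(&objectLock);
        return NULL;
    }

    static void *filter_undefine(void *arg)     /* ~ virNWFilterDomainFWUpdateCB() path */
    {
        (void)arg;
        pthread_mutex_lock(&updateMutex);       /* virNWFilterLockFilterUpdates() */
        pthread_mutex_lock(&objectLock);        /* virObjectLock(vm) for each domain */
        puts("filter undefine: rebuilding rules");
        pthread_mutex_unlock(&objectLock);
        pthread_mutex_unlock(&updateMutex);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        for (;;) {                              /* loop until the inversion bites */
            pthread_create(&a, NULL, domain_start, NULL);
            pthread_create(&b, NULL, filter_undefine, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
        }
    }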

Comment 6 Daniel Berrangé 2013-11-26 10:56:07 UTC
Confirmed from inspection that the lock ordering is fubar here. In addition to the nwfilterUndefine method, the nwfilterDefineXML method suffers the same flaw. The code naively assumed that making the nwfilter mutex recursive would avoid the issue, ignoring the fact that the domain object lock is not recursive. The code should have been written to avoid recursive locking entirely.
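
For context, a recursive mutex only lets a single thread re-acquire a mutex it already holds; it does nothing for an inversion between two different locks. A short sketch of the assumption described above (illustrative only, not the actual libvirt code):

    /* Illustrative only.  Making updateMutex recursive... */
    #include <pthread.h>

    static pthread_mutex_t updateMutex;

    static void init_update_mutex(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&updateMutex, &attr);
        /* ...only lets the SAME thread re-lock updateMutex without blocking.
         * The deadlock here involves two threads and two DIFFERENT locks
         * (updateMutex and the non-recursive domain object lock) taken in
         * opposite order, so the ABBA cycle is unaffected. */
    }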

Comment 7 Dave Allan 2013-12-10 19:06:59 UTC
Matthew, is this a torture test or is there a use case that's triggering this for you?

Comment 8 Matthew Garrett 2013-12-10 19:14:20 UTC
It's part of our release validation process rather than normal use, but it means we're stuck on 1.0.2.

Comment 9 Daniel Berrangé 2014-01-21 16:26:23 UTC
OpenStack appears to be suffering from this bug too: https://bugs.launchpad.net/nova/+bug/1228977

Comment 10 Davanum Srinivas 2014-01-21 21:35:43 UTC
The OpenStack bug was hit with the following packages/versions from the Ubuntu Cloud Archive:

libvirt-bin (1.1.1-0ubuntu8~cloud2)
python-libvirt (1.1.1-0ubuntu8~cloud2)

-- dims

Comment 11 Laine Stump 2014-01-22 10:07:04 UTC
This bug will be my top priority as soon as I clean up one thing for netcf, which should happen before the end of today.

In short, the problem is that NWFilter code has a loop at the end of undefining/defining any filter that tries to lock each domain *while the nwfilter lock is held*. Likewise, while a domain is being started/stopped, it locks the domain, then at some point tries to grab the NWFilter lock.

We need to either move the domain cleanup at the end of an NWFilter define/undefine to outside the NWFilter lock (re-locking and checking for updates after getting the domain lock), move the domain code that calls NWFilter code to outside the domain lock (again, re-locking and checking for changes to domain status after getting the nwfilter lock), or come up with something else more complicated.
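
A rough pseudo-C sketch of the first option (the helper names are hypothetical stand-ins, not libvirt API, and error handling is omitted): mutate the filter list under the update lock, drop it, then visit each domain using the same lock order as the domain start/stop path, re-checking domain state after re-acquiring the locks.

    #include <stddef.h>

    typedef struct virDomainObjStub virDomainObjStub;   /* stand-in for a domain object */

    /* All of these helpers are hypothetical, for illustration only. */
    extern void filter_lock(void);
    extern void filter_unlock(void);
    extern void domain_lock(virDomainObjStub *dom);
    extern void domain_unlock(virDomainObjStub *dom);
    extern int  domain_is_active(virDomainObjStub *dom);
    extern void remove_filter_definition(const char *name);
    extern void rebuild_domain_firewall(virDomainObjStub *dom);

    static void
    filter_undefine_sketch(const char *filtername,
                           virDomainObjStub **doms, size_t ndoms)
    {
        size_t i;

        filter_lock();                        /* global nwfilter update lock */
        remove_filter_definition(filtername); /* only mutate the filter list here */
        filter_unlock();                      /* drop it BEFORE touching any domain */

        for (i = 0; i < ndoms; i++) {
            domain_lock(doms[i]);             /* domain lock first...              */
            filter_lock();                    /* ...filter lock second: same order */
                                              /* as the domain start/stop path     */
            if (domain_is_active(doms[i]))    /* re-check state after re-locking   */
                rebuild_domain_firewall(doms[i]);
            filter_unlock();
            domain_unlock(doms[i]);
        }
    }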

Comment 12 Davanum Srinivas 2014-01-22 16:36:03 UTC
@Laine,

Thanks. 

Per @berrange's request, I've recreated the problem with server-side logging enabled. Please see libvirtd.txt.gz at the following URL:

http://logs.openstack.org/64/67564/6/check/check-tempest-dsvm-full/dd0e0de/logs/

The screen-n-cpu.txt.gz file has the client-side logs as well.

Comment 13 Laine Stump 2014-02-04 10:38:48 UTC
Daniel fixed this upstream:

commit c065984b58000a44c90588198d222a314ac532fd
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 15:26:21 2014 +0000

    Add a read/write lock implementation

commit 6e5c79a1b5a8b3a23e7df7ffe58fb272aa17fbfb
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 17:28:29 2014 +0000

    Push nwfilter update locking up to top level

The following two patches may also be needed (between the 1st and 2nd patches above) if building for a Windows target:

commit ab6979430a750603464eb55925647c15c20e001f
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 29 13:54:11 2014 +0000

    Fix pthread_sigmask check for mingw32 without winpthreads

commit 0240d94c36c8ce0e7c35b5be430acd9ebf5adcfa
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 16:17:10 2014 +0000

    Remove windows thread implementation in favour of pthreads
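
For context, the shape of the fix is roughly: replace the single update mutex with a read/write lock taken at the top of the driver entry points, so the many domain start/stop paths take it for reading (and can proceed concurrently) while filter define/undefine take it for writing, removing the nested, inverted locking. A minimal sketch of that pattern using plain pthread rwlocks (illustrative only; this is not the code from the commits above):

    #include <pthread.h>

    static pthread_rwlock_t updateLock = PTHREAD_RWLOCK_INITIALIZER;

    /* Taken at the top of domain start/stop entry points: many such
     * callers may hold the read side at once, and the per-domain object
     * lock is only acquired while this ordering is already fixed. */
    static void domain_start_entry(void)
    {
        pthread_rwlock_rdlock(&updateLock);
        /* ... lock the domain object, instantiate its filters ... */
        pthread_rwlock_unlock(&updateLock);
    }

    /* Taken at the top of nwfilter define/undefine entry points: the
     * write side excludes all readers, so no domain can be starting or
     * stopping while the filter rules are being rebuilt. */
    static void filter_undefine_entry(void)
    {
        pthread_rwlock_wrlock(&updateLock);
        /* ... update the filter and rebuild rules for each domain ... */
        pthread_rwlock_unlock(&updateLock);
    }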

Comment 14 Laine Stump 2014-02-06 16:07:49 UTC
I have now backported and pushed these patches to all of the upstream git -maint branches as far back as v1.0.3-maint, so any downstream release should be able to just pull the appropriate -maint branch and rebuild to fix the problem.

Comment 15 Wangpan 2014-02-19 07:15:10 UTC
After I cherry-picked these two patches, I hit a new bug. It may be caused by these patches, but I have also cherry-picked other patches from 1.1.4-maint, so I'm not sure:
https://bugzilla.redhat.com/show_bug.cgi?id=1066801