+++ This bug was initially created as a clone of Bug #929412 +++

We're seeing deadlocks under 1.0.3. I'll attach a traceback, but it looks like virNWFilterDomainFWUpdateCB is trying to take a lock on an object while holding updateMutex (and blocking), and virNWFilterInstantiateFilter is trying to take updateMutex. We didn't see this in 1.0.2. 37abd471656957c76eac687ce2ef94d79c8e2731 seems like a plausible candidate?

--- Additional comment from Daniel Veillard on 2013-03-30 12:46:08 EDT ---

Hum, I didn't see an obvious patch for such an issue in the git commits since v1.0.3, but if you have time giving a try to 1.0.4-rc2 it is available at ftp://libvirt.org/libvirt/ Thanks for the backtrace, I see a thread in qemuNetworkIfaceConnect too do you have a specific scenario to reproduce this ? That libvirtd is quite busy !

--- Additional comment from Matthew Garrett on 2013-03-30 14:07:29 EDT ---

I'll see if I can get a full description of the reproduction case set up and give 1.0.4 a go - it'll be some time next week.

--- Additional comment from Matthew Garrett on 2013-11-25 13:40:46 EST ---

Still seeing this with 1.1.4, in exactly the same circumstances. This is while we're doing load testing, so there's a large number of instances being created and destroyed at around the same time. I don't have a trivial reproduction case.

--- Additional comment from Dave Allan on 2013-11-25 13:56:14 EST ---

Roughly how often are you seeing this and are you willing to install test builds to try to identify the source?

--- Additional comment from Matthew Garrett on 2013-11-25 14:00:18 EST ---

2 or 3 days under heavy load is enough to trigger it. This is a test environment, so I can test patches.
The cause seems to be that the virDomainCreateWithFlags()→_virNWFilterInstantiateFilter() path calls virObjectLock() and then virNWFilterLockFilterUpdates(), while the remoteDispatchNWFilterUndefine()→virNWFilterDomainFWUpdateCB() path calls virNWFilterLockFilterUpdates() and then virObjectLock().

--- Additional comment from Daniel Berrange on 2013-11-26 05:56:07 EST ---

Confirmed from inspection that the lock ordering is fubar here. In addition to the nwfilterUndefine method, the nwfilterDefineXML method will suffer the same flaw. The code naively assumed that making the nwfilter mutex recursive would avoid the issue, ignoring the fact that the domain lock filter is not recursive. The code should have been written to avoid recursive locking completely.
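
The inverted ordering described above is the classic ABBA deadlock: one thread holds the object lock and waits for the update lock, while the other holds the update lock and waits for the object lock. A minimal sketch of the usual remedy, assuming plain pthreads rather than libvirt's own lock wrappers: every path takes the two locks in the same order. The names update_lock, object_lock, locked_increment, instantiate_path, update_path and run_demo are hypothetical stand-ins, not libvirt code, and the order chosen here is only an illustration; this thread does not say which order the eventual fix adopted.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins: update_lock plays the role of updateMutex,
 * object_lock the role of the per-domain object lock. */
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t object_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

enum { ITERATIONS = 100000 };

/* Single agreed order: update_lock first, then object_lock.
 * If one caller instead took object_lock first and update_lock second,
 * two threads could each hold one lock and wait forever for the other. */
static void locked_increment(void)
{
    pthread_mutex_lock(&update_lock);
    pthread_mutex_lock(&object_lock);
    shared_counter++;
    pthread_mutex_unlock(&object_lock);
    pthread_mutex_unlock(&update_lock);
}

/* Stand-in for the domain-start path that instantiates filters. */
static void *instantiate_path(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        locked_increment();
    return NULL;
}

/* Stand-in for the filter define/undefine path that updates domains. */
static void *update_path(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        locked_increment();
    return NULL;
}

/* Runs both paths concurrently; returns the final counter, or -1 if a
 * thread could not be created. */
long run_demo(void)
{
    pthread_t a, b;

    shared_counter = 0;
    if (pthread_create(&a, NULL, instantiate_path, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, update_path, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* Both threads have finished, so reading without the locks is safe. */
    assert(shared_counter == 2L * ITERATIONS);
    return shared_counter;
}
```

Calling run_demo() from a program built with -pthread should always return, because neither thread can hold one lock while waiting on a lock the other acquired earlier in the order. Making one of the mutexes recursive would not help with the inverted order: recursion only lets the same thread re-acquire a lock it already owns, and does nothing for two threads blocked on each other.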
This problem was only introduced when the libvirt QEMU driver removed the global driver lock. This happened well after the 0.10.2 version that's in RHEL-6, so we're not affected here.