Created attachment 718240 [details] Backtrace from libvirt We're seeing deadlocks under 1.0.3. I'll attach a backtrace, but it looks like virNWFilterDomainFWUpdateCB is trying to take a lock on an object while holding updateMutex (and blocking), and virNWFilterInstantiateFilter is trying to take updateMutex. We didn't see this in 1.0.2. Commit 37abd471656957c76eac687ce2ef94d79c8e2731 seems like a plausible candidate?
Hmm, I didn't see an obvious patch for such an issue in the git commits since v1.0.3, but if you have time, please give 1.0.4-rc2 a try; it is available at ftp://libvirt.org/libvirt/ Thanks for the backtrace. I also see a thread in qemuNetworkIfaceConnect; do you have a specific scenario to reproduce this? That libvirtd is quite busy!
I'll see if I can get a full description of the reproduction case set up and give 1.0.4 a go - it'll be some time next week.
Still seeing this with 1.1.4, in exactly the same circumstances. This is while we're doing load testing, so there's a large number of instances being created and destroyed at around the same time. I don't have a trivial reproduction case.
Roughly how often are you seeing this, and are you willing to install test builds to help identify the source?
2 or 3 days under heavy load is enough to trigger it. This is a test environment, so I can test patches. The cause seems to be that the virDomainCreateWithFlags()→_virNWFilterInstantiateFilter() path calls virObjectLock() and then virNWFilterLockFilterUpdates(), while the remoteDispatchNWFilterUndefine()→virNWFilterDomainFWUpdateCB() path calls virNWFilterLockFilterUpdates() and then virObjectLock().
Confirmed from inspection that the lock ordering is fubar here. In addition to the nwfilterUndefine method, nwfilterDefineXML will suffer the same flaw. The code naively assumed that making the nwfilter mutex recursive would avoid the issue, ignoring the fact that the domain object lock is not recursive. The code should have been written to avoid recursive locking completely.
Matthew, is this a torture test or is there a use case that's triggering this for you?
It's part of our release validation process rather than normal use, but it means we're stuck on 1.0.2.
OpenStack appears to be suffering from this bug too: https://bugs.launchpad.net/nova/+bug/1228977
The OpenStack bug is using the following package versions from the Ubuntu Cloud Archive: libvirt-bin (1.1.1-0ubuntu8~cloud2), python-libvirt (1.1.1-0ubuntu8~cloud2). -- dims
This bug will be my top priority as soon as I clean up one thing for netcf, which should happen before the end of today. In short, the problem is that NWFilter code has a loop at the end of undefining/defining any filter that tries to lock each domain *while the nwfilter lock is held*. Likewise, while a domain is being started/stopped, it locks the domain, then at some point tries to grab the NWFilter lock. We either need to move the domain cleanup at the end of a NWFilter define/undefine to outside the NWFilter lock (re-locking and checking for updates after getting the domain lock), move domain code calling NWFilter code to outside the domain lock (again, re-locking and checking for changes to domain status after getting the nwfilter lock), or come up with something else more complicated.
@Lanie, thanks. Per @berrange's request, I've recreated the problem with server-side logging enabled. Please see libvirtd.txt.gz at the following URL: http://logs.openstack.org/64/67564/6/check/check-tempest-dsvm-full/dd0e0de/logs/ The screen-n-cpu.txt.gz has the client-side logs as well.
Daniel fixed this upstream:

commit c065984b58000a44c90588198d222a314ac532fd
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 15:26:21 2014 +0000

    Add a read/write lock implementation

commit 6e5c79a1b5a8b3a23e7df7ffe58fb272aa17fbfb
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 17:28:29 2014 +0000

    Push nwfilter update locking up to top level

The following two patches may also be needed (between the 1st and 2nd patches above) if building for a Windows target:

commit ab6979430a750603464eb55925647c15c20e001f
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 29 13:54:11 2014 +0000

    Fix pthread_sigmask check for mingw32 without winpthreads

commit 0240d94c36c8ce0e7c35b5be430acd9ebf5adcfa
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 16:17:10 2014 +0000

    Remove windows thread implementation in favour of pthreads
I have now backported and pushed these patches to all of the upstream git -maint branches as far back as v1.0.3-maint, so any downstream release should be able to just pull the appropriate -maint and rebuild to fix the problem.
After I cherry-picked these two patches, I got a new bug. It may be caused by these patches, but I have also cherry-picked other patches from the 1.1.4-maint branch, so I'm not sure: https://bugzilla.redhat.com/show_bug.cgi?id=1066801