+++ This bug was initially created as a clone of Bug #929412 +++
We're seeing deadlocks under 1.0.3. I'll attach a traceback, but it looks like virNWFilterDomainFWUpdateCB is trying to take a lock on an object while holding updateMutex (and blocking), and virNWFilterInstantiateFilter is trying to take updateMutex. We didn't see this in 1.0.2. 37abd471656957c76eac687ce2ef94d79c8e2731 seems like a plausible candidate?
--- Additional comment from Daniel Veillard on 2013-03-30 12:46:08 EDT ---
Hum, I didn't see an obvious patch for such an issue in the git commits since
v1.0.3, but if you have time giving a try to 1.0.4-rc2 it is available
Thanks for the backtrace, I see a thread in qemuNetworkIfaceConnect too
do you have a specific scenario to reproduce this ? That libvirtd is quite
--- Additional comment from Matthew Garrett on 2013-03-30 14:07:29 EDT ---
I'll see if I can get a full description of the reproduction case set up and give 1.0.4 a go - it'll be some time next week.
--- Additional comment from Matthew Garrett on 2013-11-25 13:40:46 EST ---
Still seeing this with 1.1.4, in exactly the same circumstances. This is while we're doing load testing, so there's a large number of instances being created and destroyed at around the same time. I don't have a trivial reproduction case.
--- Additional comment from Dave Allan on 2013-11-25 13:56:14 EST ---
Roughly how often are you seeing this and are you willing to install test builds to try to identify the source?
--- Additional comment from Matthew Garrett on 2013-11-25 14:00:18 EST ---
2 or 3 days under heavy load is enough to trigger it. This is a test environment, so I can test patches. The cause seems to be that the virDomainCreateWithFlags()→_virNWFilterInstantiateFilter() path calls virObjectLock() and then virNWFilterLockFilterUpdates(), while the remoteDispatchNWFilterUndefine()→virNWFilterDomainFWUpdateCB() path calls virNWFilterLockFilterUpdates() and then virObjectLock().
--- Additional comment from Daniel Berrange on 2013-11-26 05:56:07 EST ---
Confirmed from inspection that the lock ordering is fubar here. In addition to the nwfilterUndefine method, the nwfilterDefineXML will suffer the same flaw. The code naively assumed that making the nwfilter mutex recursive would avoid the issuing, ignoring the fact that the domain lock filter is not recursive. The code should have been written to avoid the recursively locking completely.
Created attachment 834797 [details]
backtrace from deadlocked libvirtd
Could you show the reproduce steps of this deadlock?
This bug was originally filed against upstream (Bug 929412) by Matthew Garrett. He may be able to offer more details about his test setup, but it looks like you should be able to trigger it by rapidly doing a lot of start/destroy of domains which have <filterref> elements, while simultaneously defining/undefining nwfilter rules - the deadlock occurs if a domain operation (start/destroy, for example) happens to be holding the nwfilter lock at the time a nwfilter operation (nwfilter-define/nwfilter-undefine) tries to scan all domains to update their filters after adding/removing a filter from the list.
(In reply to Laine Stump from comment #3)
> This bug was originally filed against upstream (Bug 929412) by Matthew
> Garrett. He may be able to offer more details about his test setup, but it
> looks like you should be able to trigger it by rapidly doing a lot of
> start/destroy of domains which have <filterref> elements, while
> simultaneously defining/undefining nwfilter rules - the deadlock occurs if a
> domain operation (start/destroy, for example) happens to be holding the
> nwfilter lock at the time a nwfilter operation
> (nwfilter-define/nwfilter-undefine) tries to scan all domains to update
> their filters after adding/removing a filter from the list.
Thanks Laine, I can reproduce it with your suggestion.
Upstream fixed in commits
Author: Daniel P. Berrange <email@example.com>
Date: Wed Jan 22 17:28:29 2014 +0000
Push nwfilter update locking up to top level
Author: Daniel P. Berrange <firstname.lastname@example.org>
Date: Wed Jan 22 15:26:21 2014 +0000
Add a read/write lock implementation
backported patches have been tested on RHEL7 - deadlock easily occurred without the patches, and was eliminated with them. Posted to rhvirt-patches:
Verify the bug as follows. The result is expected. Move its status to VERIFIED.
1. prepare a nwfilter xml and define it
# cat disallow-arp.xml
<filter name='disallow-arp' chain='arp'>
<rule action='drop' direction='inout' priority='500'/>
# virsh nwfilter-define disallow-arp.xml
Network filter disallow-arp defined from disallow-arp.xml
2. modify guest xml
# virsh dumpxml r6.4
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
3. execute start and destroy guest operation
# while true ; do virsh start r6.4 ; sleep 1 ; virsh destroy r6.4 ; done
4. In another terminal, execute define and undefine nwfilter operation
# while true; do virsh nwfilter-undefine disallow-arp; sleep 1; virsh nwfilter-define disallow-arp.xml; done
No deadlock occurs, libvirt works fine.
This request was resolved in Red Hat Enterprise Linux 7.0.
Contact your manager or support representative in case you have further questions about the request.