Bug 1034807

Summary: deadlock in nwfilter code under heavy load
Product: Red Hat Enterprise Linux 7 Reporter: Laine Stump <laine>
Component: libvirtAssignee: Laine Stump <laine>
Status: CLOSED CURRENTRELEASE QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.0CC: acathrow, bdpayne, berrange, dallan, dyuan, gsun, honzhang, jmiao, laine, mjg59, mzhan, veillard
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-1.1.1-22.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 929412 Environment:
Last Closed: 2014-06-13 12:41:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 929412    
Bug Blocks: 1034806    
Attachments:
Description Flags
backtrace from deadlocked libvirtd none

Description Laine Stump 2013-11-26 14:34:02 UTC
+++ This bug was initially created as a clone of Bug #929412 +++

We're seeing deadlocks under 1.0.3. I'll attach a traceback, but it looks like virNWFilterDomainFWUpdateCB is trying to take a lock on an object while holding updateMutex (and blocking), and virNWFilterInstantiateFilter is trying to take updateMutex. We didn't see this in 1.0.2. 37abd471656957c76eac687ce2ef94d79c8e2731 seems like a plausible candidate?

--- Additional comment from Daniel Veillard on 2013-03-30 12:46:08 EDT ---

Hum, I didn't see an obvious patch for such an issue in the git commits since
v1.0.3, but if you have time giving a try to 1.0.4-rc2 it is available
at ftp://libvirt.org/libvirt/

Thanks for the backtrace, I see a thread in qemuNetworkIfaceConnect too
do you have a specific scenario to reproduce this ? That libvirtd is quite
busy !

--- Additional comment from Matthew Garrett on 2013-03-30 14:07:29 EDT ---

I'll see if I can get a full description of the reproduction case set up and give 1.0.4 a go - it'll be some time next week.

--- Additional comment from Matthew Garrett on 2013-11-25 13:40:46 EST ---

Still seeing this with 1.1.4, in exactly the same circumstances. This is while we're doing load testing, so there's a large number of instances being created and destroyed at around the same time. I don't have a trivial reproduction case.

--- Additional comment from Dave Allan on 2013-11-25 13:56:14 EST ---

Roughly how often are you seeing this and are you willing to install test builds to try to identify the source?

--- Additional comment from Matthew Garrett on 2013-11-25 14:00:18 EST ---

2 or 3 days under heavy load is enough to trigger it. This is a test environment, so I can test patches. The cause seems to be that the virDomainCreateWithFlags()→_virNWFilterInstantiateFilter() path calls virObjectLock() and then virNWFilterLockFilterUpdates(), while the remoteDispatchNWFilterUndefine()→virNWFilterDomainFWUpdateCB() path calls virNWFilterLockFilterUpdates() and then virObjectLock().

--- Additional comment from Daniel Berrange on 2013-11-26 05:56:07 EST ---

Confirmed from inspection that the lock ordering is fubar here. In addition to the  nwfilterUndefine method, the nwfilterDefineXML will suffer the same flaw. The code naively assumed that making the nwfilter mutex recursive would avoid the issuing, ignoring the fact that the domain lock filter is not recursive. The code should have been written to avoid the recursively locking completely.

Comment 1 Laine Stump 2013-12-10 14:41:02 UTC
Created attachment 834797 [details]
backtrace from deadlocked libvirtd

Comment 2 Jincheng Miao 2014-01-03 07:37:23 UTC
Hi Laine,

Could you show the reproduce steps of this deadlock?

Comment 3 Laine Stump 2014-01-03 12:35:07 UTC
This bug was originally filed against upstream (Bug 929412) by Matthew Garrett. He may be able to offer more details about his test setup, but it looks like you should be able to trigger it by rapidly doing a lot of  start/destroy of domains which have <filterref> elements, while simultaneously defining/undefining nwfilter rules - the deadlock occurs if a domain operation (start/destroy, for example) happens to be holding the nwfilter lock at the time a nwfilter operation (nwfilter-define/nwfilter-undefine) tries to scan all domains to update their filters after adding/removing a filter from the list.

Comment 4 Jincheng Miao 2014-01-06 08:46:32 UTC
(In reply to Laine Stump from comment #3)
> This bug was originally filed against upstream (Bug 929412) by Matthew
> Garrett. He may be able to offer more details about his test setup, but it
> looks like you should be able to trigger it by rapidly doing a lot of 
> start/destroy of domains which have <filterref> elements, while
> simultaneously defining/undefining nwfilter rules - the deadlock occurs if a
> domain operation (start/destroy, for example) happens to be holding the
> nwfilter lock at the time a nwfilter operation
> (nwfilter-define/nwfilter-undefine) tries to scan all domains to update
> their filters after adding/removing a filter from the list.

Thanks Laine, I can reproduce it with your suggestion.

Comment 6 Daniel Berrangé 2014-02-03 10:45:48 UTC
Upstream fixed in commits

commit 6e5c79a1b5a8b3a23e7df7ffe58fb272aa17fbfb
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 17:28:29 2014 +0000

    Push nwfilter update locking up to top level

commit c065984b58000a44c90588198d222a314ac532fd
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 15:26:21 2014 +0000

    Add a read/write lock implementation

Comment 7 Laine Stump 2014-02-04 14:53:42 UTC
backported patches have been tested on RHEL7 - deadlock easily occurred without the patches, and was eliminated with them. Posted to rhvirt-patches:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2014-February/msg00079.html

Comment 9 hongming 2014-02-11 07:53:22 UTC
Verify the bug as follows. The result is expected. Move its status to VERIFIED.

Versions
libvirt-1.1.1-22.el7.x86_64


Steps
1. prepare a nwfilter xml and define it

# cat disallow-arp.xml 
<filter name='disallow-arp' chain='arp'>
  <rule action='drop' direction='inout' priority='500'/>
</filter>


# virsh nwfilter-define disallow-arp.xml 
Network filter disallow-arp defined from disallow-arp.xml

2. modify guest xml

# virsh dumpxml r6.4
<domain type='kvm'>
......

    <interface type='network'>
      <mac address='52:54:00:e7:33:6c'/>
      <source network='default1'/>
      <model type='rtl8139'/>
      <filterref filter='disallow-arp'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
......
</domain>

3. execute start and destroy guest operation

# while true ; do virsh start r6.4 ; sleep 1 ; virsh destroy r6.4 ; done


4. In another terminal, execute define and undefine nwfilter operation

# while true; do virsh nwfilter-undefine disallow-arp; sleep 1; virsh nwfilter-define disallow-arp.xml; done


Result
No deadlock occurs, libvirt works fine.

Comment 10 Ludek Smid 2014-06-13 12:41:02 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.