Created attachment 718240 [details] Backtrace from libvirt We're seeing deadlocks under 1.0.3. I'll attach a backtrace, but it looks like virNWFilterDomainFWUpdateCB is trying to take a lock on an object while holding updateMutex (and blocking), and virNWFilterInstantiateFilter is trying to take updateMutex. We didn't see this in 1.0.2. Commit 37abd471656957c76eac687ce2ef94d79c8e2731 seems like a plausible candidate?
Hmm, I didn't see an obvious patch for such an issue in the git commits since v1.0.3, but if you have time, please give 1.0.4-rc2 a try; it is available at ftp://libvirt.org/libvirt/ Thanks for the backtrace. I also see a thread in qemuNetworkIfaceConnect; do you have a specific scenario to reproduce this? That libvirtd is quite busy!
I'll see if I can get a full description of the reproduction case set up and give 1.0.4 a go - it'll be some time next week.
Still seeing this with 1.1.4, in exactly the same circumstances. This is while we're doing load testing, so there's a large number of instances being created and destroyed at around the same time. I don't have a trivial reproduction case.
Roughly how often are you seeing this, and are you willing to install test builds to help identify the source?
2 or 3 days under heavy load is enough to trigger it. This is a test environment, so I can test patches. The cause seems to be that the virDomainCreateWithFlags()→_virNWFilterInstantiateFilter() path calls virObjectLock() and then virNWFilterLockFilterUpdates(), while the remoteDispatchNWFilterUndefine()→virNWFilterDomainFWUpdateCB() path calls virNWFilterLockFilterUpdates() and then virObjectLock().
Confirmed from inspection that the lock ordering is fubar here. In addition to the nwfilterUndefine method, nwfilterDefineXML will suffer the same flaw. The code naively assumed that making the nwfilter mutex recursive would avoid the issue, ignoring the fact that the domain object lock is not recursive. The code should have been written to avoid recursive locking completely.
Matthew, is this a torture test or is there a use case that's triggering this for you?
It's part of our release validation process rather than normal use, but it means we're stuck on 1.0.2.
OpenStack appears to be suffering from this bug too: https://bugs.launchpad.net/nova/+bug/1228977
The OpenStack bug is using the following package versions from the Ubuntu Cloud Archive: libvirt-bin (1.1.1-0ubuntu8~cloud2), python-libvirt (1.1.1-0ubuntu8~cloud2). -- dims
This bug will be my top priority as soon as I clean up one thing for netcf, which should happen before the end of today. In short, the problem is that NWFilter code has a loop at the end of undefining/defining any filter that tries to lock each domain *while the nwfilter lock is held*. Likewise, while a domain is being started/stopped, it locks the domain, then at some point tries to grab the NWFilter lock. We either need to move the domain cleanup at the end of a NWFilter define/undefine to outside the NWFilter lock (re-locking and checking for updates after getting the domain lock), move domain code calling NWFilter code to outside the domain lock (again, re-locking and checking for changes to domain status after getting the nwfilter lock), or come up with something else more complicated.
@Lanie, thanks. Per @berrange's request, I've recreated the problem with server-side logging enabled. Please see libvirtd.txt.gz at the following URL: http://logs.openstack.org/64/67564/6/check/check-tempest-dsvm-full/dd0e0de/logs/ The screen-n-cpu.txt.gz has the client-side logs as well.
Daniel fixed this upstream:

commit c065984b58000a44c90588198d222a314ac532fd
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 15:26:21 2014 +0000

    Add a read/write lock implementation

commit 6e5c79a1b5a8b3a23e7df7ffe58fb272aa17fbfb
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 17:28:29 2014 +0000

    Push nwfilter update locking up to top level

The following two patches may also be needed (between the 1st and 2nd patches above) if building for a Windows target:

commit ab6979430a750603464eb55925647c15c20e001f
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 29 13:54:11 2014 +0000

    Fix pthread_sigmask check for mingw32 without winpthreads

commit 0240d94c36c8ce0e7c35b5be430acd9ebf5adcfa
Author: Daniel P. Berrange <berrange>
Date:   Wed Jan 22 16:17:10 2014 +0000

    Remove windows thread implementation in favour of pthreads
I have now backported and pushed these patches to all of the upstream git -maint branches as far back as v1.0.3-maint, so any downstream release should be able to just pull the appropriate -maint and rebuild to fix the problem.
After I cherry-picked these two patches, I got a new bug. It may be caused by these patches, but I have also cherry-picked other patches from the 1.1.4-maint branch, so I'm not sure: https://bugzilla.redhat.com/show_bug.cgi?id=1066801