Bug 861069

Summary: Potential libvirtd segfault in netcf driver
Product: [Fedora] Fedora Reporter: Vladislav Bogdanov <bubble>
Component: libvirtAssignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 17CC: berrange, clalancette, crobinso, itamar, jforbes, jyang, laine, libvirt-maint, veillard, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-11-01 10:48:49 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
Rough patch to fix mutex issue none

Description Vladislav Bogdanov 2012-09-27 09:08:40 EDT
Description of problem:
netcf functions are protected by mutex and all ncf_* calls are made under lock.
mutex is created in interfaceOpenInterface().
interfaceOpenInterface() is called in daemon by every new connection in do_open() (called from virConnectOpen()). I may be missing full code flow because of RPC implementation complexities and a long time after I discovered this problem, but idea should be clean.
So, actually all ncf_*() calls are not protected, because every connection uses own mutex rather then the global one. That results in SEGFAULT if several connections use interface driver at the same time. I called that "potential" in subject only because I currently have no possibility to get backtraces.

Version-Release number of selected component (if applicable):
0.8.?? - 0.10.2
Comment 1 Vladislav Bogdanov 2012-09-27 09:12:01 EDT
Created attachment 618048 [details]
Rough patch to fix mutex issue

Attached patch fixes issue by moving mutex initialization from interfaceOpenInterface() to interfaceRegister().
Note: patch may fuzz because it is hand-edited from previous libvirt versions, but applies fine to 0.10.2
Comment 2 Laine Stump 2012-09-27 13:34:32 EDT
No global driver lock is needed, because netcf itself is re-entrant (as long as each thread separately calls ncf_init, and uses the netcf object that it returns for all subsequent calls).


If you are experiencing a problem that is due to multiple libvirt threads being in netcf at once, then the problem is much more likely due to:

1) a re-entrancy bug in netcf or one of its libraries (we have previously found thread-unsafe code in both augeas and libnl as a result of calling them from netcf), or in netcf itself.

2) a bug in the per-connection locking that is in libvirt.

Have you experienced a crash that you can attribute to multi-threaded use of virInterface functions? Can you describe it in more detail? Even if you're unable to get a backtrace, perhaps a description of how you trigger it could aid someone else in debugging.
Comment 3 Vladislav Bogdanov 2012-09-28 01:58:26 EDT
(In reply to comment #2)
> No global driver lock is needed, because netcf itself is re-entrant (as long
> as each thread separately calls ncf_init, and uses the netcf object that it
> returns for all subsequent calls).

1. Is that true for ncf_init() itself?
2. What then that per-thread mutex protects from?

> If you are experiencing a problem that is due to multiple libvirt threads
> being in netcf at once, then the problem is much more likely due to:
> 
> 1) a re-entrancy bug in netcf or one of its libraries (we have previously
> found thread-unsafe code in both augeas and libnl as a result of calling
> them from netcf), or in netcf itself.
> 
> 2) a bug in the per-connection locking that is in libvirt.

Perfectly correct. I bet 1) is true, see below.

> Have you experienced a crash that you can attribute to multi-threaded use of
> virInterface functions? Can you describe it in more detail?

Yes, as long as system has plenty of interfaces (I use bridges on top of vlans on top of LACP bond on top of two intel cards) and many connections are handled constantly.

> Even if you're
> unable to get a backtrace, perhaps a description of how you trigger it could
> aid someone else in debugging.

System has more than dozen of bridges which are listed as intrfaces.
System runs pacemaker which has resources for several disk pools (4 in my case), some are LVM based (clvm), some are filesystem-based (CIFS).
Also, several dozens of virtual machines is run over the cluster, so every node runs 10-20 VMs.
Pacemaker monitors each libvirt-related resource several times per minute (2-6).
Each monitor operation results in connection being made to local libvirtd daemon, so I'd say libvirtd load in terms of connection handling (connect, do the request, disconnect) is moderate.

I should admit that no calls are made directly to 'interface' driver, but driver is initialized once per connection.

If I recall that correctly, SIGSEGVs I saw happened somewhere inside of ncf_init(). So, may be only that function should be serialized, not all calls.
I saw that crashes several times per day after installation of first libvirt release with interface driver (iirc it was added after 0.8.1).
After I moved mutex to global space and serialized calls to netcf, libvirtd does not crash at all anymore (for >6 months).
Comment 4 Vladislav Bogdanov 2012-09-28 02:02:26 EDT
(In reply to comment #3)
> I saw that crashes several times per day after installation of first libvirt
> release with interface driver (iirc it was added after 0.8.1).

s/with interface driver/compiled with netcf support/
Comment 5 Daniel Berrange 2012-09-28 03:49:39 EDT
> I saw that crashes several times per day after installation of first libvirt
> release with interface driver (iirc it was added after 0.8.1).
> After I moved mutex to global space and serialized calls to netcf, libvirtd
> does not crash at all anymore (for >6 months).

I'm willing to be that those crashes were due to the libnl thread safety  bugs we discovered.
Comment 6 Vladislav Bogdanov 2012-10-01 05:14:59 EDT
Are that bugs fixed?
I see only 1.1 in koji. And nothing relevant in a changelog.
Comment 7 Cole Robinson 2012-10-19 15:54:27 EDT
Vladislav, in F18 at least libvirt uses libnl3, which is a different package.

Can you drop your local patch, reproduce one of the crashes, and post the stack trace here so we can confirm if this is a known issue?
Comment 8 Cole Robinson 2012-11-01 10:48:49 EDT
Okay, i'm just assuming modern libnl fixes this. Closing as CURRENTRELEASE