617291 – Use of global variables is not thread safe in libnl

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 617291 - Use of global variables is not thread safe in libnl

Summary: Use of global variables is not thread safe in libnl

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	libnl
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Dan Williams
QA Contact:	desktop-bugs@redhat.com
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	616696 (view as bug list)
Depends On:
Blocks:	616696
TreeView+	depends on / blocked

Reported:	2010-07-22 17:03 UTC by Daniel Berrangé
Modified:	2018-10-27 12:35 UTC (History)
CC List:	6 users (show)
Fixed In Version:	libnl-1.1-12.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-11-10 21:03:58 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
test program to easily demonstrate the problem (1.30 KB, text/x-csrc) 2010-07-30 22:11 UTC, Laine Stump	no flags	Details
patch to libnl to eliminate the crash in question (463 bytes, patch) 2010-07-30 22:16 UTC, Laine Stump	no flags	Details \| Diff
View All

Description Daniel Berrangé 2010-07-22 17:03:10 UTC

Description of problem:
While investigating a crash in libvirt in libnl

#0  0x0000003ca94329c5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003ca94341a5 in abort () at abort.c:92
#2  0x0000003ca946fe2b in __libc_message (do_abort=2, 
    fmt=0x3ca9543a58 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:186
#3  0x0000003ca9475746 in malloc_printerr (action=3, str=0x3ca9543de0 "double free or corruption (fasttop)", 
    ptr=<value optimized out>) at malloc.c:6283
#4  0x0000003cb941e973 in __nl_error (err=1, file=<value optimized out>, line=<value optimized out>, 
    func=<value optimized out>, fmt=0x3cb943b90a "bind() failed") at utils.c:60
#5  0x0000003cb941c8e6 in nl_connect (handle=0x7f3b940ec6c0, protocol=0) at nl.c:211

It appears that the 'nl_error' function in utils.c is not multi-threadsafe:


static char *errbuf;
static int nlerrno;

/** @cond SKIP */
int __nl_error(int err, const char *file, unsigned int line, const char *func,
               const char *fmt, ...)
{
        char *user_err;
        va_list args;

        if (errbuf) {
                free(errbuf);
                errbuf = NULL;
        }

        nlerrno = err;


Since 'errbuf' is a static global variable, if two threads happened to run 'nl_error' at the same time, they will clash, potentially doing a double-free + crash.

There is another problem with global variables  in socket.c with the 'used_ports_map' array. This can be unsafely accessed/updated from many threads at once.
 

Version-Release number of selected component (if applicable):
libnl-1.1-11.el6.x86_64

How reproducible:
Difficult since this is a multithread race condition

Steps to Reproduce:
1. No reliably reproduceable steps
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 RHEL Program Management 2010-07-22 17:37:49 UTC

This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Dan Williams 2010-07-30 04:45:01 UTC

SO the upstream fix for this was:

http://git.kernel.org/?p=libs/netlink/libnl.git;a=commit;h=8a3efffa5b3fde252675239914118664d36a2c24

which is a pretty big API/ABI break for libnl.  We're still shipping a pretty old version of libnl (not that libnl has active release management) and this is obviously not fixed in our version.

So perhaps we can work around it with pthread locking or something for the time being?

Comment 4 Laine Stump 2010-07-30 06:20:31 UTC

putting a lock around calls to libnl isn't really an option, because two separate libraries used by libvirt make calls to libnl (ie, there is no central controlling authority). And as you say, updating libnl, with the API breakage, is a no-go. That's why the discussion we had ended up with putting the static pointer used by nl_error into thread-local storage.

I just tried defining errbuf as static __thread and libnl built with no problem.
(I had no idea that thread-local storage had gotten so simple). I'm going to try and cook up a way to get a lot of libnl failures in multiple threads and see if this helps the situation.

Comment 5 Laine Stump 2010-07-30 22:11:13 UTC

Created attachment 435691 [details]
test program to easily demonstrate the problem

In order to verify that any proposed change actually solves the problem, I created this test program which produces the same crash seen in libvirt 90% of the time (it *always* crashes in < 20 seconds, sometimes the backtrace is slightly different).

Just compile it like this:

 cc -g -lnl -lpthread -o nl_error_crash nl_error_crash.c

and run it. It creates 20 threads that each begin leaking nl_sockets as fast as they can create them. Very quickly, all 20 threads encounter libnl error conditions, and wait 1/10th of a second before printing out their error.

(Note that even before the program crashes, the error text printed out is often gibberish).

Comment 6 Laine Stump 2010-07-30 22:16:17 UTC

Created attachment 435692 [details]
patch to libnl to eliminate the crash in question

This patch changes errbuf and nlerrno to be threadlocal data by declaring them with "__thread".

With a patched libnl installed. The test program in the previous attachment routinely runs to completion. I haven't gotten it to crash even once, and the error messages are always correct.

I would post this, but don't know what list it should go to.

Comment 7 Laine Stump 2010-08-03 19:42:08 UTC

Based on our investigations and discussion (detailed in the email included below) the libvirt team believes it is safe to change errbuf and nlerrno threadlocal.

We also need to patch Fedora's libnl appropriately (at least back as far as F12), and notify upstream libnl for completeness (although upstream has moved beyond this problem with an API change).

What is the next step required? We *really* want this in as soon as possible, because it's causing stress tests to crash.

On 08/03/2010 04:31 AM, Daniel Veillard wrote:
> [ Adding Dan William in the loop, Dan this is about
>    https://bugzilla.redhat.com/show_bug.cgi?id=617291
>    where libnl being not thread safe crashes libvirt daemon ]
>
> On Mon, Aug 02, 2010 at 05:27:57PM +1000, Justin Clift wrote:
>> On 08/02/2010 04:00 PM, Laine Stump wrote:
>> <snip>
>>> An interesting point. Does someone have a simple way I can find all uses
>>> of libnl in RHEL and Fedora packages?
>> That sounds like the kind of thing repoquery may be useful for.
>>  From a F13 system:
> [...]
>>  From a RHEL 6 beta 2+ system:
>>
>> $ repoquery --whatrequires libnl
>>    libnl-0:1.1-11.el6.i686
>>    libnl-0:1.1-11.el6.x86_64
>>    libnl-devel-0:1.1-11.el6.x86_64
>>    anaconda-0:13.21.65-1.el6.x86_64
>>    libvirt-client-0:0.8.1-20.el6.i686
>>    dropwatch-0:1.2-0.el6.x86_64
>>    libvirt-client-0:0.8.1-20.el6.x86_64
>>    netlabel_tools-0:0.19-6.el6.x86_64
>>    crda-0:1.1.1_2009.11.25-3.el6.x86_64
>>    netcf-0:0.1.6-4.el6.x86_64
>>    NetworkManager-1:0.8.1-4.el6.x86_64
>>    libvirt-python-0:0.8.1-20.el6.x86_64
>>    iw-0:0.9.17-4.el6.x86_64
>>    libnl-devel-0:1.1-11.el6.i686
>>    libvirt-0:0.8.1-20.el6.x86_64
>>    netcf-libs-0:0.1.6-4.el6.i686
>>    netcf-libs-0:0.1.6-4.el6.x86_64
>>    quota-1:3.17-10.el6.x86_64
>>    $
>>
>> Hope that's useful. :)
>    Very !
>
> I think we are lucky, basically libnl doesn't seem used by other
> libraries beside libvirt that mean that if we are fine with the
> patch, beside our netcf/libvirt stack, there isn't that much
> to double check:
>
>    anaconda: that's the big beast
>              Certainly multithreaded but change in error reporting
>              for netlink problems sounds unlikely to affect its normal
>              behaviour, but to be checked with maintainer.
>
>    crda: daemon but not multithreaded
>         [root@test2 ~]# ldd /sbin/crda | grep thread
>         [root@test2 ~]#
>
>    netlabel_tools: not multithread either
>         [root@test2 ~]# ldd /sbin/netlabelctl | grep thread
>         [root@test2 ~]#
>
>    iw: command line wireless setup tool not multithreaded
>         [root@test2 ~]# ldd /sbin/iw | grep thread
>         [root@test2 ~]#
>
>    dropwatch: monitoring tool for dropped packet, not multithreaded
>         [root@test2 ~]# ldd /usr/bin/dropwatch | grep thread
>         [root@test2 ~]#
>
>    quota: in the quota set of binaries, only /usr/sbin/quota_nld
>           is using both libnl and libpthread
>
>         [root@test2 ~]# for i in `rpm -qil quota | grep bin ` ; do if [
>         "`ldd $i | grep libnl`" != "" ] ; then echo $i ; ldd $i | grep
>         thread ; fi; done
>         /usr/sbin/quota_nld
>          libpthread.so.0 =>  /lib64/libpthread.so.0 (0x00007facaabfb000)
>         [root@test2 ~]#
>
>         looking at the code the libnl library is accessed in a single
>         threaded fashion, looping reading one message, processing it
>         and waiting for next one:
>
>         static void run(struct nl_handle *nhandle) line 357
>
>         http://www.sfr-fresh.com/linux/misc/quota-3.17.tar.gz:a/quota-tools/quota_nld.c
>
>         I guess the libpthread use comes from the dbus backend used by
>         the daemon.
>
> Not unsurprizing, libnl is not used by many multithreaded apps,
> which is why the deficiency has not been raised ... until now.
> So except for anaconda use, I feel fairly confident with pushing that
> libnl patch to RHEL-6 code base

Looking from a different angle - I did a cvs co and make prep of all those packages, then searched for uses of nl_get_errno and nl_geterror  (which are the only two functions that could use the two variables that I want to make threadlocal outside the scope of a single call to libnl).

Aside from libnl itself, only NetworkManager makes use of either of those. I went through all the calls to nl_geterror and nl_get_errno in libnl and NetworkManager, and they are all preceded by a call to an nl_* (or rtnl_*) that would set errbuf/nlerrno. In other words, I couldn't find any use of nl_geterror or nl_get_errno that assumed cross-thread integrity of errbuf or nlerrno.

based on that, it looks safe to me. (certainly safer than the current situation! ;-)

Comment 8 Dan Williams 2010-08-04 05:17:23 UTC

NM only uses one thread anyway, and any relevant libs that NM links to dont' use threading or don't use libnl.  So we should be pretty safe there.

I've requested the ACKs necessary to get this into RHEL, so it's up to PM & QE now.

Comment 9 James Laska 2010-08-04 15:40:20 UTC

Adding QA_ACK for rhel-6.0.0.  Thanks for the details Laine and Dan.  The scope of the change appears to be limited and a well understood.

laine: would you, or someone from the virt team be able to confirm this fix once built and available in RHEL6?  Seems like you guys have the reproducer in then virt space.  Otherwise, desktop-qa can confirm that the patch is applied and available in an updated package.

Comment 10 Laine Stump 2010-08-04 17:33:19 UTC

James,

I've grabbed the updated libnl package and verified that my reproducer test program (attached up above) which failed with old libnl can run successfully to completion with new libnl.

The original bug report in Bug 616696 was for a situation that I can't easily test myself, and may not be reproducible by the reporter either, as it only occurred durring a stress test as the result of leaking netlink resources in libvirt, and that leak has also been fixed.

I will make sure that the original bug points over to this one to document the resolution, though, and they can re-run their stress test to verify.

Comment 11 Dave Allan 2010-08-04 20:56:07 UTC

*** Bug 616696 has been marked as a duplicate of this bug. ***

Comment 15 releng-rhel@redhat.com 2010-11-10 21:03:58 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Note You need to log in before you can comment on or make changes to this bug.