Bug 617291
Summary: | Use of global variables is not thread safe in libnl | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Daniel Berrangé <berrange> | ||||||
Component: | libnl | Assignee: | Dan Williams <dcbw> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | desktop-bugs <desktop-bugs> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | low | ||||||||
Version: | 6.0 | CC: | jlaska, laine, rjones, syeghiay, tao, tpelka | ||||||
Target Milestone: | rc | Keywords: | RHELNAK | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | libnl-1.1-12.el6 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2010-11-10 21:03:58 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 616696 | ||||||||
Attachments: |
|
Description
Daniel Berrangé
2010-07-22 17:03:10 UTC
This issue has been proposed when we are only considering blocker issues in the current Red Hat Enterprise Linux release. ** If you would still like this issue considered for the current release, ask your support representative to file as a blocker on your behalf. Otherwise ask that it be considered for the next Red Hat Enterprise Linux release. ** SO the upstream fix for this was: http://git.kernel.org/?p=libs/netlink/libnl.git;a=commit;h=8a3efffa5b3fde252675239914118664d36a2c24 which is a pretty big API/ABI break for libnl. We're still shipping a pretty old version of libnl (not that libnl has active release management) and this is obviously not fixed in our version. So perhaps we can work around it with pthread locking or something for the time being? putting a lock around calls to libnl isn't really an option, because two separate libraries used by libvirt make calls to libnl (ie, there is no central controlling authority). And as you say, updating libnl, with the API breakage, is a no-go. That's why the discussion we had ended up with putting the static pointer used by nl_error into thread-local storage. I just tried defining errbuf as static __thread and libnl built with no problem. (I had no idea that thread-local storage had gotten so simple). I'm going to try and cook up a way to get a lot of libnl failures in multiple threads and see if this helps the situation. Created attachment 435691 [details]
test program to easily demonstrate the problem
In order to verify that any proposed change actually solves the problem, I created this test program which produces the same crash seen in libvirt 90% of the time (it *always* crashes in < 20 seconds, sometimes the backtrace is slightly different).
Just compile it like this:
cc -g -lnl -lpthread -o nl_error_crash nl_error_crash.c
and run it. It creates 20 threads that each begin leaking nl_sockets as fast as they can create them. Very quickly, all 20 threads encounter libnl error conditions, and wait 1/10th of a second before printing out their error.
(Note that even before the program crashes, the error text printed out is often gibberish).
Created attachment 435692 [details]
patch to libnl to eliminate the crash in question
This patch changes errbuf and nlerrno to be threadlocal data by declaring them with "__thread".
With a patched libnl installed. The test program in the previous attachment routinely runs to completion. I haven't gotten it to crash even once, and the error messages are always correct.
I would post this, but don't know what list it should go to.
Based on our investigations and discussion (detailed in the email included below) the libvirt team believes it is safe to change errbuf and nlerrno threadlocal. We also need to patch Fedora's libnl appropriately (at least back as far as F12), and notify upstream libnl for completeness (although upstream has moved beyond this problem with an API change). What is the next step required? We *really* want this in as soon as possible, because it's causing stress tests to crash. On 08/03/2010 04:31 AM, Daniel Veillard wrote: > [ Adding Dan William in the loop, Dan this is about > https://bugzilla.redhat.com/show_bug.cgi?id=617291 > where libnl being not thread safe crashes libvirt daemon ] > > On Mon, Aug 02, 2010 at 05:27:57PM +1000, Justin Clift wrote: >> On 08/02/2010 04:00 PM, Laine Stump wrote: >> <snip> >>> An interesting point. Does someone have a simple way I can find all uses >>> of libnl in RHEL and Fedora packages? >> That sounds like the kind of thing repoquery may be useful for. >> From a F13 system: > [...] >> From a RHEL 6 beta 2+ system: >> >> $ repoquery --whatrequires libnl >> libnl-0:1.1-11.el6.i686 >> libnl-0:1.1-11.el6.x86_64 >> libnl-devel-0:1.1-11.el6.x86_64 >> anaconda-0:13.21.65-1.el6.x86_64 >> libvirt-client-0:0.8.1-20.el6.i686 >> dropwatch-0:1.2-0.el6.x86_64 >> libvirt-client-0:0.8.1-20.el6.x86_64 >> netlabel_tools-0:0.19-6.el6.x86_64 >> crda-0:1.1.1_2009.11.25-3.el6.x86_64 >> netcf-0:0.1.6-4.el6.x86_64 >> NetworkManager-1:0.8.1-4.el6.x86_64 >> libvirt-python-0:0.8.1-20.el6.x86_64 >> iw-0:0.9.17-4.el6.x86_64 >> libnl-devel-0:1.1-11.el6.i686 >> libvirt-0:0.8.1-20.el6.x86_64 >> netcf-libs-0:0.1.6-4.el6.i686 >> netcf-libs-0:0.1.6-4.el6.x86_64 >> quota-1:3.17-10.el6.x86_64 >> $ >> >> Hope that's useful. :) > Very ! > > I think we are lucky, basically libnl doesn't seem used by other > libraries beside libvirt that mean that if we are fine with the > patch, beside our netcf/libvirt stack, there isn't that much > to double check: > > anaconda: that's the big beast > Certainly multithreaded but change in error reporting > for netlink problems sounds unlikely to affect its normal > behaviour, but to be checked with maintainer. > > crda: daemon but not multithreaded > [root@test2 ~]# ldd /sbin/crda | grep thread > [root@test2 ~]# > > netlabel_tools: not multithread either > [root@test2 ~]# ldd /sbin/netlabelctl | grep thread > [root@test2 ~]# > > iw: command line wireless setup tool not multithreaded > [root@test2 ~]# ldd /sbin/iw | grep thread > [root@test2 ~]# > > dropwatch: monitoring tool for dropped packet, not multithreaded > [root@test2 ~]# ldd /usr/bin/dropwatch | grep thread > [root@test2 ~]# > > quota: in the quota set of binaries, only /usr/sbin/quota_nld > is using both libnl and libpthread > > [root@test2 ~]# for i in `rpm -qil quota | grep bin ` ; do if [ > "`ldd $i | grep libnl`" != "" ] ; then echo $i ; ldd $i | grep > thread ; fi; done > /usr/sbin/quota_nld > libpthread.so.0 => /lib64/libpthread.so.0 (0x00007facaabfb000) > [root@test2 ~]# > > looking at the code the libnl library is accessed in a single > threaded fashion, looping reading one message, processing it > and waiting for next one: > > static void run(struct nl_handle *nhandle) line 357 > > http://www.sfr-fresh.com/linux/misc/quota-3.17.tar.gz:a/quota-tools/quota_nld.c > > I guess the libpthread use comes from the dbus backend used by > the daemon. > > Not unsurprizing, libnl is not used by many multithreaded apps, > which is why the deficiency has not been raised ... until now. > So except for anaconda use, I feel fairly confident with pushing that > libnl patch to RHEL-6 code base Looking from a different angle - I did a cvs co and make prep of all those packages, then searched for uses of nl_get_errno and nl_geterror (which are the only two functions that could use the two variables that I want to make threadlocal outside the scope of a single call to libnl). Aside from libnl itself, only NetworkManager makes use of either of those. I went through all the calls to nl_geterror and nl_get_errno in libnl and NetworkManager, and they are all preceded by a call to an nl_* (or rtnl_*) that would set errbuf/nlerrno. In other words, I couldn't find any use of nl_geterror or nl_get_errno that assumed cross-thread integrity of errbuf or nlerrno. based on that, it looks safe to me. (certainly safer than the current situation! ;-) NM only uses one thread anyway, and any relevant libs that NM links to dont' use threading or don't use libnl. So we should be pretty safe there. I've requested the ACKs necessary to get this into RHEL, so it's up to PM & QE now. Adding QA_ACK for rhel-6.0.0. Thanks for the details Laine and Dan. The scope of the change appears to be limited and a well understood. laine: would you, or someone from the virt team be able to confirm this fix once built and available in RHEL6? Seems like you guys have the reproducer in then virt space. Otherwise, desktop-qa can confirm that the patch is applied and available in an updated package. James, I've grabbed the updated libnl package and verified that my reproducer test program (attached up above) which failed with old libnl can run successfully to completion with new libnl. The original bug report in Bug 616696 was for a situation that I can't easily test myself, and may not be reproducible by the reporter either, as it only occurred durring a stress test as the result of leaking netlink resources in libvirt, and that leak has also been fixed. I will make sure that the original bug points over to this one to document the resolution, though, and they can re-run their stress test to verify. *** Bug 616696 has been marked as a duplicate of this bug. *** Red Hat Enterprise Linux 6.0 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you. |