896737 – Unable to cache kernel state in userspace libraries (at_clone() callback?)

Bug 896737 - Unable to cache kernel state in userspace libraries (at_clone() callback?)

Summary: Unable to cache kernel state in userspace libraries (at_clone() callback?)

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	19
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Carlos O'Donell
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-01-17 21:32 UTC by Eric Paris
Modified:	2017-10-31 15:13 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-02-17 14:41:10 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
example test code for clone(2) with TLS segfault (1.67 KB, text/plain) 2013-01-18 14:27 UTC, Eric Paris	no flags	Details
View All

Description Eric Paris 2013-01-17 21:32:35 UTC

We want to cache some per task kernel state information in the SELinux userspace library code.  We've found that this is impossible to do.  There appear to be 3 main ways to create a new kernel task:  fork(), pthread_create(), clone(2).  The library code I tried looks something like so (pseudo code):

static __thread char *cachedvalue;

static void clear(void) {
    cachedvalue = NULL;
}

void init_routine(void) {
    pthread_atfork(NULL, NULL, clear);
}

int setfscreatecon(char *newvalue) {
    if (!strcmp(cachedvalue, newvalue))
        return 0;
    set_value_in_kernel(newvalue);
    cachedvalue = newvalue;
}

This works perfectly for pthreads and programs which use fork.  But it fails for programs which use clone(2).  From my looking at the glibc code you have special cases in the code for RESET_PID during clone(2) to handle it.

To solve my problem I'd like to see that special case code turned into something more generic, like the pthread_atfork() callback.  Can we get an 'at_clone()'.  It would allow the RESET_PID to be done as a generic call back and others, who are willing to accept the same restrictions as getpid() and its raciness with signal handlers, could write clone(2) 'safe-ish' code as well?

Comment 1 Carlos O'Donell 2013-01-18 02:53:49 UTC

Could you describe more clearly what problem you're trying to solve? What does SELinux need?

Could you also describe in more detail what you mean by "But it fails for programs which use clone(2)?"

You don't need to clear `cachedvalue' after a clone, it's a TLS variable and will be initialized to zero anyway after the clone as part of the static TLS image setup for `.tbss'.

Once the clone call completes (assuming CLONE_SETTLS) all TLS variables point at the new thread's TLS storage.

The child can't access the parent's TLS data and therefore can't clear the `__thread'-type storage in the parent.

I don't see how the threading related RESET_PID has anything to do with this issue of resetting a value across a call.

After a clone to create a thread you should be able to safely call setfscreatecon(...) without any problems. The `newvalue' shouldn't match the `cachedvalue' and it should update the value in the kernel and cache it.

What problems are you seeing?

Comment 2 Daniel Berrangé 2013-01-18 09:29:08 UTC

FYI, this is in the context of LXC containers, which are started using the following clone flags:

CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|CLONE_NEWUSER|SIGCHLD;

Comment 3 Eric Paris 2013-01-18 14:27:41 UTC

Created attachment 682451 [details]
example test code for clone(2) with TLS segfault

I'm trying to figure out how to use CLONE_SETTLS (thank you for the pointer).  I've attached my test program.  If you just remove the SETTLS stuff you'll see my original problem.  New problem is that it appears to be taking a segfault after the clone...  So I'm doing something wrong.

libselinux wants to store a string in a __thread variable during a function call.  We use strcmp on the next call to bypass the kernel if possible.  Without SETTLS our clone call ends up seeing the value cached by the parent.

fork: pthread_atfork() solves the problem for me
pthread_create(): the __thread variable solves the problem
clone without SETTLS: we see the parent version (wish I could fix)
clone with SETTLS: I don't know how to get that working...  Can you fix my code?

Comment 4 Carlos O'Donell 2013-01-18 15:24:52 UTC

Daniel,

Thanks for that information.

Given the flags used by LXC with clone it is indeed impossible to do what you want efficiently without a new function like at_clone().

Would you accept a pthread_atclone_np() function? It would require linking against libpthread.so, and enabling threads, but it's the safest way to ensure that this function operates correctly in all cases including those where dependent libraries use threads. 

The pthread_atclone_np() would likely have the following semantics:

int pthread_atfork_np(void (*prepare)(...), void (*parent)(...), void (*child)(...));

Where `...' are all the arguments to the clone call, that way the parent or child that registers the handlers can inspect exactly what the child will be doing.

Note that traditionally the glibc stance has been that if you use clone yourself that you're in undefined territory. However, in the case of containers, there is no other way to implement this, and you want SELinux to work correctly in this case. I consider the LXC uses to be valid, and we need to find a way to enable them with SELinux.

Eric,

Using CLONE_SETTLS on x86/x86_64 and making it work is considerably more complicated than your example and involves some assembly to get working correctly. The flag is only part of what's required to implement a working TLS runtime for those architectures. You won't easily get that part of your testcase working without more effort. Even I would have to go back to the ABI and review the glibc code we use to ensure we duplicated the setup as expected by the ABI and the runtime. Worse is that you can't just initialize a block of memory, because glibc is also using TLS internally, and your block needs to initialize the static image for the application and loaded libraries. It's a heap of work.

You have to just assume that clone with CLONE_SETTLS using a *new* value for the TLS storage will work correctly *exactly* like the case of pthread_create() works correctly.

The only case you have to care about is clone *without* CLONE_SETTLS (and implicitly without CLONE_VM otherwise you'd be back in the thread scenario and it doesn't make sense for two threads to use the same TLS).

In the case of !CLONE_VM && !CLONE_SETTLS you need *something* to catch the clone call like you did with pthread_atfork().

The only tenable solution I can think of is adding something like pthread_atclone_np().

Comments?

Comment 5 Daniel Berrangé 2013-01-18 15:30:33 UTC

The downside of  pthread_atclone_np() is that it pushes pain onto every library, which needs to be ported  / made aware of the new APIs. Is there anything else that libvirt LXC could do to make the existing library work properly. eg is there a way libvirt could explicitly trigger the existing pthread_atfork() handlers to run after it has done a clone() ?

The only other option would I just thought of would be for libvirt to clone(CLONE_NEWPID) then immediately execve() another helper binary, and then do the rest of its LXC setup work, and then use fork()+unshare() to create the rest of the namespaces we need. This would avoid need for changes in selinux or glibc, but would need a bit of re-arranging code in libvirt.

Comment 6 Eric Paris 2013-01-18 15:36:31 UTC

We do know that in v3.8 kernel libvirt-lxc can use fork()+unshare() instead of custom clone, but that doesn't give us much support going backwards.

My plan would be to put the pthread_atclone_np() call in the libselinux library code, just like I have with atfork().  Thus old libvirt with new libselinux would work just fine.  I don't want to force changes on my users (aka libvirt)

Comment 7 Jakub Jelinek 2013-01-18 15:38:28 UTC

What you could do is alongside with the __thread value you cache also record in a __thread pid_t variable the result of getpid (), and upon use of the cache just check if that var is == getpid ().  getpid (), being cached in TLS, shouldn't result in a slow syscall.  Then, if the cached pid != getpid (), you'd invalidate the cache.

Comment 8 Eric Paris 2013-01-18 15:48:58 UTC

Jakub,
  That's what I was thinking of doing in the meantime.  (actually my code was using tid, but pid makes more sense).  It isn't perfect though.  What if init() does a clone with NEWPID?

Dan:

"The downside of  pthread_atclone_np() is that it pushes pain onto every library, which needs to be ported  / made aware of the new APIs. Is there anything else that libvirt LXC could do to make the existing library work properly."

If any library uses __thread variables today, they are already broken.  libselinux, in a couple of other places, uses __thread variables, but is not even fork() safe.

Rather than push the work onto me, the library writer who wants to introduce this new caching, you think we should do the work onto every caller of the library?

Comment 9 Daniel Berrangé 2013-01-18 15:51:58 UTC

(In reply to comment #8)
> "The downside of  pthread_atclone_np() is that it pushes pain onto every
> library, which needs to be ported  / made aware of the new APIs. Is there
> anything else that libvirt LXC could do to make the existing library work
> properly."
> 
> If any library uses __thread variables today, they are already broken. 
> libselinux, in a couple of other places, uses __thread variables, but is not
> even fork() safe.
> 
> Rather than push the work onto me, the library writer who wants to introduce
> this new caching, you think we should do the work onto every caller of the
> library?

My understanding is that __thread was only broken in face of clone() without CLONE_SETTLS. Since LXC is pretty much the only place where such code is used, you'd not be pushing work on to every user of the libselinux.

Comment 10 Eric Paris 2013-01-18 15:54:05 UTC

(In reply to comment #9)
> My understanding is that __thread was only broken in face of clone() without
> CLONE_SETTLS. Since LXC is pretty much the only place where such code is
> used, you'd not be pushing work on to every user of the libselinux.

Fair enough.  Although as you point out, systemd also uses clone(2) for the same reasons.

Comment 11 Jakub Jelinek 2013-01-18 15:59:43 UTC

I'd say clone(2) is pretty much unusable as an interface unless you are a library writer and take care of all the details.  I think it would be much better to have say a fork() like function that would accept some subset of clone(2) flags (and add some flags on its own), run normal atfork handlers etc.  The thing is, you just can't support all the combinations of clone flags allowed by the kernel, how is TLS set up is library implementation detail not exported to the world and requires lots of bookkeeping on the library side, so e.g. the TLS related flags clone flags just can't be specified right by other libraries or programs.  If you intend to share address space, people just should use pthread_create, nothing else, if you want to create a new process with some non-standard clone flags, let's make a new API for that, list what exact flags are actually supportable, etc.

Comment 12 Daniel Berrangé 2013-01-18 16:03:31 UTC

(In reply to comment #11)
> I'd say clone(2) is pretty much unusable as an interface unless you are a
> library writer and take care of all the details.  I think it would be much
> better to have say a fork() like function that would accept some subset of
> clone(2) flags (and add some flags on its own), run normal atfork handlers
> etc.  The thing is, you just can't support all the combinations of clone
> flags allowed by the kernel, how is TLS set up is library implementation
> detail not exported to the world and requires lots of bookkeeping on the
> library side, so e.g. the TLS related flags clone flags just can't be
> specified right by other libraries or programs.  If you intend to share
> address space, people just should use pthread_create, nothing else, if you
> want to create a new process with some non-standard clone flags, let's make
> a new API for that, list what exact flags are actually supportable, etc.

That does kind of make sense. I'd no real desire to use clone() - it was merely the only interface available to pass the extra namespace related flags. If there was a fork-like function that could accept the namespace related flags that'd be nicer, not least because I wouldn't have to deal with the tedious stack allocation + trampoline setup that you need with clone().

Comment 13 Carlos O'Donell 2013-01-18 18:38:11 UTC

So is the consensus:
* Create a fork-like function with namespace flag support, avoiding the serious problems with using clone.
* Extend pthread_atfork() to operate on fork() *and* the new fork-like function?

Seems like a safer solution than continuing to use clone along with a new pthread_atclone() system.

Comment 14 Eric Paris 2013-01-18 19:26:04 UTC

(In reply to comment #13)
> So is the consensus:
> * Create a fork-like function with namespace flag support, avoiding the
> serious problems with using clone.

I like it.

Comment 15 Carlos O'Donell 2013-01-19 13:13:17 UTC

I'm not going to be able to get to this for a while, but it would be very beneficial if those that are interested in the function provide us with:

(a) Use cases to backup the creation of the function.
- Already have LXC.
- Any others?

(b) Documentation for the new function that describes the behaviour.

(c) Documentation that describes how pthread_atfork() behaves with the new function.

Basically, someone needs to draft a design :-)

Thanks.

Comment 16 Fedora Admin XMLRPC Client 2013-01-28 20:09:21 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 17 Fedora End Of Life 2013-04-03 14:13:16 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 18 Fedora End Of Life 2015-01-09 17:35:33 UTC

This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 19 Fedora End Of Life 2015-02-17 14:41:10 UTC

Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 20 Carlos O'Donell 2017-10-31 15:13:36 UTC

Please note that glibc is no longer caching PID/TID as of glibc 2.25.

commit c579f48edba88380635ab98cb612030e3ed8691e
Author: Adhemerval Zanella <adhemerval.zanella>
Date:   Mon Oct 10 15:08:39 2016 -0300

    Remove cached PID/TID in clone
    
    This patch remove the PID cache and usage in current GLIBC code.  Current
    usage is mainly used a performance optimization to avoid the syscall,
    however it adds some issues:
    
      - The exposed clone syscall will try to set pid/tid to make the new
        thread somewhat compatible with current GLIBC assumptions.  This cause
        a set of issue with new workloads and usecases (such as BZ#17214 and
        [1]) as well for new internal usage of clone to optimize other algorithms
        (such as clone plus CLONE_VM for posix_spawn, BZ#19957).
    
      - The caching complexity also added some bugs in the past [2] [3] and
        requires more effort of each port to handle such requirements (for
        both clone and vfork implementation).
    
      - Caching performance gain in mainly on getpid and some specific
        code paths.  The getpid performance leverage is questionable [4],
        either by the idea of getpid being a hotspot as for the getpid
        implementation itself (if it is indeed a justifiable hotspot a
        vDSO symbol could let to a much more simpler solution).
    
        Other usage is mainly for non usual code paths, such as pthread
        cancellation signal and handling.
    
    For thread creation (on stack allocation) the code simplification in fact
    adds some performance gain due the no need of transverse the stack cache
    and invalidate each element pid.
    
    Other thread usages will require a direct getpid syscall, such as
    cancellation/setxid signal, thread cancellation, thread fail path (at
    create_thread), and thread signal (pthread_kill and pthread_sigqueue).
    However these are hardly usual hotspots and I think adding a syscall is
    justifiable.
    
    It also simplifies both the clone and vfork arch-specific implementation.
    And by review each fork implementation there are some discrepancies that
    this patch also solves:
    
      - microblaze clone/vfork does not set/reset the pid/tid field
      - hppa uses the default vfork implementation that fallback to fork.
        Since vfork is deprecated I do not think we should bother with it.
    
    The patch also removes the TID caching in clone. My understanding for
    such semantic is try provide some pthread usage after a user program
    issue clone directly (as done by thread creation with CLONE_PARENT_SETTID
    and pthread tid member).  However, as stated before in multiple discussions
    threads, GLIBC provides clone syscalls without further supporting all this
    semantics.
    
    I ran a full make check on x86_64, x32, i686, armhf, aarch64, and powerpc64le.
    For sparc32, sparc64, and mips I ran the basic fork and vfork tests from
    posix/ folder (on a qemu system).  So it would require further testing
    on alpha, hppa, ia64, m68k, nios2, s390, sh, and tile (I excluded microblaze
    because it is already implementing the patch semantic regarding clone/vfork).
    
    [1] https://codereview.chromium.org/800183004/
    [2] https://sourceware.org/ml/libc-alpha/2006-07/msg00123.html
    [3] https://sourceware.org/bugzilla/show_bug.cgi?id=15368
    [4] http://yarchive.net/comp/linux/getpid_caching.html

Note You need to log in before you can comment on or make changes to this bug.