Bug 896281

Summary: Stop using clone(2) and use fork() instead
Product: [Community] Virtualization Tools Reporter: Eric Paris <eparis>
Component: libvirtAssignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED WONTFIX QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: unspecifiedCC: berrange, clalancette, crobinso, dwalsh, itamar, jakub, jforbes, laine, libvirt-maint, rbalakri, veillard, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-23 21:58:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eric Paris 2013-01-16 23:10:51 UTC
In src/lxc/lxc_container.c you use the clone(2) syscall in order to get the namespaces requested.  But using clone is a VERY bad idea.  after a call to clone(2) you can't even use getpid() and more.  You can't use any glibc functions safely.  The quote from Jakub about clone(2) ?

basically after it you can't call any glibc functions at all
even errno won't work, getpid() won't work etc.

We found this problem because of libselinux libraries trying to cache userspace information.  Since you did not use fork() or pthread_create() or a libc mechanism, and instead directly used a kernel mechanism, the library falls down.  If you can't cound on glibc() functions after clone(2) you can't count on libselinux functions after clone(2).

The suggestion would be to do

fork();
unshare();
unshare();
unshare();
unshare();

Then you don't break the underlying libraries...

Comment 1 Daniel Berrangé 2013-01-17 17:18:12 UTC
I'm afraid we can't do this. It is not possible use CLONE_NEWPID with unshare(). New PID namespaces can only be started with clone().

WRT to clone() + getpid() the man page disagrees wtih what you describe

      "Versions of the GNU C library that include the NPTL threading library  contain  a
       wrapper  function  for  getpid(2)  that  performs  caching of PIDs.  This caching
       relies on support in the glibc wrapper for clone(), but as currently implemented,
       the  cache may not be up to date in some circumstances.  In particular, if a sig‐
       nal is delivered to the child immediately after the clone() call, then a call  to
       getpid(2)  in  a handler for the signal may return the PID of the calling process
       ("the parent"), if the clone wrapper has not yet had a chance to update  the  PID
       cache  in  the child. "

we don't have a case where a signal handler will run, so that doesn't matter to us. We have not encountered any problems with glibc in our current usage of clone.

What is the specific problem libselinux is seeing ?

Comment 2 Eric Paris 2013-01-17 19:00:30 UTC
setfscreatecon basically does:

static __thread char *cachedvalue;

void clear(void)
{
    cachedvalue = NULL;
}

int setfscreatecon(char *newvalue)
{
    pthread_atfork(NULL, NULL, clear); // clear TLS on fork()

    if (!strncmp(newvalue, cachedvalue))
        return 0;
    set_new_value_in_kernel(newvalue);
    cachedvalue = newvalue;
}

Now libvirt uses the setfscreatecon (and getfscreatecon) which will set the cachedvalue.  On clone(2) there is no way to UNSET the cachedvalue in the userspace library, even though in the kernel the new task will not have a value set.  So when the child uses get/set it will be against the CACHED value, which is inconsistent with the kernel state.  glibc provides no mechansim to do the cleanup...

If you use fork() glibc will call out to the pthread_atfork() -> clear() function.  If you use pthreads, glibc will take care of the variable because it is __thread.  But with clone, you just get the same old value.

Hoping the libc people will help me find a better way, but there is no atclone() equivalent to pthread_atfork().

As of v3.8 the kernel supports unshare(CLONE_NETPID)  commit 50804fe37

I will implement a workaround in the library to try to make it usually work with your current code 

__thread pid_t lasttid;

int setfscreatecon(char *newvalue)
{
    pid_t new_tid;

    new_tid = syscall(__NR_gettid);
    if (newtid != lasttid) {
        clear();
        lasttid = newtid;
    }
[...]
}

This isn't perfect as tid wrap around means we could still mess it up.  But it'll work most of the time...

Comment 3 Daniel Berrangé 2013-01-17 19:14:17 UTC
Why is libselinux attempting todo such caching ? I tend to feel that libselinux ought to be a direct mapping to the kernel corresponding syscalls, and if apps need caching they ought to do it at their code level

Comment 4 Eric Paris 2013-01-17 19:23:51 UTC
We found 2 users with significant enough usage to make the caching beneficial, udev/systemd and coreutils (namely install).

Once we have 2 users needing to do the exact same thing it starts to look reasonable to try to do the work at the lower level rather than in each program.

I'll put my workaround in the library, but I'd really like to see libvirt move to fork()+unshare() (like systemd)

Comment 5 Daniel Berrangé 2013-01-17 21:18:16 UTC
(In reply to comment #4)
> We found 2 users with significant enough usage to make the caching
> beneficial, udev/systemd and coreutils (namely install).
> 
> Once we have 2 users needing to do the exact same thing it starts to look
> reasonable to try to do the work at the lower level rather than in each
> program.
> 
> I'll put my workaround in the library, but I'd really like to see libvirt
> move to fork()+unshare() (like systemd)

systemd only uses unshare() when it is giving system services a private network or filesystem namespace.  For the systemd  nsspawn command, it uses clone in the same way as libvirt. Much as I'd like to use unshare(), we need to use CLONE_NEWPID, so clone() is our only option.

Comment 6 Eric Paris 2013-01-17 21:29:04 UTC
I understand your situation.  CLONE_NEWPID does work with unshare in kernel 3.8.  But nothing but rawhide has that right now...

Comment 7 Cole Robinson 2013-04-01 13:12:58 UTC
Moving to the upstream tracker.

3.8 is available generally in fedora now, but it doesn't sound like the current libvirt way of doing things is causing obvious problems. What's the conclusion here?

Comment 8 Cole Robinson 2016-03-23 21:58:20 UTC
lxc still uses clone() (as does libcontainer for example). I don't think this is ever going to change, so closing