In src/lxc/lxc_container.c you use the clone(2) syscall in order to get the namespaces requested. But using clone is a VERY bad idea. after a call to clone(2) you can't even use getpid() and more. You can't use any glibc functions safely. The quote from Jakub about clone(2) ? basically after it you can't call any glibc functions at all even errno won't work, getpid() won't work etc. We found this problem because of libselinux libraries trying to cache userspace information. Since you did not use fork() or pthread_create() or a libc mechanism, and instead directly used a kernel mechanism, the library falls down. If you can't cound on glibc() functions after clone(2) you can't count on libselinux functions after clone(2). The suggestion would be to do fork(); unshare(); unshare(); unshare(); unshare(); Then you don't break the underlying libraries...
I'm afraid we can't do this. It is not possible use CLONE_NEWPID with unshare(). New PID namespaces can only be started with clone(). WRT to clone() + getpid() the man page disagrees wtih what you describe "Versions of the GNU C library that include the NPTL threading library contain a wrapper function for getpid(2) that performs caching of PIDs. This caching relies on support in the glibc wrapper for clone(), but as currently implemented, the cache may not be up to date in some circumstances. In particular, if a sig‐ nal is delivered to the child immediately after the clone() call, then a call to getpid(2) in a handler for the signal may return the PID of the calling process ("the parent"), if the clone wrapper has not yet had a chance to update the PID cache in the child. " we don't have a case where a signal handler will run, so that doesn't matter to us. We have not encountered any problems with glibc in our current usage of clone. What is the specific problem libselinux is seeing ?
setfscreatecon basically does: static __thread char *cachedvalue; void clear(void) { cachedvalue = NULL; } int setfscreatecon(char *newvalue) { pthread_atfork(NULL, NULL, clear); // clear TLS on fork() if (!strncmp(newvalue, cachedvalue)) return 0; set_new_value_in_kernel(newvalue); cachedvalue = newvalue; } Now libvirt uses the setfscreatecon (and getfscreatecon) which will set the cachedvalue. On clone(2) there is no way to UNSET the cachedvalue in the userspace library, even though in the kernel the new task will not have a value set. So when the child uses get/set it will be against the CACHED value, which is inconsistent with the kernel state. glibc provides no mechansim to do the cleanup... If you use fork() glibc will call out to the pthread_atfork() -> clear() function. If you use pthreads, glibc will take care of the variable because it is __thread. But with clone, you just get the same old value. Hoping the libc people will help me find a better way, but there is no atclone() equivalent to pthread_atfork(). As of v3.8 the kernel supports unshare(CLONE_NETPID) commit 50804fe37 I will implement a workaround in the library to try to make it usually work with your current code __thread pid_t lasttid; int setfscreatecon(char *newvalue) { pid_t new_tid; new_tid = syscall(__NR_gettid); if (newtid != lasttid) { clear(); lasttid = newtid; } [...] } This isn't perfect as tid wrap around means we could still mess it up. But it'll work most of the time...
Why is libselinux attempting todo such caching ? I tend to feel that libselinux ought to be a direct mapping to the kernel corresponding syscalls, and if apps need caching they ought to do it at their code level
We found 2 users with significant enough usage to make the caching beneficial, udev/systemd and coreutils (namely install). Once we have 2 users needing to do the exact same thing it starts to look reasonable to try to do the work at the lower level rather than in each program. I'll put my workaround in the library, but I'd really like to see libvirt move to fork()+unshare() (like systemd)
(In reply to comment #4) > We found 2 users with significant enough usage to make the caching > beneficial, udev/systemd and coreutils (namely install). > > Once we have 2 users needing to do the exact same thing it starts to look > reasonable to try to do the work at the lower level rather than in each > program. > > I'll put my workaround in the library, but I'd really like to see libvirt > move to fork()+unshare() (like systemd) systemd only uses unshare() when it is giving system services a private network or filesystem namespace. For the systemd nsspawn command, it uses clone in the same way as libvirt. Much as I'd like to use unshare(), we need to use CLONE_NEWPID, so clone() is our only option.
I understand your situation. CLONE_NEWPID does work with unshare in kernel 3.8. But nothing but rawhide has that right now...
Moving to the upstream tracker. 3.8 is available generally in fedora now, but it doesn't sound like the current libvirt way of doing things is causing obvious problems. What's the conclusion here?
lxc still uses clone() (as does libcontainer for example). I don't think this is ever going to change, so closing