Bug 1469670 - glibc: Implement vDSO-based getpid
Status: NEW
Product: Fedora
Classification: Fedora
Component: glibc
Version: 26
Assigned To: Carlos O'Donell
QA Contact: Fedora Extras Quality Assurance
Depends On: 1469757
Type: Bug
Reported: 2017-07-11 11:24 EDT by Lennart Poettering
Modified: 2017-07-20 09:23 EDT
CC: 13 users

Attachments:
Workaround for efficient systemd getpid() call (42 bytes, text/plain)
2017-07-12 11:03 EDT, Iliana Petrova
Description Lennart Poettering 2017-07-11 11:24:03 EDT
As I commented earlier on bug 1443976:

<snip>
Umm, this change causes major slowdowns in various bits of systemd code, which assumes that getpid() is fast. libsystemd uses getpid() to detect when processes fork, to then politely return an error from all calls that operate on objects that cannot be reused after fork(). This used to be dead cheap on all glibc versions, but suddenly becomes excessively expensive, see this thread:

https://lists.freedesktop.org/archives/systemd-devel/2017-July/039195.html

I find the explanation "The glibc PID cache negatively interacts with setting up containers and namespaces." quite weak... What is "negatively interacts" supposed to mean?

Is this regression going to stay? I am kinda surprised by this change I must say, as it causes real problems, and there's no clear benefit of the change...
</snip>

I opened this bug here mostly since that other bug is already closed, but I figure there should be an open bug about this regression, too.
Comment 1 Florian Weimer 2017-07-11 15:15:10 EDT
We can redirect getpid to a vDSO syscall if the kernel provides one.  I doubt that we will reintroduce the PID cache because it means that glibc needs to know about the semantics of the unshare and clone flags and reset the PID cache for the right flag combinations (an unconditional reset is impossible, unfortunately).

Your fork detection use case could also be served by a new type of mapping which is zeroed on fork (which has also been suggested for fork protection of PRNGs).

I filed bug 1469757 for the kernel side of the vDSO change.
Comment 2 Carlos O'Donell 2017-07-11 19:18:49 EDT
(In reply to Lennart Poettering from comment #0)
> As I commented earlier on bug 1443976:
> 
> <snip>
> Umm, this change causes major slowdowns in various bits of systemd code,
> which assumes that getpid() is fast. libsystemd uses getpid() to detect when
> processes fork, to then politely return an error from all calls that operate
> on objects that cannot be reused after fork(). This used to be dead cheap on
> all glibc versions, but suddenly becomes excessively expensive, see this
> thread:
> 
> https://lists.freedesktop.org/archives/systemd-devel/2017-July/039195.html

Yes, with glibc 2.25 it has become expensive to do getpid() in a loop, but that is really the only operation where the latency of the getpid syscall is not hidden by other system latencies.

In your case it looks like you've had a 20-30% performance regression in this specific code because of the tight getpid() loop. We understand that this is bad, and we apologize that you will have to rewrite this to get back your performance.

There is a solution to this, and if your only interest is in detecting usage after fork(), and only fork(), then you can use pthread_atfork() to set handlers that clear or reset your own cached values.

If you do not want to link against libpthread, then we can go upstream again with our Fedora-specific patch which allows pthread_atfork without needing to link against libpthread (already in Fedora for an SELinux workaround).

This solution however does not extend to all the places where such an application cache may need invalidation.

> I find the explanation "The glibc PID cache negatively interacts with
> setting up containers and namespaces." quite weak... What is "negatively
> interacts" supposed to mean?

The most difficult issue to get right is cache invalidation. Getting the wrong value from getpid() is fatal in many cases and so conservatively we have to get it right 100% of the time in all cases.

Even more difficult, the cache is thread-local to ensure fast access to the stored value for the thread. Thus when entering a PID namespace the cache needs to be invalidated immediately, and that requires complex machinery to signal all threads to invalidate their caches. Such synchronization machinery might require getpid() to block in all threads while caches are invalidated, and that's also terrible for performance, complexity, and maintenance costs.

The need to get the cache invalidated correctly quickly spiralled out of control over the last 5 years, and the net result is that glibc will no longer maintain a userspace cache (we had clone issues, and detached thread issues to deal with also). 

We will pursue a kernel solution which clears a value on fork, similar to the way that CLONE_CHILD_CLEARTID clears the TID when a thread dies. This ensures that the kernel is the arbiter of when the fork is complete and can tell userspace about that with a memory operation on a registered location.

Does that answer your question?
Comment 3 Florian Weimer 2017-07-11 20:17:20 EDT
(In reply to Carlos O'Donell from comment #2)
> We will pursue a kernel solution, which clears a value on fork, similar in
> the way that CLONE_CHILD_CLEARTID clears the TID when a thread dies.

I don't think this is the right design.  I am not even sure if fork is the only way to change the PID today, and who knows what the future will bring.

(Even CLONE_CHILD_CLEARTID turned out to be the wrong design for threads because thread exit is racy as a result, and the TID races in the current implementation are actually non-conforming.  We tried to paper this over with tgkill, but that only avoids sending signals to other processes.)
Comment 4 Carlos O'Donell 2017-07-11 20:54:01 EDT
(In reply to Florian Weimer from comment #3)
> (In reply to Carlos O'Donell from comment #2)
> > We will pursue a kernel solution, which clears a value on fork, similar in
> > the way that CLONE_CHILD_CLEARTID clears the TID when a thread dies.
> 
> I don't think this is the right design.  I am not even sure if fork is the
> only way to change the PID today, and who knows what the future will bring.
> 
> (Even CLONE_CHILD_CLEARTID turned out to be the wrong design for threads
> because thread exit is racy as a result, and the TID races in the current
> implementation are actually non-conforming.  We tried to paper this over
> with tgkill, but that only avoids sending signals to other processes.)

I agree that there are problems with CLONE_CHILD_CLEARTID (ownership of the affected memory becomes a problem). I did not expect we would use exactly that mechanism, my apologies if that was implied.

There needs to be a deep discussion on how we do the notification to userspace, and when and how. It won't be solved tomorrow.
Comment 5 Lennart Poettering 2017-07-12 03:59:02 EDT
Not sure I follow, but wouldn't it suffice if getpid() caching were only turned off as soon as clone() or unshare() is first called? These calls are pretty exotic; only a small number of programs will call them, but I am sure a much higher number of apps benefit from the getpid() caching. I mean, there's a lot of room between "keep the getpid() cache updated correctly in all cases" and "do not do any getpid() caching" — some form of "maintain the getpid() cache for the obvious cases, turn it off as soon as things get complicated"...

I mean, maintaining a simple cache that you just turn off eventually isn't that hard, is it? And an atomic op for doing that in a thread-safe manner is still a ton faster than doing the full syscall all the time...
Comment 6 Lennart Poettering 2017-07-12 04:04:14 EDT
> We will pursue a kernel solution, which clears a value on fork, similar in the way that CLONE_CHILD_CLEARTID clears the TID when a thread dies. This ensures that the kernel is arbiter of when the fork is complete and can tell userspace about that with a memory operation on a registered location.

> Does that answer your question?

Not really, I would have preferred if glibc wouldn't regress on this without more consideration, and not until the new kernel infrastructure for this is in place.
Comment 7 Lennart Poettering 2017-07-12 04:06:11 EDT
> Even more difficult the cache is thread local to ensure fast access to the stored value for the thread. 

Why go directly from a thread-local cache to calling the syscall? You could just convert it into a process-wide cache... that should still be quicker than the syscall...
Comment 8 Lennart Poettering 2017-07-12 04:09:01 EDT
(In reply to Florian Weimer from comment #1)

> needs to know about the semantics of the unshare and clone flags and reset
> the PID cache for the right flag combinations (an unconditional reset is
> impossible, unfortunately).

Why does "unshare()" even matter here? Last time I looked it doesn't even change the calling process' PID, but defers that for the next fork() child, where you need to invalidate anyway... What is this really about?
Comment 9 Lennart Poettering 2017-07-12 04:46:58 EDT
BTW, with this change you are breaking expressly documented behaviour

http://man7.org/linux/man-pages/man2/getpid.2.html

    "Since glibc version 2.3.4, the glibc wrapper function for
     getpid() caches PIDs, so as to avoid additional system calls when
     a process calls getpid() repeatedly."
Comment 10 Iliana Petrova 2017-07-12 11:03 EDT
Created attachment 1297042 [details]
Workaround for efficient systemd getpid() call

Lennart,

in order to quickly fix this pressing matter, I've attached code that I believe will fix your issue. It uses a custom caching mechanism and fits well into the systemd ecosystem.

Best regards.
Comment 11 Carlos O'Donell 2017-07-12 14:27:22 EDT
(In reply to Lennart Poettering from comment #7)
> Why go directly from a thread local cache to calling the syscall? You could
> just convert it into an process-wide cache... that should still be quicker
> than the syscall..

Unfortunately the community did not find a good risk/reward tradeoff for this kind of caching, given the problems we've had with the cache over the years.

Regarding giving consideration and waiting for the kernel interface to appear: yes, in the best case it's always a benefit to wait for the kernel interface. Here, there were benefits to the removal too, and in the community's judgement the immediate benefits outweighed waiting for the kernel interface to be finished.

(In reply to Lennart Poettering from comment #9)
> BTW, with this change you are breaking expressly documented behaviour
> 
> http://man7.org/linux/man-pages/man2/getpid.2.html
> 
>     "Since glibc version 2.3.4, the glibc wrapper function for
>      getpid() caches PIDs, so as to avoid additional system calls when
>      a process calls getpid() repeatedly."

The Linux man-pages project is not canonical documentation for glibc. This quote is also part of the non-normative NOTES section. Therefore no explicit contract has been broken.

I have submitted a patch to the Linux man-pages project to add a note about the change in implementation behaviour.

I have updated the glibc 2.25 release notes with more explanations for developers (https://sourceware.org/glibc/wiki/Release/2.25#pid_cache_removal).

(In reply to Lennart Poettering from comment #8)
> (In reply to Florian Weimer from comment #1)
> 
> > needs to know about the semantics of the unshare and clone flags and reset
> > the PID cache for the right flag combinations (an unconditional reset is
> > impossible, unfortunately).
> 
> Why does "unshare()" even matter here? Last time I looked it doesn't even
> change the calling process' PID, but defers that for the next fork() child,
> where you need to invalidate anyway... What is this really about?

Is pthread_atfork() a usable solution for systemd?
Comment 12 Lennart Poettering 2017-07-17 08:58:13 EDT
> Is pthread_atfork() a usable solution for systemd?

Hmpf, the pthread_atfork() API is just fucked...

There's no way to unregister an atfork handler, is there? This means we'd have to mark libsystemd somehow as RTLD_NODELETE, so that this doesn't fall apart as soon as somebody pulls in libsystemd.so as a dlopen() dep, makes use of it for a short time and then unloads it again with dlclose(). If we can't unload our pthread_atfork() handler in that case we'd cause a bad memory access on the next fork().

What a mess!
Comment 13 Carlos O'Donell 2017-07-17 15:15:07 EDT
(In reply to Lennart Poettering from comment #12)
> > Is pthread_atfork() a usable solution for systemd?
> 
> Hmpf, the pthread_atfork() API is just fucked...
> 
> There's no way to unregister an atfork handler, is there? This means we'd
> have to mark libsystemd somehow as RTLD_NODELETE, so that this doesn't fall
> apart as soon as somebody pulls in libsystemd.so as a dlopen() dep, makes
> use of it for a short time and then unloads it again with dlclose(). If we
> can't unload our pthread_atfork() handler in that case we'd cause a bad
> memory access on the next fork().
> 
> What a mess!

It can be done with -Wl,-z,nodelete.

This is obviously just a workaround, and even then I think we might be able to automatically unregister such functions at dlclose() time transparently. Apparently IBM's AIX used to do this also, but we'd have to have a bulletproof way of doing that. I'm not sure of all of the consequences involved.

Even if we fixed the unregister at dlclose() you would have to wait for such a glibc to propagate out, and so you'd need nodelete flags for your DSO until you know it's OK.
Comment 14 Florian Weimer 2017-07-17 16:01:44 EDT
(In reply to Lennart Poettering from comment #12)
> > Is pthread_atfork() a usable solution for systemd?
> 
> Hmpf, the pthread_atfork() API is just fucked...
> 
> There's no way to unregister an atfork handler, is there? This means we'd
> have to mark libsystemd somehow as RTLD_NODELETE, so that this doesn't fall
> apart as soon as somebody pulls in libsystemd.so as a dlopen() dep, makes
> use of it for a short time and then unloads it again with dlclose(). If we
> can't unload our pthread_atfork() handler in that case we'd cause a bad
> memory access on the next fork().

glibc is supposed to do this automatically.  There's a definition of pthread_atfork in libpthread_nonshared.a which passes __dso_handle to the actual implementation.  At dlclose time, we search for handlers with a matching __dso_handle and automatically unregister them.
Comment 15 Carlos O'Donell 2017-07-18 10:02:03 EDT
(In reply to Florian Weimer from comment #14)
> (In reply to Lennart Poettering from comment #12)
> > > Is pthread_atfork() a usable solution for systemd?
> > 
> > Hmpf, the pthread_atfork() API is just fucked...
> > 
> > There's no way to unregister an atfork handler, is there? This means we'd
> > have to mark libsystemd somehow as RTLD_NODELETE, so that this doesn't fall
> > apart as soon as somebody pulls in libsystemd.so as a dlopen() dep, makes
> > use of it for a short time and then unloads it again with dlopen(). If we
> > can't unload our pthread_atfork() handler in that case we'd cause a bad
> > memory access on the next fork().
> 
> glibc is supposed to do this automatically.  There's a definition of
> pthread_atfork in libpthread_nonshared.a which passes __dso_handle to the
> actual implementation.  At dlclose time, we search for handlers with a
> matching __dso_handle and automatically unregister them.

Good catch. I'd missed the registration of __unregister_atfork() via the process-global destructors (hidden by a macro). So it looks like we don't have anything to fix: if you dlopen a library with atfork handlers, they will be removed at unload time.
Comment 16 Lennart Poettering 2017-07-20 09:23:05 EDT
Weird undocumented automatism, but OK. I figure that'll do then. Still not happy that this semi-broke all kinds of systemd, PulseAudio, libcanberra installations already out there (and even got backported to "stable" stuff), but I'll work around this now at least in libsystemd... Thanks.
