Bug 682922 - [RFE] Fork performance (for large processes) is slow even with copy-on-write semantics.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: glibc
Version: 7.4
Hardware: All
OS: Linux
Target Milestone: rc
Assignee: glibc team
QA Contact: qe-baseos-tools
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-03-08 01:13 UTC by Dennis
Modified: 2018-07-28 21:29 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-03 03:31:07 UTC
Target Upstream Version:


Attachments
This example shows fork performance degrading as process size increases. (1.44 KB, text/x-c++src)
2011-03-08 01:13 UTC, Dennis
vfork-safely.c (10.25 KB, text/plain)
2014-09-25 17:50 UTC, Carlos O'Donell
Test code to demonstrate fork() delays (1.59 KB, text/x-csrc)
2018-07-28 02:34 UTC, xrobau
Sample code that demonstrates fork() delays (unoptimizable by GCC) (1.88 KB, text/x-csrc)
2018-07-28 21:28 UTC, xrobau

Description Dennis 2011-03-08 01:13:34 UTC
Created attachment 482817 [details]
This example shows fork performance degrading as process size increases.

Description of problem:

I have attached an example program that highlights slow forking. As process size increases (from near 0MB, to 500MB, then 1GB, then 2GB), fork performance plummets in my example program. Note that fork (even with copy-on-write) still duplicates the page tables from the parent to the child (as well as other accounting). That duplication clearly affects fork performance.

On some Linux servers I see fork performance plummet to 0.2 seconds per fork. For database servers that drop in fork performance is very noticeable.

vfork avoids this issue (no slow page table copy); but vfork is to be avoided since it suspends the parent process. For example a vfork followed by a child exec that fails may leave the parent suspended; as well as other vfork quirks.  All the advice I've gotten is avoid vfork since the parent suspension is just a generally bad characteristic (especially for multi-threaded database servers).

The clone function only avoids the slow page table copy issue when the CLONE_VM flag is used, but you can only use CLONE_VM safely with CLONE_VFORK, hence, clone will behave the same as vfork.

posix_spawn under the covers is implemented either by fork/exec or vfork/exec (in glibc). Plus posix_spawn is not flexible enough with respect to setting current-working-directory of child.

My primary question is: will "copy-on-write shared page table entries" ever be implemented for Linux fork?

David McCracken's proposed "Sharing Page Tables in the Linux Kernel"

  www.kernel.org/doc/ols/2003/ols2003-pages-315-320.pdf

has floated around for years. He also produced a candidate patch
  
  http://thread.gmane.org/gmane.linux.kernel/327286

Is there any scope for the Linux kernel to implement a variant of David's proposal or any other proposal that avoids the slow copy page tables fork issue?

Thanks,

Dennis.

Steps to Reproduce:
1. g++ -O2 fork_bench.cc -o fork_bench
2. ./fork_bench

Comment 2 RHEL Program Management 2011-04-04 02:48:05 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 3 RHEL Program Management 2011-10-07 15:24:56 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Rik van Riel 2011-11-04 18:17:19 UTC
Having this kind of performance enhancement show up in the upstream kernel would be a welcome benefit for future versions of RHEL.

However, adding something so invasive to an already released version of RHEL is not commonly done. We'll be better off simply closing this bug for now, and revisiting things if/when such a performance optimisation shows up upstream.

Comment 5 Dennis 2011-11-05 23:52:54 UTC
Hello Rik & Red Hat team,

The purpose of my original request was to inform Red Hat that fork performance
is a real killer for us. "Us" being a software development house specializing in high
performance text searching.

Can you please inform upstream that this issue is real. I can tell upstream, but
they will ignore me. If Red Hat informs upstream (on behalf of me) that is much
more tangible & likely to lead to something.

Basically in 2011 fork/exec sucks for processes larger than 1GB (e.g. any server).

Note, Windows can spawn processes much more efficiently no matter the size
of  the parent process (via CreateProcess function).

I am fine with closing this......but can you please log this upstream?

Thanks,

Dennis.

Comment 6 Rik van Riel 2011-11-06 00:08:51 UTC
If the goal is to fork off helper processes more quickly, I wonder if something resembling CreateProcess, implemented in the C library, using pthread_create and exec may not be the best option.

I believe exec from a thread will end up replacing the mm of just that thread with the new executable, while the creation of a thread does not require that all the page tables are copied over.

I have heard several requests for a Unix equivalent of the CreateProcess interface, but really don't know if the kernel or glibc would be the best place to do it.

Reassigning to the glibc component so we can kick off a discussion on what the best way would be to speed up the creation of a helper process for very large processes. A simple solution to this problem would be best; copy-on-write page tables seem like overkill for this problem.

Comment 7 Dennis 2011-11-06 00:44:46 UTC
Hi Rik,

All I need is the ability to spawn standalone processes. I want that operation
to be speedy and consistent no matter the size of the parent process. Fork/exec
does not meet that requirement (it's too slow for large parent processes; the
larger the parent, the worse the problem becomes).

A Linux equivalent of CreateProcess would be awesome.

If it can be done all in LibC, great. However I would have thought exec'ing in a
thread would replace the complete process not just the thread? But, I'll leave
that discussion to folks more capable than me.

Cheers,

Dennis.

Comment 8 Rik van Riel 2011-11-06 01:03:34 UTC
The Linux kernel already has most of the bits required to do the equivalent of exec from a thread. This is used to start a helper process from kernel space.

It looks like exec currently does replace the entire process, by killing all the threads, but I imagine it would be possible to create a special thread (clone with a special flag like CLONE_VFORK, but with somewhat different semantics) allowing a subsequent exec to replace only that thread.

Maybe there are some userspace snafus that make this scheme impossible, but I cannot think of any now...

Comment 9 Dennis 2011-11-07 00:01:29 UTC
Rik,

We here very much encourage your suggestion (CreateProcess via
thread/exec type functionality).

We would use it for sure if it were available since it would solve a massive
issue for our database server on Linux.

Note, if a separate item is opened on another mailing list please let me know
since I would like to follow any conversation centered around this issue.

Many thanks,

Dennis.

Comment 10 Andreas Schwab 2011-11-07 15:01:15 UTC
posix_spawn(3)

Comment 11 Dennis 2011-11-07 23:15:12 UTC
Andreas,

No, posix_spawn is not a solution as implemented today. Please read
the first message of this support item carefully.

By default posix_spawn is implemented as a wrapper around
fork/exec with  exactly the same performance problems as
listed in the first message of this item (performance degradation
as process size increases).

Secondly, posix_spawn can be directed to use vfork/exec
instead of fork/exec via POSIX_SPAWN_USEVFORK flag. That
avoids the performance problems, but vfork itself is completely 
undesirable for server type processes such as our database server.
vfork suspends multi-thread programs between vfork & exec. Plus
there are other vfork problems (see the first message of this support
item).

Hence, as it stands today posix_spawn is no solution. Same goes
for the clone call which either provides fork like behavior or vfork type
behavior, both of which have major problems.

I will state it as a fact, it is not possible on Linux today to launch
processes speedily (and safely) from a parent process that is 
larger than 1GB in size. 

That problem does not exist on Windows since the CreateProcess
call can spawn processes at a constant (speedy) rate no matter
the size of the parent process, AND without stalling the parent
process such as vfork does.

I object to this item being closed now especially after Rik's 
suggestion got my hopes up for a real solution.

posix_spawn is nothing but a wrapper around fork/exec or
vfork/exec and both of those function pairs are majorly
problematic.

Dennis.
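
For reference, a minimal sketch of the posix_spawn usage described in this comment. It assumes glibc, where POSIX_SPAWN_USEVFORK is a GNU extension exposed by _GNU_SOURCE (on glibc versions where the flag still has an effect); spawn_and_wait is a hypothetical helper name, not something from this bug or its attachments. Note that nothing in the call lets the caller choose the child's working directory.

```c
#define _GNU_SOURCE            /* exposes the glibc-specific POSIX_SPAWN_USEVFORK */
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: spawn argv[0], wait for it, and return its exit
   code, or -1 on any error. */
static int spawn_and_wait(char *const argv[])
{
    posix_spawnattr_t attr;
    pid_t pid;
    int status, rc;

    if (posix_spawnattr_init(&attr) != 0)
        return -1;
#ifdef POSIX_SPAWN_USEVFORK
    /* Request the vfork-based path rather than the fork-based one. */
    posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK);
#endif
    /* No file actions (NULL); the child inherits our environment. */
    rc = posix_spawn(&pid, argv[0], NULL, &attr, argv, environ);
    posix_spawnattr_destroy(&attr);
    if (rc != 0)
        return -1;             /* rc is an errno value, not -1 */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a NULL-terminated argument vector whose first element is the program path, for example { "/bin/sh", "-c", "exit 7", NULL }.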

Comment 12 Rik van Riel 2011-11-07 23:36:35 UTC
Andreas, would a kernel side option to clone, that allows an exec afterwards that only replaces ONE thread out of a process, be an option here?

That way we can implement posix_spawn in Linux in a way that does not have users run into the memory overcommit or performance issues of forking a very large process.

If this seems reasonable from the glibc side, I am willing to take on the kernel side of this.

Comment 13 Dennis 2011-11-08 00:07:50 UTC
Guys,

One of the other problems we faced with posix_spawn was that it
was not possible to specify the working directory of the child process.
That was a showstopper for us. We actually think this is a major
failing of posix_spawn.

With fork & exec one can do a chdir in the child prior to an exec.

With CreateProcess on Windows the lpCurrentDirectory parameter
can be set to specify the CWD of the spawned process.

We need such a facility (along with performance obviously).

Dennis.

Comment 14 Dennis 2011-12-29 00:55:32 UTC
Rik,

This item has gone quiet, everybody busy with their day to day.

Anyway, I do want to re-emphasize my very strong interest in native
"exec-from-thread" functionality. If available we would use it in 
our custom database server where fork/exec currently provides
terrible performance (up to 0.3 secs per fork/exec when our database
grows larger than 1GB in size).

You did indicate that an implementation would theoretically be 
possible.

Any feeling that you think this will actually happen? Did Andreas 
ever respond to your question from Nov 7th?

Regards,

Dennis.

Comment 16 Carlos O'Donell 2014-09-19 16:53:23 UTC
I'm reopening this bug to continue discussions around a possible userspace and kernel coordinated solution to this problem.

Comment 17 Carlos O'Donell 2014-09-23 03:11:23 UTC
Dennis,

You write:
~~~
vfork avoids this issue (no slow page table copy); but vfork is to be avoided since it suspends the parent process. For example a vfork followed by a child exec that fails may leave the parent suspended; as well as other vfork quirks. All the advice I've gotten is avoid vfork since the parent suspension is just a generally bad characteristic (especially for multi-threaded database servers).
~~~

I have discussed this issue in detail with kernel, core runtime, and debugger experts.

In summary I believe vfork is still a possible immediate solution to your needs.

(1) Vfork in Linux will only suspend the calling *thread* (single task), not the process; therefore it does scale and does not suspend the entire process. Thus you can create a distinct thread that acts as the "launcher" for your helper processes without suspending any other thread in your multi-threaded application. If your application is not threaded, then you could create one thread specifically to leave the original process unblocked and running. If creating a thread isn't an option for some reason, then you may call clone without CLONE_VFORK but with CLONE_VM, and pass a new stack to the cloned child. The stack you pass must only be freed when the child exits, and that should be detectable with waitpid or an O_CLOEXEC shared fd. If you still wish to support fork for this use case then this RFE would remain open to track that since it's still an enhancement to fork.

(2) There are some things you need to do to make it 100% bulletproof. Vfork is difficult to use correctly and some things need to be done carefully. The following are the two most serious problems I am aware of:

* Do not call any set*id functions from other threads while vfork-ing. Doing this could result in two threads with distinct UIDs sharing the same memory space. As a concrete example, a thread might be running as root, vfork a helper, and then proceed to setuid to a lower-privileged user and run some untrusted code. In this case the higher-privileged root-uid child shares the same address space as the low-privileged threads. The low-privileged threads might then remap parts of the address space to get the root-uid child, which has not yet exec'ed, to execute something else entirely. This is why you should be careful about calling set*id functions while vforking. You avoid this problem by coordinating your credential transitions to happen after you know your vfork is complete, i.e. the parent is resumed, which tells you the child has completed exec'ing.

* Block all signals in the parent before calling vfork. This is for the safety of the child, which inherits signal dispositions and handlers. The child, running on the parent's stack, may be delivered a signal. For example, on Linux a killpg call delivering a signal to a process group may deliver the signal to the vfork-ing child, and you want to avoid this. The easy way to do this is via sigfillset and pthread_sigmask, and then to undo this when you return to the parent. To be completely correct the child should set all non-SIG_IGN signals to SIG_DFL and then restore the original signal mask, thus allowing the vforking child to receive signals that were actually intended for it, but without executing any handlers the parent had set up that could corrupt state. When using glibc and Linux these functions, i.e. sigfillset, pthread_sigmask, etc., are safe to use after vfork, along with chdir(), which you require.

With (1) and (2) you should be able to implement a solution using vfork that does what you want and scales.

Thus an RFE to make fork scale is a "nice to have" but not critical given the other options.
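
The sequence in (2) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the vfork-safely.c attachment: vfork_exec_in_dir is a hypothetical helper name, the sketch assumes Linux with glibc, and it uses sigprocmask (which on Linux acts on the calling thread, like pthread_sigmask). It also assumes no other thread calls a set*id function while it runs.

```c
#define _GNU_SOURCE            /* for NSIG */
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with working directory cwd.
   Returns the child's exit code (127 if chdir or exec failed), or -1. */
static int vfork_exec_in_dir(const char *cwd, char *const argv[])
{
    sigset_t all, saved;
    struct sigaction cur, dfl;
    int status, sig;
    pid_t pid;

    /* Prepare everything the child needs before vfork. */
    dfl.sa_handler = SIG_DFL;
    dfl.sa_flags = 0;
    sigemptyset(&dfl.sa_mask);

    /* Block every signal in this thread so none reaches the child
       while it is still running on our stack with our handlers. */
    sigfillset(&all);
    if (sigprocmask(SIG_SETMASK, &all, &saved) != 0)
        return -1;

    pid = vfork();
    if (pid == 0) {
        /* Child: reset every handled signal to its default action;
           failures (SIGKILL, SIGSTOP, reserved signals) are ignored. */
        for (sig = 1; sig < NSIG; sig++) {
            if (sigaction(sig, NULL, &cur) == 0 && cur.sa_handler != SIG_IGN)
                sigaction(sig, &dfl, NULL);
        }
        /* Handlers are gone, so restoring the original mask is safe. */
        sigprocmask(SIG_SETMASK, &saved, NULL);
        if (chdir(cwd) == 0)
            execv(argv[0], argv);
        _exit(127);            /* chdir or exec failed; never return */
    }

    /* Parent: resumes only after the child has exec'ed or exited. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    if (pid < 0)
        return -1;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Only the calling thread is suspended between vfork and exec, so a server could call a helper like this from a dedicated launcher thread as described in (1).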

Comment 18 Dennis 2014-09-23 03:29:35 UTC
Hello Carlos,

Thank you for your very informative post. It does indeed make me think that vfork/exec actually could be a viable solution to my problem; the 100% bulletproof tips also appear to be critically important.

Suspending only the calling thread will not be an issue for us, we will not need to spawn a thread just to vfork/exec. Apologies for my incorrect belief that vfork suspended the process not the thread.

Good quality documentation for problem-free fast constant time Linux process creation is hard to come by. Your post just now may be the best and clearest information I've gotten on this topic.

What is your thinking in regards to fork RFE?

vfork may work as you say, but to get it right seems very very hard for Joe Average (your two bullet proof tips), most folks would have no idea they would need to do all that. Hence, if fork/exec is slow then they move to vfork/exec they probably will encounter "hard-to-debug" issues because they did not know they needed to do the "bulletproofing".

My thinking is, if "fast constant time spawn only from thread" fork/posix_spawn enhancement is viable to do then it should be done since it could benefit many other database server vendors.

If Linus or the glibc team are not inclined to do this then maybe it should be closed.

Let me know your thoughts and your fellow engineer's thoughts.

Thanks,

Dennis.

Comment 19 Dennis 2014-09-23 03:43:31 UTC
Carlos,

You listed two important tips for safe vforking (no set*id calls, block signals). Are there any other steps we should follow?

For example, if the child exec fails what happens to the frozen parent? Will it be frozen forever? Should you actually always spawn a "throw away thread", do vfork, do exec and then junk the "throw away thread"? If the exec fails how will we know?

I'm just asking for the best and safest vfork/exec tips from your esteemed team.

Cheers,

Dennis.

Comment 20 Carlos O'Donell 2014-09-23 13:40:48 UTC
(In reply to Dennis from comment #18)
> Thank you for your very informative post. It does indeed make me think that
> vfork/exec actually could be a viable solution to my problem; the 100%
> bulletproof tips also appear to be critically important.

I agree. It's entirely our fault, and when I say that I mean that Linux, glibc, and the Linux kernel man pages are all at fault for not providing a canned sequence showing the correct operation. Worse, we don't tell you that we guarantee the signal handling functions work between vfork and exec.
 
> Good quality documentation for problem-free fast constant time Linux process
> creation is hard to come by. Your post just now may be the best and clearest
> information I've gotten on this topic.

Thank you. I will work to get all of this documented upstream and in the manual page.
 
> What is your thinking in regards to fork RFE?

You can feel free to leave it open as an RFE until you complete a vfork/exec implementation on your side. Until then this RFE stands as a request to make fork more performant to avoid all of the work required with vfork.
 
> vfork may work as you say, but to get it right seems very very hard for Joe
> Average (your two bullet proof tips), most folks would have no idea they
> would need to do all that. Hence, if fork/exec is slow then they move to
> vfork/exec they probably will encounter "hard-to-debug" issues because they
> did not know they needed to do the "bulletproofing".

That's correct. It's a documentation issue I plan to correct.

> My thinking is, if "fast constant time spawn only from thread"
> fork/posix_spawn enhancement is viable to do then it should be done since it
> could benefit many other database server vendors.

I agree. We are going to fix posix_spawn upstream to ensure it's safe and uses vfork as much as possible.

> If Linus or the glibc team are not inclined to do this then maybe it should be
> closed.

I haven't seen any opposition to the idea of making fork faster, or making posix_spawn use vfork more often, it's just that such a change is inherently disruptive to the RHEL product which should remain stable. Therefore we work hard to try find solutions and workarounds for our customers based on the stable platform that is already deployed.

An RFE like this might take until RHEL8 or RHEL9 to get fixed if we have enough interest.
 
> Let me know your thoughts and your fellow engineer's thoughts.

I would say leave this ticket open until you complete a vfork solution, and then close it.

Comment 21 Carlos O'Donell 2014-09-23 13:56:54 UTC
(In reply to Dennis from comment #19)
> You listed two important tips for safe vforking (no set*id calls, block signals).
> Are there any other instructions we should do?

The other tips largely depend on how much sharing of information you need to do with the child before exec.

For example it would be best practice to have file descriptors set to O_CLOEXEC (close on exec) if you don't want to or can't close them in the child after exec. This is also a convenient way to know the child has exec'ed.
 
> For example, if the child exec fails what happens to the frozen parent? Will
> it be frozen forever? Should you actually always spawn a "throw away
> thread", do vfork, do exec and then junk the "throw away thread"? If the
> exec fails how will we know?

If vfork fails the child is never created and the parent can check the return value.

If the exec fails in the child your only option is to call _exit, which terminates the child and resumes the parent. The rest of the execution is similar to the case where it didn't fail, namely waiting for the termination of the child.
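
One common way to combine the two points above (a close-on-exec descriptor and _exit on failure) is sketched below. It is an illustration only, not the code promised in this comment: spawn_checked is a hypothetical helper name, pipe2 is a Linux-specific call, and the signal blocking described in comment 17 is left out to keep the sketch short.

```c
#define _GNU_SOURCE            /* for pipe2() */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0]. Returns 0 and stores the child's
   pid if the exec succeeded, otherwise returns the errno of the failure. */
static int spawn_checked(char *const argv[], pid_t *pid_out)
{
    int fds[2], err = 0, child_err = 0;
    pid_t pid;
    ssize_t n;

    /* Both ends are close-on-exec: a successful exec closes the child's
       write end, so the parent sees end-of-file instead of data. */
    if (pipe2(fds, O_CLOEXEC) != 0)
        return errno;

    pid = vfork();
    if (pid == 0) {
        execv(argv[0], argv);
        /* Only reached when exec failed: report why, then _exit. */
        child_err = errno;
        if (write(fds[1], &child_err, sizeof child_err) < 0) {
            /* Nothing useful can be done here. */
        }
        _exit(127);
    }
    if (pid < 0) {
        err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }

    close(fds[1]);
    do {
        n = read(fds[0], &err, sizeof err);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    if (n == (ssize_t)sizeof err) {
        waitpid(pid, NULL, 0); /* exec failed: reap the child */
        return err;
    }
    *pid_out = pid;            /* exec succeeded; caller waits on pid later */
    return 0;
}
```

On success the caller still owns the child and is expected to call waitpid on the returned pid when appropriate.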

> I'm just asking for the best and safest vfork/exec tips from your esteemed
> team.

Let me put together some C code with the exact sequence since I'll be using that to document what should be done.

Comment 22 Dennis 2014-09-24 01:59:25 UTC
Thanks Carlos, great information.

I look forward to the C code being posted here.

Cheers, Dennis.

Comment 23 Carlos O'Donell 2014-09-25 17:50:23 UTC
Created attachment 941229 [details]
vfork-safely.c

This is an initial draft of the example that shows how to safely vfork from a multi-threaded process.

If I get any more feedback from our internal review I'll update the attachment and tell you about it.

Unfortunately posix_spawn* functions can't be used safely until we fix some issues with them, and thus vfork (with no set*id calls) is the safest interface to use right now.

Comment 24 Dennis 2014-09-26 02:08:04 UTC
Thank you Carlos, your example looks really great; all the gotchas are listed and the comments are very informative. Many thanks indeed.

As for our time-lines, we are in Release Mode right now, hence we won't be able to do the integration and validity testing until November/December.

But I see no more roadblocks in terms of API changes required for us. We won't use posix_spawn due to the chdir issue. vfork/exec is sufficient for us.

I really do appreciate your assistance.

Dennis.

Comment 27 Carlos O'Donell 2016-08-09 18:54:34 UTC
Given that RHEL 6.9 is in production phase 2 we will not be considering this enhancement for RHEL 6. I've moved this to RHEL 7 to consider. This seems more feasible given some of the recent upstream glibc enhancements to posix_spawn to use vfork in more cases without burdening the developer with the details of the usage of vfork.

Comment 30 Carlos O'Donell 2016-11-26 16:08:53 UTC
In summary:

Upstream work on an enhanced posix_spawn is better tested and supports using clone, not fork, for better scalability. We can review this in rhel-7.5.0 to see what to do.

Notes:
There is an uncomfortable amount of change around the new posix_spawn patches which might mean this enhancement is only ever for rhel-8. In particular the changes in semantics to the cached PID to remove the -1 setting will need re-review given rhel-7's age (2.17), but it does fix some pthread_join race conditions (can't join a thread in the middle of this transition).

Comment 31 Carlos O'Donell 2016-11-26 16:14:02 UTC
Dennis,

What kind of posix_spawn enhancements would you need for the particular use case you have today in your application?

I want to make sure that posix_spawn's new vfork usage matches your needs before we go forward to consider a backport for, say, rhel-7.5.

Thanks.

Comment 33 Carlos O'Donell 2018-04-03 03:31:07 UTC
The upstream posix_spawn{p} implementation (commit 9ff72da471a509a8c19791efe469f47fa6977410) has seen a dozen follow-on patches to fix issues and correct bugs.

There is a lot of churn here though and the fixes really require the tid/pid cache cleanup which could negatively impact some RHEL 7 users.

Therefore for now I'm going to mark this as CLOSED/WONTFIX. The use of posix_spawn{p} as a replacement for the slow fork will have to wait until the next release of RHEL where we rebase glibc.

Comment 34 xrobau 2018-07-28 02:32:39 UTC
This ticket was incredibly helpful in diagnosing some issues we are having in Asterisk on RHEL, where fork() was causing audio breakups.

After investigation, it turned out that upgrading glibc to 2.27 made almost no difference to the speed of fork().  However, an unexpected resolution was discovered when using gcc 7.3.1 to compile our test code - the issue went away. Totally.

gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)
[root@freepbx g]# gcc -O2 -o f forktest.c
[root@freepbx g]# time ./f
Time taken per fork:            0.000250
Time taken with 500MB:          0.000248
Time taken with 1GB:            0.000248
Time taken with 2GB:            0.000247
Time taken with 4GB:            0.000245

real    0m0.125s
user    0m0.074s
sys     0m0.063s
[root@freepbx g]#

Using the standard gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC), the fork is much slower, and real time for the same code was 3.3s.  

I hope this information may help someone else who has gone down the same rabbit hole that I have over the past few days!

Comment 35 xrobau 2018-07-28 02:34:00 UTC
Created attachment 1471184 [details]
Test code to demonstrate fork() delays

Comment 36 xrobau 2018-07-28 03:25:33 UTC
Sigh. Please ignore my previous statement.  gcc 7.3 is just much smarter about optimizing out unused code, so it never actually malloc()s, which is why it's fast.

Comment 37 xrobau 2018-07-28 21:28:22 UTC
Created attachment 1471303 [details]
Sample code that demonstrates fork() delays (unoptimizable by GCC)

This explicitly reads from the malloc()'ed area into a volatile int, which ensures that GCC will not optimize the code out.

