Bug 1766224

Summary: erl_child_setup spams close() with large file descriptor limit
Product: Red Hat OpenStack
Reporter: John Eckersberg <jeckersb>
Component: erlang
Assignee: John Eckersberg <jeckersb>
Status: CLOSED EOL
QA Contact: nlevinki <nlevinki>
Severity: medium
Docs Contact:
Priority: high
Version: 15.0 (Stein)
CC: apevec, jeckersb, lhh, lmiccini, michele, stchen
Target Milestone: ---
Keywords: Performance, Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-30 19:15:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description John Eckersberg 2019-10-28 15:34:41 UTC
Description of problem:
Every time the Erlang VM is invoked, which happens repeatedly via rabbitmqctl during health checks, it starts the erl_child_setup process like this:

[root@controller-0 ~]# ps -ef | grep erl_child_setup
42439      17144   16985  0 Oct15 ?        00:01:19 erl_child_setup 65536

The 65536 argument is supplied by the VM and reflects the maximum number of file descriptors the process may have open.
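
For reference, the limit that `ulimit -n` sets is the soft RLIMIT_NOFILE of the process. A minimal sketch of reading it from C follows. `soft_fd_limit` is a hypothetical helper name, and this is not necessarily how the Erlang VM derives the value it passes to erl_child_setup.

```c
#include <sys/resource.h>

/* Hypothetical helper (not from the Erlang sources): return the soft limit
 * on open file descriptors for the calling process, or -1 if it cannot be
 * read. This is the value that `ulimit -n` reports in a shell. */
long soft_fd_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return (long)rl.rlim_cur;
}
```

With `ulimit -n 65536` in effect, a program started from that shell would be expected to see 65536 here.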

The erl_child_setup process then attempts to close every fd number up to that limit in a loop:

https://github.com/erlang/otp/blob/736601dd0316bd7bc6060cd4fd0379473f6db682/erts/emulator/sys/unix/erl_child_setup.c#L428-L441

Because Linux does not have closefrom(), this issues a close() call for basically every fd number up to the limit, whether or not that fd is open.
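
To illustrate the pattern described above, here is a minimal sketch of such a brute-force close loop. It is not the upstream code: `close_all_bruteforce` is a hypothetical name, and the starting fd is a parameter because the set of low fds that erl_child_setup keeps open is not stated in this report. The return value counts the calls that failed with EBADF, which is what the strace reproducer below measures.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): call close() on every fd number
 * in [first_fd, max_fds), open or not. Returns how many of those calls
 * failed with EBADF, i.e. how many targeted an fd that was not open. */
long close_all_bruteforce(int first_fd, int max_fds)
{
    long ebadf_count = 0;

    for (int fd = first_fd; fd < max_fds; fd++) {
        if (close(fd) == -1 && errno == EBADF)
            ebadf_count++;
    }
    return ebadf_count;
}
```

The cost of this loop scales with the fd limit rather than with the number of fds that are actually open, so raising the limit from 1024 to 65536 multiplies the number of system calls by roughly 64.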


Version-Release number of selected component (if applicable):
()[root@controller-0 /]# rpm -q erlang-erts
erlang-erts-20.3.8.22-1.el8ost.x86_64

How reproducible:
Always

Steps to Reproduce:
[root@controller-0 ~]# ulimit -n 65536
[root@controller-0 ~]# strace -f -e trace=close erl -noshell -eval 'init:stop().' 2>&1 | grep EBADF | wc -l
65525


Actual results:
erl_child_setup calls close() on tens of thousands of file descriptors that are not open (65525 EBADF results in the run above).

Expected results:
erl_child_setup should only close() descriptors that are actually open.

Additional info:
Comparison of VM launch time under different fd limits (run under strace so that child processes are timed as well):

[root@controller-0 ~]# ulimit -n 1024
[root@controller-0 ~]# time strace -qq -f -e trace=none erl -noshell -eval 'init:stop().'
[pid 720062] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=720061, si_uid=0} ---
[pid 720062] +++ killed by SIGUSR1 +++
[pid 719999] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=720061, si_uid=0, si_status=0, si_utime=0, si_stime=1} ---

real    0m2.103s
user    0m0.548s
sys     0m0.600s
[root@controller-0 ~]# ulimit -n 65536
[root@controller-0 ~]# time strace -qq -f -e trace=none erl -noshell -eval 'init:stop().'
[pid 722323] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=722322, si_uid=0} ---
[pid 722323] +++ killed by SIGUSR1 +++
[pid 722025] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=722322, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---

real    0m5.774s
user    0m1.014s
sys     0m2.279s

Comment 1 John Eckersberg 2019-10-28 15:56:31 UTC
I have a patch for this, will link here when I post it upstream.

Comment 2 John Eckersberg 2019-10-29 16:21:47 UTC
(In reply to John Eckersberg from comment #1)
> I have a patch for this, will link here when I post it upstream.

https://github.com/erlang/otp/pull/2438

Comment 3 John Eckersberg 2019-11-04 15:19:00 UTC
(In reply to John Eckersberg from comment #2)
> (In reply to John Eckersberg from comment #1)
> > I have a patch for this, will link here when I post it upstream.
> 
> https://github.com/erlang/otp/pull/2438

Merged upstream

Comment 4 stchen 2020-09-30 19:15:03 UTC
Closing EOL, OSP 15 has been retired as of Sept 19