Bug 1242406 - performance degradation in "ulimit -u 7810" environments (default on rhel6.3+ and rhel7 deployments)
Summary: performance degradation in "ulimit -u 7810" environments (default on rhel6.3+ and rhel7 deployments)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: bash
Version: 6.8
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
: ---
Assignee: Siteshwar Vashisht
QA Contact: BaseOS QE - Apps
URL:
Whiteboard:
Depends On:
Blocks: 1269194
 
Reported: 2015-07-13 09:38 UTC by Christian Horn
Modified: 2019-09-12 08:37 UTC (History)
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-15 18:16:45 UTC
Target Upstream Version:
Embargoed:


Attachments
example code, different runtime on rhel5 and rhel6 (158 bytes, application/x-shellscript), 2015-07-13 09:38 UTC, Christian Horn
fast situation: perf record -o perf_ulimit_u_1024 ./new3.sh (2.89 MB, application/x-gzip), 2015-07-21 14:08 UTC, Christian Horn
slow situation: perf record -o perf_ulimit_u_7810 ./new3.sh (3.37 MB, application/x-gzip), 2015-07-21 14:10 UTC, Christian Horn


Links
Red Hat Knowledge Base (Solution) 1444773

Description Christian Horn 2015-07-13 09:38:34 UTC
Created attachment 1051348 [details]
example code, different runtime on rhel5 and rhel6

Description of problem:
syscalls seem to take longer on rhel6 than on rhel5

Version-Release number of selected component (if applicable):
All rhel5 releases seem to perform considerably better than all rhel6 releases here

How reproducible:
always

Steps to Reproduce:
1. Set up e.g. rhel5.8 and rhel6.5 systems; these can be KVM guests
2. $ cat >new3.sh<<'EOT'
#!/bin/sh

i=10000;

until test $i -eq 1; do

        MYVAR=$(date)
        /bin/sh &

        i=$((i-1));

done

MYPID=$$
top -p $MYPID -n1 |tail -2|head -1
EOT

3. chmod +x new3.sh
4. /usr/bin/time ./new3.sh

Actual results:
RHEL5:
 9559 root 25   0 64128 1368  948 S  0.0  0.1   0:01.86 new3.sh
3.57user 8.24system 0:11.77elapsed 100%CPU [..]
RHEL6:
28951 root      20   0  105m 2760 1068 S  0.0  0.3   0:05.61 new3.sh
11.83user 8.92system 0:20.81elapsed 99%CPU [..]

Expected results:
RHEL6 should be equally fast

Additional info:
- The originally reported issue was for the OSWatcher script, used for monitoring purposes.  The code above is a version simplified to the parts which seem performance-relevant.
- Besides looking at overall performance, it was noticed that overall "cpu accounting" as shown in top appears higher for rhel6.  We assumed this accounting to be a variant of bz812148, but none of the countermeasures reported to help in other cases helped here.
- Copying bash from rhel5 to rhel6 (plus compat libraries), the runtime stays longer on rhel6.

Comment 2 Christian Horn 2015-07-13 09:42:28 UTC
Comparing the time spent in syscalls, from
  $ strace -fc ./new3.sh >& strace_fc_out.txt

RHEL5:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.61    0.211656           6     37845      9916 wait4
  9.97    0.041700           0    127122      7071 read
  6.83    0.028555           0    320031           mmap
  6.29    0.026293           1     20001           clone
[..]
------ ----------- ----------- --------- --------- ----------------
100.00    0.418186               2413840    104281 total

RHEL6:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 58.25    0.477852          12     39999      9999 wait4
 14.92    0.122369           1    130037      9998 read
  4.48    0.036780           0    310035           mmap
  3.34    0.027412           0    150041        12 open
[..]
------ ----------- ----------- --------- --------- ----------------
100.00    0.820333               2614030    100036 total

Comment 3 Christian Horn 2015-07-13 09:50:55 UTC
The runtimes in the description are just one time runs.  Also looking over multiple runs, the differences stay.  This was taken on a different system,  hp-dl580g5-01.rhts.eng.bos.redhat.com (rhel5.8 vs. rhel6.5):

# looking at "cputime" as shown by "top":
rhel5: min:06.05 max:06.51 samples:50 average:  6.25
rhel6: min:12.26 max:12.52 samples:23 average: 12.39

# the average runtime of the script
rhel5: min:29.73 max:30.16 samples:50 average: 29.94
rhel6: min:48.08 max:48.81 samples:23 average: 48.31   

# average "sys" time from output of "/usr/bin/time ./new3.sh"
rhel5: min:34.40 max:35.12 samples:50 average: 34.82
rhel6: min:38.65 max:39.31 samples:23 average: 38.92

# average "usr"  time from output of "/usr/bin/time ./new3.sh"
rhel5: min:11.80 max:12.24 samples:50 average: 12.04
rhel6: min:24.79 max:25.61 samples:23 average: 25.11
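
A small harness along these lines produces numbers in this format (a rough sketch, not the exact script that was used; cputime comes from the top line that new3.sh itself prints):

# run the testcode N times and summarize the elapsed ("time") figures
for n in $(seq 1 50); do
    /usr/bin/time -f "%e" ./new3.sh >/dev/null 2>>elapsed.txt   # time(1) writes %e to stderr
done
sort -n elapsed.txt | awk 'NR==1{min=$1} {sum+=$1} END{printf "min:%s max:%s samples:%d average: %.2f\n", min, $1, NR, sum/NR}'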

Comment 6 Christian Horn 2015-07-13 13:15:28 UTC
The numbers in comment #3 are from a physical system; the description and comment #2 are from KVM.

Comment 9 Prarit Bhargava 2015-07-13 18:48:03 UTC
(In reply to Christian Horn from comment #3)
> The runtimes in the description are just one time runs.  Also looking over
> multiple runs, the differences stay.  This was taken on a different system, 
> hp-dl580g5-01.rhts.eng.bos.redhat.com (rhel5.8 vs. rhel6.5):
> 
> # looking at "cputime" as shown by "top":
> rhel5: min:06.05 max:06.51 samples:50 average:  6.25
> rhel6: min:12.26 max:12.52 samples:23 average: 12.39
> 
> # the average runtime of the script
> rhel5: min:29.73 max:30.16 samples:50 average: 29.94
> rhel6: min:48.08 max:48.81 samples:23 average: 48.31   
> 
> # average "sys" time from output of "/usr/bin/time ./new3.sh"
> rhel5: min:34.40 max:35.12 samples:50 average: 34.82
> rhel6: min:38.65 max:39.31 samples:23 average: 38.92
> 
> # average "usr"  time from output of "/usr/bin/time ./new3.sh"
> rhel5: min:11.80 max:12.24 samples:50 average: 12.04
> rhel6: min:24.79 max:25.61 samples:23 average: 25.11

chorn, since this is a "new" issue and we're nearing the end of the RHEL6 lifecycle, can you check to see if older versions of RHEL6 also have this problem?

P.

Comment 10 Christian Horn 2015-07-14 10:04:57 UTC
Prarit, it seems this issue appeared within the rhel6 line.

### results for KVM guests (host fedora22)
cputime = "cpu time the process did use up, according to 'top'"
time = apsolute time the script needed to complete

cputime min:02.20 max:02.59 samples:50 average:  2.39  # rhel5.6GA
time    min:13.77 max:14.91 samples:50 average: 14.02

cputime min:02.30 max:02.74 samples:50 average:  2.48  # rhel5.9GA
time    min:13.93 max:14.48 samples:50 average: 14.16

cputime min:02.29 max:02.90 samples:50 average:  2.52  # rhel6.0GA
time    min:10.01 max:11.03 samples:50 average: 10.31

cputime min:02.24 max:03.00 samples:50 average:  2.58  # rhel6.2GA
time    min:09.90 max:11.50 samples:50 average: 10.40

cputime min:02.52 max:03.25 samples:50 average:  2.66  # rhel6.2GA w/ 6.4GA kernel
time    min:09.91 max:12.82 samples:50 average: 10.44

cputime min:02.58 max:03.63 samples:50 average:  2.82  # rhel6.2GA w/ 6.6GA kernel
time    min:09.98 max:13.54 samples:50 average: 10.90

cputime min:04.91 max:06.24 samples:37 average:  5.45  # rhel6.2GA, upgraded to 6.6GA kernel + 6.6 glibc + reboot
time    min:20.88 max:24.04 samples:37 average: 22.06


### results for a physical system, hp-dl165g7-01.rhts.eng.bos.redhat.com

cputime min:03.51 max:03.93 samples:24 average:  3.76  # rhel5.6GA
time    min:21.91 max:22.32 samples:24 average: 22.13

cputime min:05.97 max:06.53 samples:15 average:  6.18  # rhel6.2GA
time    min:20.78 max:21.08 samples:15 average: 20.95

cputime min:13.86 max:14.12 samples:10 average: 13.96  # rhel6.6GA
time    min:43.67 max:43.95 samples:10 average: 43.83

Comment 11 Prarit Bhargava 2015-07-14 12:26:39 UTC
> 
> cputime min:02.24 max:03.00 samples:50 average:  2.58  # rhel6.2GA
> time    min:09.90 max:11.50 samples:50 average: 10.40
> 
> cputime min:02.52 max:03.25 samples:50 average:  2.66  # rhel6.2GA w/ 6.4GA
> kernel
> time    min:09.91 max:12.82 samples:50 average: 10.44
> 
> cputime min:02.58 max:03.63 samples:50 average:  2.82  # rhel6.2GA w/ 6.6GA
> kernel
> time    min:09.98 max:13.54 samples:50 average: 10.90
> 
> cputime min:04.91 max:06.24 samples:37 average:  5.45  # rhel6.2GA, upgraded
> to 6.6GA kernel + 6.6 glibc + reboot
> time    min:20.88 max:24.04 samples:37 average: 22.06
> 
>

So it has crept up a little bit due to the kernel, but the big problem here appears to be glibc.  I'm reassigning this to the glibc component.

P.

Comment 12 Christian Horn 2015-07-14 12:31:06 UTC
Bisecting the RHEL6 minor versions and then packages leading to the issue, it turns out that RHEL6.6 with the RHEL6.2 pam package executes the script as fast as pure RHEL6.2 does.

Upgrading/downgrading the pam package, logging out/in via SSH, and then executing the testcode switches between the 2 behaviours.

Although this is down to "something with pam", I am still unsure how it would influence the syscall time.  Comparing "strace -fc .." of running the testcode with old vs. new pam package installed:
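
The up/downgrade cycle is roughly the following (a sketch; yum downgrade is assumed to be available on the box):

yum downgrade pam      # switch to the older pam package
# log out and back in via SSH, then run: /usr/bin/time ./new3.sh
yum update pam         # switch back to the newer pam package
# log out/in again and re-run the testcode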

new/problematic pam package installed:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 66.07    0.491853          12     39999      9999 wait4
 12.13    0.090276           1    130037      9997 read
  4.72    0.035118           0    310035           mmap
  2.51    0.018678           0    230042         4 close

old/fast pam package installed:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.61    0.205347           5     39999      9998 wait4
 13.82    0.060866           0    130037      9997 read
  9.06    0.039923           0    310035           mmap
  8.70    0.038331           0    320031           rt_sigaction
  4.39    0.019324           0    230042         4 close
[...]

Comment 13 Christian Horn 2015-07-14 12:32:15 UTC
Thanks prarit.
Pure glibc and/or kernel changes have now indeed turned out not to influence the issue.

Comment 14 Christian Horn 2015-07-15 13:06:49 UTC
Stripped-down reproducer:

1. Set up a rhel6.4 system or newer; this can be a KVM guest
2. $ cat >new3.sh<<'EOT'
#!/bin/sh

i=10000;

until test $i -eq 1; do

        MYVAR=$(date)
        /bin/sh &

        i=$((i-1));

done

MYPID=$$
top -p $MYPID -n1 |tail -2|head -1
EOT

3. chmod +x new3.sh
4. /usr/bin/time ./new3.sh

Actual results:
11847 root      20   0  104m 2052 1052 S  0.0  0.4   0:04.23 new3.sh 
7.41user 5.61system 0:16.39elapsed 79%CPU (0avgtext+0avgdata 8208maxresident)k
0inputs+0outputs (0major+8817338minor)pagefaults 0swaps
=> so "4.23" showed in tops cpu-column
=> overall runtime 16sec
This is with pam-1.1.1-13.el6.x86_64 or pam-1.1.1-20.el6.x86_64

Expected results:
31887 root      20   0  103m 1476 1052 S  0.0  0.3   0:02.68 new3.sh
1.24user 1.59system 0:09.98elapsed 28%CPU (0avgtext+0avgdata 5904maxresident)k
0inputs+0outputs (0major+7938250minor)pagefaults 0swaps
=> so "2.68" showed in tops cpu-column
=> overall runtime 9.98sec
This is with pam downgraded to pam-1.1.1-10.el6.x86_64 (and logout/login)

Additional info:
- The originally reported issue was for the OSWatcher script, used for monitoring purposes.  The code above is a version simplified to the parts which seem performance-relevant.
- Besides looking at overall performance, it was noticed that overall "cpu accounting" as shown in top appears higher for rhel6.  We assumed this accounting to be a variant of bz812148, but none of the countermeasures reported to help in other cases helped here.

Comment 15 Christian Horn 2015-07-15 14:43:43 UTC
The issue is
  root       soft    nproc     unlimited
which we have in pam-1.1.1-12.el6 and later.  From changelog:

* Fri Oct 12 2012 Tomas Mraz <tmraz> 1.1.1-12
[..]
- relax soft nproc limit for root in 90-nproc.conf

So deploying a 6.6.z system, the testcode
runs 19.00 sec on average (variations between 17.40 and 20.37 seconds).
With the nproc line removed (as it is in pam-1.1.1-10.el6), it
runs 12.21 sec on average (variations between 11.66 and 13.74 seconds).

This also affects rhel7 btw.
I have no idea how to investigate this deeper.
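
A quick way to check which variant is active, and to flip the limit for a single shell without touching packages:

grep -v '^#' /etc/security/limits.d/90-nproc.conf   # the nproc lines shipped by pam
ulimit -u                                           # effective soft "max user processes" limit
ulimit -S -u 1024                                   # old rhel6.2-style value, for this shell only
/usr/bin/time ./new3.sh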

Comment 16 Carlos O'Donell 2015-07-16 00:35:37 UTC
(In reply to Christian Horn from comment #15)
> The issue is
>   root       soft    nproc     unlimited
> which we have in pam-1.1.1-12.el6 and later.  From changelog:
> 
> * Fri Oct 12 2012 Tomas Mraz <tmraz> 1.1.1-12
> [..]
> - relax soft nproc limit for root in 90-nproc.conf
> 
> So deploying a 6.6.z system, the testcode
> runs 19.00 sec on average (variations between 17.40 and 20.37 seconds).
> With the nproc line removed (as it is in pam-1.1.1-10.el6), it
> runs 12.21 sec on average (variations between 11.66 and 13.74 seconds).
> 
> This also affects rhel7 btw.
> I have no idea how to investigate this deeper.

I suggest starting by asking the pam team to explain why they changed it and to review the performance issues.

Comment 17 Tomas Mraz 2015-07-16 08:39:31 UTC
We changed the limit because the previous limit was too restrictive for normal use-cases.

What happens if the file is removed altogether - i.e. there is no nproc setting in pam configuration at all?

It should probably be reassigned to the kernel, as the source of the degradation is not pam (it only made a configuration change to process limits), and I have no idea how to investigate what happens in the kernel for various nproc settings.

Comment 18 Christian Horn 2015-07-17 06:12:44 UTC
Thanks for the comment.

The slow/fast behaviour can be switched by keeping the "root soft nproc unlimited" line vs. commenting it out or deleting it.

Comment 19 Tomas Mraz 2015-07-17 08:18:53 UTC
But I would like to know what happens if you remove both nproc lines from the file. Is the performance the same as if only the root line is removed?

Comment 20 Christian Horn 2015-07-17 15:18:40 UTC
Removing the file completely results in bad performance:

# Just "*          soft    nproc     1024" in the file:
cputime min:02.39 max:02.69 samples:8 average: 2.51
time    min:09.08 max:09.99 samples:8 average: 9.52

# having file /etc/security/limits.d/90-nproc.conf removed
cputime min:03.68 max:04.06 samples:11 average:  3.80
time    min:14.65 max:16.05 samples:11 average: 15.01

# having both lines, as shipped with pam-1.1.1-12.el6 and later
cputime min:03.68 max:04.13 samples:12 average:  3.91
time    min:14.60 max:16.05 samples:12 average: 15.30

Comment 21 Tomas Mraz 2015-07-17 15:36:01 UTC
Unfortunately that means we can hardly do anything about the regression in PAM. Limiting root to nproc = 1024 again is not an option.

Comment 22 Christian Horn 2015-07-20 15:07:43 UTC
I verified that I see the fast behaviour in a "ulimit -u 1024" environment and the slow behaviour in a "ulimit -u 7810" environment; I will run perf with both.

Comment 23 Tomas Mraz 2015-07-21 09:21:34 UTC
Not sure what to do further with the issue, reassigning to kernel for consideration.

Comment 24 Christian Horn 2015-07-21 14:08:22 UTC
Created attachment 1054410 [details]
fast situation: perf record -o perf_ulimit_u_1024 ./new3.sh

# changing ulimit
ulimit -u 1024
# verifying we have the fast behaviour
/usr/bin/time ./new3.sh
# running perf
perf record -o perf_ulimit_u_1024 ./new3.sh

Comment 25 Christian Horn 2015-07-21 14:10:18 UTC
Created attachment 1054412 [details]
slow situation: perf record -o perf_ulimit_u_7810 ./new3.sh

# changing ulimit
ulimit -u 7810
# verifying we have the slow behaviour
/usr/bin/time ./new3.sh
# running perf
perf record -o perf_ulimit_u_7810 ./new3.sh

Comment 26 Christian Horn 2015-07-21 14:11:53 UTC
Attached are perf outputs from the "ulimit -u 1024" and "-u 7810" environments.
Does this help pin the issue down to the kernel, bash, glibc, or something else?

Comment 27 Christian Horn 2015-07-28 10:50:35 UTC
Hello kernel specialists, we have the issue pinned down to the following.

[root@rhel6u5a test_shell_kvm]# cat new3.sh 
#!/bin/sh
i=10000;
until test $i -eq 1; do
        MYVAR=$(date)
        /bin/sh &
        i=$((i-1));
done
MYPID=$$
top -p $MYPID -n1 |tail -2|head -1
[root@rhel6u5a test_shell_kvm]# ulimit -u 7810
[root@rhel6u5a test_shell_kvm]# ulimit -u
7810
[root@rhel6u5a test_shell_kvm]# /usr/bin/time ./new3.sh 
29017 root      20   0  105m 2756 1068 S  0.0  0.3   0:04.85 new3.sh
9.35user 5.77system 0:19.51elapsed 77%CPU (0avgtext+0avgdata 11024maxresident)k
[root@rhel6u5a test_shell_kvm]# ulimit -u 1024
[root@rhel6u5a test_shell_kvm]# ulimit -u
1024
[root@rhel6u5a test_shell_kvm]# /usr/bin/time ./new3.sh 
11801 root      20   0  103m 1480 1068 S  0.0  0.1   0:02.56 new3.sh
1.12user 1.59system 0:09.49elapsed 28%CPU (0avgtext+0avgdata 5920maxresident)k
0inputs+0outputs (0major+8064654minor)pagefaults 0swaps

- performance of the code is much worse with increased "max processes"
- the "-u 1024" was our default on rhel6.2 and earlier, the "7810" is default on i.e. rhel6.5
- #24 has a perf output of the "ulimit -u 1024" case
- #25 has a perf output of the "ulimit -u 7810" case

Could you help find out what is causing this?
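
If it helps, the two recordings from #24/#25 should be comparable side by side with perf diff (assuming the perf build on the box ships that subcommand):

perf diff perf_ulimit_u_1024 perf_ulimit_u_7810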

Comment 28 Christian Horn 2015-08-05 13:22:01 UTC
perf outputs for the script from #27, to compare both environments:

$ ulimit -u 1024
$ perf record -a -o perf_a_1024 ./new3.sh
$ perf report -k /boot/vmlinuz-2.6.32-431.59.1.el6.x86_64.debug \
    --show-total-period -i perf_a_1024
[..]
 22.66%         19626         date  [kernel.kallsyms]   [k] 0xffffffff810a0440
 20.77%         17992           sh  [kernel.kallsyms]   [k] 0xffffffff8104fc30
  7.43%          6434      new3.sh  [kernel.kallsyms]   [k] 0xffffffff8129520f
  6.70%          5806         init  [kernel.kallsyms]   [k] 0xffffffff8103eacb
  6.50%          5635      swapper  [kernel.kallsyms]   [k] 0xffffffff8103eacb
  3.33%          2882      new3.sh  bash                [.] 0x0000000000067959
  2.43%          2104           sh  libc-2.12.so        [.] __mbrtowc
  2.29%          1987         date  bash                [.] 0x000000000003e713
[..]

$ ulimit -u 7810
$ perf record -a -o perf_a_7810 ./new3.sh
$ perf report -k /boot/vmlinuz-2.6.32-431.59.1.el6.x86_64.debug \
    --show-total-period -i perf_a_7810
[..]
 19.15%         32403         date  [kernel.kallsyms]   [k] 0xffffffff8114da12
 14.51%         24551         init  [kernel.kallsyms]   [k] 0xffffffff8103eacb
 13.86%         23446      swapper  [kernel.kallsyms]   [k] 0xffffffff8103eacb
 12.42%         21015           sh  [kernel.kallsyms]   [k] 0xffffffff8112d20a
  6.14%         10396      new3.sh  bash                [.] 0x000000000003e660
  4.64%          7848         date  bash                [.] 0x000000000003e713
  4.26%          7201      new3.sh  [kernel.kallsyms]   [k] 0xffffffff81153a36
  3.90%          6596         date  libc-2.12.so        [.] _int_free
  2.00%          3378         date  bash                [.] delete_job
  1.65%          2800         date  libc-2.12.so        [.] free
  1.26%          2134           sh  libc-2.12.so        [.] __mbrtowc

Any ideas as to the reason for this?

Comment 29 Christian Horn 2015-08-13 12:31:01 UTC
Performance does not get worse abruptly but gradually.  Setting "ulimit -u" to increasing values and measuring the code runtime:

$ for i in 16 32 64 128 256 512 768 1024 2048 4096 8192 16284; do ulimit -u $i ; echo -ne "$i : ";/usr/bin/time ./new4.sh 2>&1|grep elaps|sed -e 's,elaps.*,,' -e 's,.* ,,'; done
16 : 0:08.11
32 : 0:06.80
64 : 0:07.52
128 : 0:07.11
256 : 0:08.52
512 : 0:08.83
768 : 0:08.89
1024 : 0:09.64  <-- this is the default we shipped rhel6.0 with
2048 : 0:12.25
4096 : 0:16.47
8192 : 0:20.32  <-- this is roughly what we have in rhel6.3+
16284 : 0:21.04

Comment 30 Christian Horn 2015-08-13 12:50:03 UTC
On rhel5, I have identical execution times for the code.  For "ulimit -u" ranging from 16 to 16284, I get execution times between 12.36 and 12.62 seconds.

RHEL7 has the same issue as rhel6, output quite similar to #29.

Comment 31 Milos Vyletel 2015-08-13 14:01:35 UTC
    rhel6 data

    maxproc 1024

    -   16.16%     0.54%  new3.sh  bash                [.] make_child
         make_child
    -   12.28%     0.23%  new3.sh  libc-2.12.so        [.] __libc_fork
       - __libc_fork
            100.00% make_child
    +    3.96%     3.89%       sh  libc-2.12.so        [.] __gconv_transform_utf8_internal
    +    3.65%     3.65%       sh  [kernel.kallsyms]   [k] __do_page_fault
    +    3.42%     0.23%       sh  ld-2.12.so          [.] _dl_sysdep_start
    +    3.11%     3.11%     date  [kernel.kallsyms]   [k] __do_page_fault
    +    2.95%     0.31%       sh  ld-2.12.so          [.] dl_main
    +    2.80%     2.80%  new3.sh  [kernel.kallsyms]   [k] __do_page_fault
    +    2.72%     2.72%  new3.sh  [kernel.kallsyms]   [k] _spin_unlock_irqrestore
    +    2.49%     0.16%     date  ld-2.12.so          [.] _dl_sysdep_start
    +    2.33%     2.33%       sh  libc-2.12.so        [.] __mbrtowc
    +    2.18%     0.08%  new3.sh  libc-2.12.so        [.] __xstat64
    +    2.10%     0.00%  new3.sh  libc-2.12.so        [.] __execve
    +    2.10%     0.70%       sh  ld-2.12.so          [.] _dl_relocate_object
    +    2.02%     2.02%  new3.sh  [kernel.kallsyms]   [k] kmem_cache_alloc
    +    1.63%     1.40%       sh  libc-2.12.so        [.] _int_malloc
    +    1.48%     0.08%     date  ld-2.12.so          [.] dl_main
    +    1.48%     1.48%  new3.sh  [kernel.kallsyms]   [k] get_page_from_freelist
    +    1.40%     0.47%  new3.sh  bash                [.] hash_search
    +    1.40%     0.00%  new3.sh  libc-2.12.so        [.] __sigprocmask
    +    1.40%     1.40%  new3.sh  [kernel.kallsyms]   [k] flush_tlb_others_ipi
    +    1.40%     0.23%  new3.sh  libc-2.12.so        [.] ptmalloc_unlock_all
    +    1.32%     1.24%       sh  ld-2.12.so          [.] do_lookup_x
    +    1.32%     0.78%  new3.sh  libc-2.12.so        [.] _int_malloc

    maxproc 7180

    -    7.73%     0.06%  new3.sh  bash                [.] make_child
         make_child
    -    5.95%     0.00%  new3.sh  libc-2.12.so        [.] __libc_fork
       - __libc_fork
            100.00% make_child
    +    5.46%     5.46%       sh  [kernel.kallsyms]   [k] __do_page_fault
    +    4.97%     4.90%       sh  libc-2.12.so        [.] __gconv_transform_utf8_internal
    +    4.90%     0.31%       sh  ld-2.12.so          [.] _dl_sysdep_start
    +    4.72%     4.41%       sh  libc-2.12.so        [.] __mbrtowc
    +    3.74%     0.06%       sh  ld-2.12.so          [.] dl_main
    +    3.13%     3.13%     date  [kernel.kallsyms]   [k] __do_page_fault
    +    3.00%     0.12%     date  ld-2.12.so          [.] _dl_sysdep_start
    +    2.82%     2.82%  new3.sh  [kernel.kallsyms]   [k] __do_page_fault
    +    2.27%     2.15%       sh  ld-2.12.so          [.] do_lookup_x
    +    2.27%     0.86%       sh  ld-2.12.so          [.] _dl_relocate_object
    +    2.02%     0.00%  new3.sh  libc-2.12.so        [.] __xstat64
    +    1.84%     0.12%     date  ld-2.12.so          [.] dl_main
    +    1.66%     0.00%       sh  ld-2.12.so          [.] _dl_map_object
    +    1.53%     0.12%     date  ld-2.12.so          [.] _dl_map_object
    +    1.41%     0.00%       sh  libc-2.12.so        [.] __gconv_find_transform
    +    1.41%     0.06%     date  libc-2.12.so        [.] setlocale
    +    1.29%     1.23%     date  libc-2.12.so        [.] _dl_addr
    +    1.29%     0.00%     date  ld-2.12.so          [.] mmap64
    +    1.29%     0.00%       sh  ld-2.12.so          [.] mmap64
    +    1.29%     0.43%     date  ld-2.12.so          [.] _dl_relocate_object
    +    1.10%     0.80%       sh  ld-2.12.so          [.] check_match.12445
    +    1.10%     0.00%     date  libc-2.12.so        [.] _nl_find_locale
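
For reference, a call-graph view like the above can be reproduced with something along these lines (a sketch; the exact options used may have differed):

ulimit -u 1024                        # and the higher limit for the second recording
perf record -g -o perf_cg ./new3.sh
perf report -g -i perf_cg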

Comment 32 Milos Vyletel 2015-08-13 14:13:18 UTC
tcsh on both rhel6 and rhel7 does not have the same problem, so this may not be related to the kernel but rather to bash (just a theory).

[root@mv-rhel6 ~]# cat new4.sh
#!/bin/tcsh
foreach i (`seq 1 10000`)
           set myvar = `date`
           /bin/tcsh &
end
[root@mv-rhel6 ~]# cat run.sh 
#!/bin/tcsh
set limit = 0
foreach u (`seq 1 7`)
    @ limit = ($u * 1024)
    limit maxproc $limit
    limit maxproc
    time ./new4.sh >&/dev/null
end

rhel7

[root@mv-rhel7 ~]# ./run.sh 
maxproc      1024 
173.945u 238.762s 2:10.34 316.6%    0+0k 0+239696io 0pf+0w
maxproc      2048 
173.732u 238.793s 2:12.82 310.5%    0+0k 0+239624io 0pf+0w
maxproc      3072 
174.404u 244.343s 2:16.82 306.0%    0+0k 0+239672io 0pf+0w
maxproc      4096 
173.690u 241.411s 2:13.47 311.0%    0+0k 0+239720io 0pf+0w
maxproc      5120 
173.450u 234.367s 2:15.32 301.3%    0+0k 0+239544io 0pf+0w
maxproc      6144 
175.205u 236.809s 2:15.85 303.2%    0+0k 0+239720io 0pf+0w
maxproc      7168 
173.465u 232.450s 2:11.56 308.5%    0+0k 0+239552io 0pf+0w

rhel6

[root@mv-rhel6 ~]# ./run.sh 
maxproc      1024 
183.297u 113.916s 1:24.53 351.5%    0+0k 0+0io 0pf+0w
maxproc      2048 
187.895u 117.672s 1:26.84 351.8%    0+0k 0+0io 0pf+0w
maxproc      3072 
171.392u 102.928s 1:18.47 349.5%    0+0k 0+0io 0pf+0w
maxproc      4096 
173.254u 104.116s 1:19.34 349.5%    0+0k 0+0io 0pf+0w
maxproc      5120 
171.319u 102.730s 1:18.53 348.9%    0+0k 0+0io 0pf+0w
maxproc      6144 
179.279u 109.541s 1:22.44 350.3%    0+0k 0+0io 0pf+0w
maxproc      7168 
167.265u 99.288s 1:16.38 348.9%    0+0k 0+0io 0pf+0w

Comment 33 Christian Horn 2015-08-14 10:54:58 UTC
Recompiling bash-3.2 on rhel6, I see the same behaviour as for the native rhel6-bash we ship:

$ for i in 16 32 64 128 256 512 768 1024 2048 4096 8192 16284; do ulimit -u $i ; echo -ne "$i : ";/usr/bin/time ./new4_bash3.2.sh 2>&1|grep elaps|sed -e 's,elaps.*,,' -e 's,.* ,,'; done
16 : 0:07.35
32 : 0:09.41
64 : 0:09.49
128 : 0:07.33
256 : 0:09.20
512 : 0:08.65
768 : 0:09.54
1024 : 0:10.40
2048 : 0:16.25
4096 : 0:22.31
8192 : 0:28.86
16284 : 0:28.68

=> So it's probably not bash, but the kernel or glibc.  Could someone please suggest how best to proceed here to pin this down?  I think this is a degradation which might already be hitting many customers.

Comment 34 Tomas Mraz 2015-08-14 11:10:34 UTC
The reason why the performance is good on RHEL-6 with a limited nproc is that the script just runs thousands of sh child processes in parallel. A low nproc limit caps the number of child processes, which of course improves the run time of the script because fewer processes get to run. But the real reason why the startup is slower on RHEL-6 than on RHEL-5 is unknown.

Comment 35 Christian Horn 2015-08-17 11:08:43 UTC
Do you mean processes placed in the background?  That is not the case.  To verify, I was running
  $ while :; do ps ax|wc -l >>process_counter_1024.txt; sleep .1; done
while running the testcode.  In both cases the code only runs a few additional processes, nowhere near the 'ulimit -u' value:

[root@rhel6u5a ~]# sort process_counter_1024.txt|uniq -c
     51 106
     62 108
     26 109
      4 111
[root@rhel6u5a ~]# sort process_counter_7810.txt|uniq -c
     49 106
    148 108
     41 109
      4 111
[root@rhel6u5a ~]#

Comment 36 Rodrigo A B Freire 2015-08-26 17:34:36 UTC
I was discussing this with a Sr. Engineer friend of mine, and the following question was brought up:

Did we test how RHEL5 behaves with the higher limits?

- RF.

Comment 37 Christian Horn 2015-08-26 20:34:01 UTC
rhel6.5  "ulimit -u 7810" average runtime over  4 runs: 21.4sec
rhel5.11 "ulimit -u 7810" average runtime over  9 runs: 12.7sec
rhel6.5  "ulimit -u 1024" average runtime over 12 runs: 10.2sec
rhel5.11 "ulimit -u 1024" average runtime over  5 runs: 12.6sec

Comment 38 Tomas Mraz 2015-08-27 07:30:51 UTC
To isolate whether the problem is in the kernel, the shell, or glibc, it would be really useful to have a reproducer that uses C code and not a shell.

Comment 39 Christian Horn 2015-08-27 07:40:30 UTC
(In reply to Tomas Mraz from comment #38)
> To isolate whether the problem is in the kernel, the shell, or glibc, it would
> be really useful to have a reproducer that uses C code and not a shell.
Thanks for pointing out a possible direction, but at least I have no idea how to come up with C code that also shows the issue.

As far as I understand, such C code, showing the same behaviour, would probably perform the same interactions with glibc and the kernel (or ideally a subset, which still shows the performance degradation).

Comment 40 Christian Horn 2015-08-27 08:50:59 UTC
The canonical C code one would think of executes equally fast in both environments:
----
#include <stdio.h>
#include <stdlib.h>

int main(void) {
        int i, myvar;
        for (i = 1; i < 2000; i++) {
                myvar = system("date >/dev/null");   /* mimics MYVAR=$(date) */
                (void)myvar;                         /* return value intentionally ignored */
                system("/bin/sh &");                 /* mimics the backgrounded /bin/sh */
        }
        return 0;
}
----

Comment 41 Tomas Mraz 2015-08-27 09:24:43 UTC
It then really looks like a bash issue and not a kernel or glibc one.

Comment 42 Thomas Gardner 2015-09-01 15:02:18 UTC
Christian,

Let's not give up quite so quickly.  Have you tried your reproducer in ksh and ash and (PickYourFavorite)sh yet?  That seems like the next most logical step.  If any other shells have the same problem, maybe we should look at your C reproducer.

Also, have we profiled the shell with this running to see if there's one thing it's getting stuck in?  This has been going on long enough, I forget what's been tried and what hasn't.

Comment 43 Christian Horn 2015-09-09 12:02:30 UTC
Thanks Tom.

Runtimes for the code with other shells:

# code runtimes w/ 'ulimit -u 1024' environment
bash: min:09.57 max:10.47 samples:20 average: 9.77
ksh:  min:07.34 max:09.28 samples:20 average: 8.22
dash: min:07.47 max:09.24 samples:20 average: 8.21
ash:  min:07.21 max:08.91 samples:20 average: 8.04

# code runtimes w/ 'ulimit -u 7810' environment
bash: min:19.59 max:22.22 samples:20 average: 20.10
ksh:  min:09.02 max:11.43 samples:20 average: 10.01
dash: min:07.38 max:09.31 samples:20 average: 7.85
ash:  min:07.15 max:08.99 samples:20 average: 7.83

=> bash is drastically slower in the "-u 7810" environment
=> ksh is a bit slower, dash and ash/busybox seem not affected
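
A wrapper along these lines reproduces the per-shell runs (a sketch, not the exact harness; each shell overrides the #!/bin/sh shebang of the reproducer):

for shell in bash ksh dash ash; do
    echo "== $shell =="
    /usr/bin/time "$shell" ./new3.sh >/dev/null
done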

> Also, have we profiled the shell with this running to see
> if there's one thing it's getting stuck in?
I do not think so, and I am not able to do that/sure what you refer to.  Any bare metal system or virtual guest can be used to see the issue.

Comment 44 Thomas Gardner 2015-09-10 16:23:01 UTC
(In reply to Christian Horn from comment #43)
> Thanks Tom.
> 
> Runtimes for the code with other shells:
> 
> [...]

Now there's some good data right there, I tell you what.

Will the RHEL6 bash run on RHEL5 or vice versa?  If so, I'd love to
see what happens when you do the same stuff with the RHEL6 bash on a
RHEL5 system (and to a lesser extent a RHEL5 bash on a RHEL6 system,
but the other way around is probably more interesting).  It's good to
see that whatever the RHEL6 bash is doing will bring the problem out
(hopefully that will help find the problem), but if the RHEL6 bash
only does it on a RHEL6 system and not when placed on a RHEL5 system,
it's probably not bash causing it.  In that case, bash is probably
just shedding light on whatever the real problem is in RHEL6 in ways
that the other shells don't.

I'll try to fool with that if I can get to it, but I thought I'd put
it out there in case someone else can get to it first.

> > Also, have we profiled the shell with this running to see
> > if there's one thing it's getting stuck in?
> I do not think so, and I am not able to do that/sure what you
> refer to.  Any bare metal system or virtual guest can be used to
> see the issue.

OK, I will try to figure out how to do that.  I remember it's not
terribly difficult, but I don't remember how to do it off the top
of my head.
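
Probably something along these lines would do it, assuming the perf on these
boxes accepts a target pid plus a bounding workload:

./new3.sh >/dev/null &
perf record -p $! -o perf_shell_only -- sleep 30   # sample only the driving shell for ~30s
perf report -i perf_shell_only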

Comment 45 Milos Vyletel 2015-10-02 12:43:36 UTC
Ran a couple more tests on my laptop, so I tested the recreator on

fedora 22; 4.3.0-0.rc3.git1.1.fc24.x86_64; bash-4.3.42-1.fc22.x86_64
rhel5; 2.6.18-164.9.1.el5; bash-3.2-33.el5_11.4
rhel6; 2.6.32-573.3.1.el6.x86_64; bash-4.1.2-33.el6.x86_64
rhel7; 3.10.0-302.el7.test.x86_64; bash-4.2.46-19.el7.x86_64

which is the laptop (f22) and VMs with 4 vcpus / 4G each

The noticeable difference is in the number of pagefaults that occur during the run
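
The per-limit numbers below were produced with a driver loop of roughly this shape (sketch):

for u in 1024 2048 3072 4096 5120 6144 7168; do
    ulimit -u $u
    ulimit -u
    /usr/bin/time ./new3.sh
done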

rhel5
1024
13907 root      23   0 64112 1352  932 S  0.0  0.0   0:04.65 new3.sh                           
13.91user 19.63system 0:21.91elapsed 153%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7478505minor)pagefaults 0swaps
2048
 1459 root      25   0 64112 1352  932 S  0.0  0.0   0:04.45 new3.sh                           
13.83user 18.74system 0:21.03elapsed 154%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7476621minor)pagefaults 0swaps
3072
21489 root      25   0 64112 1352  932 S  0.0  0.0   0:04.48 new3.sh                           
13.93user 19.06system 0:21.31elapsed 154%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7494993minor)pagefaults 0swaps
4096
 9068 root      25   0 64112 1356  932 S  0.0  0.0   0:04.57 new3.sh                           
14.08user 19.10system 0:21.61elapsed 153%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7467911minor)pagefaults 0swaps
5120
29071 root      20   0 64112 1356  932 S  0.0  0.0   0:04.28 new3.sh                           
13.73user 18.78system 0:21.13elapsed 153%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7496351minor)pagefaults 0swaps
6144
16650 root      25   0 64112 1356  932 S  0.0  0.0   0:04.37 new3.sh                           
13.63user 19.13system 0:21.23elapsed 154%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7468460minor)pagefaults 0swaps
7168
 4229 root      24   0 64112 1356  932 S  0.0  0.0   0:04.39 new3.sh                           
13.65user 19.04system 0:21.13elapsed 154%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7492852minor)pagefaults 0swaps

rhel6
1024
 1473 root      20   0  103m 1632 1168 S  0.0  0.2   0:07.93 new3.sh                           
17.14user 21.45system 0:26.06elapsed 148%CPU (0avgtext+0avgdata 1632maxresident)k
576inputs+0outputs (4major+8998255minor)pagefaults 0swaps
2048
21476 root      20   0  103m 1840 1168 S  0.0  0.2   0:08.64 new3.sh                           
20.49user 22.92system 0:29.68elapsed 146%CPU (0avgtext+0avgdata 1840maxresident)k
0inputs+0outputs (0major+9373339minor)pagefaults 0swaps
3072
 9070 root      20   0  104m 2076 1168 S  0.0  0.2   0:10.59 new3.sh                           
25.16user 26.97system 0:35.08elapsed 148%CPU (0avgtext+0avgdata 2076maxresident)k
0inputs+0outputs (0major+9721730minor)pagefaults 0swaps
4096
29081 root      20   0  104m 2228 1168 S  0.0  0.3   0:10.90 new3.sh                           
26.09user 26.09system 0:36.83elapsed 141%CPU (0avgtext+0avgdata 2228maxresident)k
0inputs+0outputs (0major+9975488minor)pagefaults 0swaps
5120
16683 root      20   0  104m 2432 1168 S  0.0  0.3   0:11.22 new3.sh                           
27.40user 25.96system 0:38.13elapsed 139%CPU (0avgtext+0avgdata 2432maxresident)k
0inputs+0outputs (0major+10152760minor)pagefaults 0swaps
6144
 4257 root      20   0  104m 2588 1168 S  0.0  0.3   0:11.49 new3.sh                           
28.04user 26.82system 0:39.00elapsed 140%CPU (0avgtext+0avgdata 2588maxresident)k
0inputs+0outputs (0major+10298784minor)pagefaults 0swaps
7168
24260 root      20   0  104m 2760 1168 S  0.0  0.3   0:13.54 new3.sh                           
31.27user 32.36system 0:44.71elapsed 142%CPU (0avgtext+0avgdata 2760maxresident)k
0inputs+0outputs (0major+10423502minor)pagefaults 0swaps

rhel7
1024
16.24user 42.79system 0:38.68elapsed 152%CPU (0avgtext+0avgdata 1904maxresident)k
0inputs+0outputs (0major+8906379minor)pagefaults 0swaps
2048
19.36user 34.74system 0:31.94elapsed 169%CPU (0avgtext+0avgdata 1916maxresident)k
0inputs+0outputs (0major+9319956minor)pagefaults 0swaps
3072
22.79user 35.96system 0:35.67elapsed 164%CPU (0avgtext+0avgdata 2132maxresident)k
0inputs+0outputs (0major+9683267minor)pagefaults 0swaps
4096
24.31user 35.85system 0:36.90elapsed 163%CPU (0avgtext+0avgdata 2300maxresident)k
0inputs+0outputs (0major+9835529minor)pagefaults 0swaps
5120
26.87user 40.50system 0:44.15elapsed 152%CPU (0avgtext+0avgdata 2492maxresident)k
0inputs+0outputs (0major+10087698minor)pagefaults 0swaps
6144
27.02user 37.45system 0:40.90elapsed 157%CPU (0avgtext+0avgdata 2684maxresident)k
0inputs+0outputs (0major+10227271minor)pagefaults 0swaps
7168
27.85user 38.15system 0:42.63elapsed 154%CPU (0avgtext+0avgdata 2844maxresident)k
0inputs+0outputs (0major+10327962minor)pagefaults 0swaps

f22
1024
13.00user 19.95system 0:16.55elapsed 199%CPU (0avgtext+0avgdata 4676maxresident)k
0inputs+0outputs (0major+3901010minor)pagefaults 0swaps
2048
15.61user 21.58system 0:20.48elapsed 181%CPU (0avgtext+0avgdata 4704maxresident)k
0inputs+0outputs (0major+4332896minor)pagefaults 0swaps
3072
18.24user 22.53system 0:23.79elapsed 171%CPU (0avgtext+0avgdata 4708maxresident)k
0inputs+0outputs (0major+4700756minor)pagefaults 0swaps
4096
19.02user 19.07system 0:22.31elapsed 170%CPU (0avgtext+0avgdata 4640maxresident)k
0inputs+0outputs (0major+4928832minor)pagefaults 0swaps
5120
20.71user 20.89system 0:25.10elapsed 165%CPU (0avgtext+0avgdata 4640maxresident)k
0inputs+0outputs (0major+5106694minor)pagefaults 0swaps
6144
21.10user 21.88system 0:25.41elapsed 169%CPU (0avgtext+0avgdata 4708maxresident)k
0inputs+0outputs (0major+5234781minor)pagefaults 0swaps
7168
22.42user 22.80system 0:27.29elapsed 165%CPU (0avgtext+0avgdata 4704maxresident)k
0inputs+0outputs (0major+5307505minor)pagefaults 0swaps

As seen above, f22 behaves slightly better than rhel6/7 but is not as consistent
as rhel5. The number of interrupts is more or less the same on rhel5 but increases
with the limit on rhel6+, with rhel7 having the lowest among them, so I'd say some
kernel optimization kicked in and helped a bit.

I'm building a 4.3.0-rc3 kernel for rhel7 and will see if that improves things

Comment 46 Milos Vyletel 2015-10-02 13:35:27 UTC
While waiting for the kernel build I ran more tests on rhel7
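
For clarity, the loop-body variants referred to below are roughly as follows; the surrounding "until test $i -eq 1" loop of new3.sh stays the same:

# "without MYVAR=$(date)":   loop body is only   /bin/sh &
# "without bash fork":       loop body is only   MYVAR=$(date)
# "myvar=$(date); date &":   loop body is        MYVAR=$(date); date &
# "date &; date &":          loop body is        date & date &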

recreator without MYVAR=$(date): consistent number of pagefaults and runtime
1024
7.10user 18.20system 0:07.04elapsed 359%CPU (0avgtext+0avgdata 1680maxresident)k
0inputs+0outputs (0major+4690456minor)pagefaults 0swaps
2048
7.43user 17.99system 0:07.17elapsed 354%CPU (0avgtext+0avgdata 1884maxresident)k
0inputs+0outputs (0major+4627062minor)pagefaults 0swaps
3072
8.00user 17.89system 0:07.36elapsed 351%CPU (0avgtext+0avgdata 2100maxresident)k
0inputs+0outputs (0major+4631613minor)pagefaults 0swaps
4096
8.46user 18.00system 0:07.34elapsed 360%CPU (0avgtext+0avgdata 2348maxresident)k
0inputs+0outputs (0major+4605916minor)pagefaults 0swaps
5120
9.00user 17.94system 0:07.59elapsed 354%CPU (0avgtext+0avgdata 2552maxresident)k
0inputs+0outputs (0major+4621905minor)pagefaults 0swaps
6144
8.31user 18.82system 0:07.70elapsed 352%CPU (0avgtext+0avgdata 2740maxresident)k
0inputs+0outputs (0major+4593514minor)pagefaults 0swaps
7168
9.31user 18.22system 0:07.81elapsed 352%CPU (0avgtext+0avgdata 2888maxresident)k
0inputs+0outputs (0major+4590491minor)pagefaults 0swaps


recreator without bash fork; consistent number of pagefaults and runtime
1024
2.87user 9.40system 0:11.89elapsed 103%CPU (0avgtext+0avgdata 1404maxresident)k
0inputs+0outputs (0major+3436580minor)pagefaults 0swaps
2048
2.79user 9.40system 0:11.86elapsed 102%CPU (0avgtext+0avgdata 1404maxresident)k
0inputs+0outputs (0major+3436529minor)pagefaults 0swaps
3072
2.91user 10.00system 0:12.61elapsed 102%CPU (0avgtext+0avgdata 1404maxresident)k
0inputs+0outputs (0major+3436563minor)pagefaults 0swaps
4096
2.87user 9.39system 0:11.80elapsed 103%CPU (0avgtext+0avgdata 1400maxresident)k
0inputs+0outputs (0major+3456548minor)pagefaults 0swaps
5120
2.80user 9.32system 0:11.71elapsed 103%CPU (0avgtext+0avgdata 1400maxresident)k
0inputs+0outputs (0major+3436635minor)pagefaults 0swaps
6144
2.92user 9.26system 0:11.65elapsed 104%CPU (0avgtext+0avgdata 1404maxresident)k
0inputs+0outputs (0major+3436567minor)pagefaults 0swaps
7168
2.91user 9.46system 0:12.05elapsed 102%CPU (0avgtext+0avgdata 1404maxresident)k
0inputs+0outputs (0major+3436560minor)pagefaults 0swaps


myvar=$(date); date &; pagefaults/runtime increasing
1024
10.17user 25.23system 0:22.30elapsed 158%CPU (0avgtext+0avgdata 1684maxresident)k
0inputs+0outputs (0major+7727301minor)pagefaults 0swaps
2048
12.98user 27.80system 0:27.27elapsed 149%CPU (0avgtext+0avgdata 1912maxresident)k
0inputs+0outputs (0major+8118169minor)pagefaults 0swaps
3072
15.42user 29.09system 0:30.52elapsed 145%CPU (0avgtext+0avgdata 2136maxresident)k
0inputs+0outputs (0major+8449217minor)pagefaults 0swaps
4096
17.56user 30.50system 0:33.99elapsed 141%CPU (0avgtext+0avgdata 2296maxresident)k
0inputs+0outputs (0major+8595903minor)pagefaults 0swaps
5120
19.66user 30.85system 0:35.99elapsed 140%CPU (0avgtext+0avgdata 2492maxresident)k
0inputs+0outputs (0major+8839141minor)pagefaults 0swaps
6144
20.95user 32.14system 0:38.45elapsed 138%CPU (0avgtext+0avgdata 2688maxresident)k
0inputs+0outputs (0major+9001505minor)pagefaults 0swaps
7168
21.25user 32.49system 0:39.03elapsed 137%CPU (0avgtext+0avgdata 2844maxresident)k
0inputs+0outputs (0major+9090931minor)pagefaults 0swaps

date &; date &; consistent pagefaults; runtime increased
1024
5.47user 26.63system 0:08.87elapsed 361%CPU (0avgtext+0avgdata 1668maxresident)k
0inputs+0outputs (0major+6931712minor)pagefaults 0swaps
7168
10.96user 27.08system 0:12.75elapsed 298%CPU (0avgtext+0avgdata 3028maxresident)k
0inputs+0outputs (0major+6735389minor)pagefaults 0swaps

Copying the 4.3 kernel over to my VM; will run some more tests.

Comment 47 Milos Vyletel 2015-10-06 09:45:16 UTC
rhel7+4.3 kernel behaves the same way as f22+4.3 kernel.

I've tested rhel5 with a newer kernel and, interestingly, pagefaults are still
consistent, and so is the runtime.

[root@mv-rhel5 ~]# uname -r
3.2.70-1.el5.elrepo

1024
11.58user 41.37system 0:38.85elapsed 136%CPU (0avgtext+0avgdata 5952maxresident)k
0inputs+0outputs (0major+7677270minor)pagefaults 0swaps
2048
hrtimer: interrupt took 26992516 ns
10.71user 38.15system 0:36.91elapsed 132%CPU (0avgtext+0avgdata 5936maxresident)k
0inputs+0outputs (0major+7669473minor)pagefaults 0swaps
3072
10.98user 38.06system 0:36.53elapsed 134%CPU (0avgtext+0avgdata 5952maxresident)k
0inputs+0outputs (0major+7672588minor)pagefaults 0swaps
4096
11.12user 38.94system 0:36.98elapsed 135%CPU (0avgtext+0avgdata 5952maxresident)k
0inputs+0outputs (0major+7668470minor)pagefaults 0swaps
5120
10.14user 35.57system 0:34.24elapsed 133%CPU (0avgtext+0avgdata 5936maxresident)k
0inputs+0outputs (0major+7722695minor)pagefaults 0swaps
6144
11.13user 39.05system 0:37.18elapsed 134%CPU (0avgtext+0avgdata 5952maxresident)k
0inputs+0outputs (0major+7660873minor)pagefaults 0swaps
7168
11.13user 40.48system 0:38.03elapsed 135%CPU (0avgtext+0avgdata 5936maxresident)k
0inputs+0outputs (0major+7709850minor)pagefaults 0swaps

At this point I really don't think this is in the kernel, and not in bash either, as
I've tested older bash on newer rhel and vice versa and the results are the same.

Going to take a look at glibc now.

Comment 48 Milos Vyletel 2015-10-06 12:56:31 UTC
Narrowed it down a bit more, to some change between fc7 and fc8. Booting a live CD
and running the test was the easiest and fastest way to test different glibc
versions.

[root@ftest ~]# uname -r
2.6.25-14.fc9.x86_64
[root@ftest ~]# rpm -q bash glibc kernel
bash-3.2-22.fc9.x86_64
glibc-2.8-3.x86_64
kernel-2.6.25-14.fc9.x86_64
[root@ftest ~]# cat new3.sh
#!/bin/sh
i=10000;
until test $i -eq 1; do
        MYVAR=$(date)
        /bin/sh &
        i=$((i-1));
done
[root@ftest ~]# for u in 1024 2048 3072 4096 5120 6144 7168; do ulimit -u $u; ulimit -u; /usr/bin/time ./new3.sh; done
1024
14.65user 15.96system 0:21.87elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8392447minor)pagefaults 0swaps
2048
15.82user 16.87system 0:23.40elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8840644minor)pagefaults 0swaps
3072
19.53user 16.79system 0:24.57elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9192419minor)pagefaults 0swaps
4096
21.75user 17.77system 0:26.53elapsed 148%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9427010minor)pagefaults 0swaps
5120
22.38user 18.12system 0:27.81elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9652333minor)pagefaults 0swaps
6144
22.85user 20.13system 0:29.65elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9819271minor)pagefaults 0swaps
7168
23.76user 21.28system 0:31.33elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9944758minor)pagefaults 0swaps

[root@localhost ~]# uname -r
2.6.23.1-42.fc8
[root@localhost ~]# rpm -q bash glibc kernel
bash-3.2-18.fc8
glibc-2.7-2
glibc-2.7-2
kernel-2.6.23.1-42.fc8
[root@localhost ~]# cat new3.sh
#!/bin/sh
i=10000;
until test $i -eq 1; do
        MYVAR=$(date)
        /bin/sh &
        i=$((i-1));
done
[root@localhost ~]# for u in 1024 2048 3072 4096 5120 6144 7168; do ulimit -u $u; ulimit -u; /usr/bin/time ./new3.sh; done
1024
14.56user 17.97system 0:19.89elapsed 163%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8663077minor)pagefaults 0swaps
2048
18.72user 19.62system 0:26.97elapsed 142%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9043363minor)pagefaults 0swaps
3072
21.68user 19.67system 0:29.56elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9410641minor)pagefaults 0swaps
4096
22.90user 21.24system 0:32.09elapsed 137%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9625221minor)pagefaults 0swaps
5120
23.68user 22.45system 0:34.44elapsed 133%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9850296minor)pagefaults 0swaps
6144
23.77user 21.32system 0:33.16elapsed 136%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+10037024minor)pagefaults 0swaps
7168
24.41user 21.47system 0:33.76elapsed 135%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+10143371minor)pagefaults 0swaps

[root@localhost ~]# uname -r
2.6.21-1.3194.fc7
[root@localhost ~]# rpm -q bash glibc kernel
bash-3.2-9.fc7
glibc-2.6-3
glibc-2.6-3
kernel-2.6.21-1.3194.fc7
[root@localhost ~]# cat new3.sh
#!/bin/sh
i=10000;
until test $i -eq 1; do
        MYVAR=$(date)
        /bin/sh &
        i=$((i-1));
done
[root@localhost ~]# for u in 1024 2048 3072 4096 5120 6144 7168; do ulimit -u $u; ulimit -u; /usr/bin/time ./new3.sh; done
1024
14.02user 16.25system 0:20.62elapsed 146%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8421929minor)pagefaults 0swaps
2048
14.41user 16.46system 0:21.51elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8448431minor)pagefaults 0swaps
3072
14.27user 16.62system 0:21.28elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8438555minor)pagefaults 0swaps
4096
14.05user 16.17system 0:20.73elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8440796minor)pagefaults 0swaps
5120
13.71user 15.69system 0:20.51elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8426738minor)pagefaults 0swaps
6144
14.26user 16.42system 0:20.80elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8434620minor)pagefaults 0swaps
7168
14.13user 16.49system 0:20.71elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8458371minor)pagefaults 0swaps

Comment 49 Milos Vyletel 2015-10-06 13:46:22 UTC
This indeed seems to be a glibc problem. I've installed fc7 and run the test again
after the install and after a yum update to the latest fc7. At that point both kernel
and bash were on newer versions than base fc8 and the only thing left at an older
version was glibc. Performance was indeed the same. After upgrading to the fc8 glibc
version we see the problem. Notes below.

fc7 fresh install
[root@ftest ~]# rpm -q bash kernel glibc
bash-3.2-9.fc7
kernel-2.6.21-1.3194.fc7
glibc-2.6-3
glibc-2.6-3

(1000 iterations, because it's the same as the last test)
[root@ftest ~]# for u in 1024 2048 3072 4096 5120 6144 7168; do ulimit -u $u; ulimit -u; /usr/bin/time ./new3.sh; done
1024
1.09user 1.24system 0:01.79elapsed 130%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+792073minor)pagefaults 0swaps
2048
1.17user 1.44system 0:01.88elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+793115minor)pagefaults 0swaps
3072
1.07user 1.19system 0:01.75elapsed 129%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+791090minor)pagefaults 0swaps
4096
1.10user 1.31system 0:01.85elapsed 130%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+791077minor)pagefaults 0swaps
5120
1.06user 1.17system 0:01.75elapsed 127%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+790059minor)pagefaults 0swaps
6144
1.04user 1.17system 0:01.72elapsed 128%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+791083minor)pagefaults 0swaps
7168
1.06user 1.18system 0:01.73elapsed 129%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+791054minor)pagefaults 0swaps

yum update
[root@ftest ~]# rpm -q bash kernel glibc
bash-3.2-20.fc7
kernel-2.6.21-1.3194.fc7
kernel-2.6.23.17-88.fc7
glibc-2.6-4
glibc-2.6-4

[root@ftest ~]# for u in 1024 2048 3072 4096 5120 6144 7168; do ulimit -u $u; ulimit -u; /usr/bin/time ./new3.sh; done
1024
14.56user 15.65system 0:19.42elapsed 155%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8546356minor)pagefaults 0swaps
2048
14.68user 16.00system 0:19.46elapsed 157%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8544516minor)pagefaults 0swaps
3072
15.02user 15.38system 0:19.44elapsed 156%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8543183minor)pagefaults 0swaps
4096
14.56user 15.61system 0:19.39elapsed 155%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8538693minor)pagefaults 0swaps
5120
15.14user 17.14system 0:21.33elapsed 151%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8537270minor)pagefaults 0swaps
6144
14.47user 16.78system 0:19.88elapsed 157%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8536516minor)pagefaults 0swaps
7168
14.59user 17.18system 0:19.77elapsed 160%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8541067minor)pagefaults 0swaps

yum update to fc8 glibc
Updating:
 glibc                   x86_64     2.7-2            glibc-2.7-2.x86_64.rpm   12 M
 glibc                   i686       2.7-2            glibc-2.7-2.i686.rpm   13 M
 glibc-common            x86_64     2.7-2            glibc-common-2.7-2.x86_64.rpm   86 M

[root@ftest ~]# rpm -q bash kernel glibc
bash-3.2-20.fc7
kernel-2.6.21-1.3194.fc7
kernel-2.6.23.17-88.fc7
glibc-2.7-2
glibc-2.7-2

[root@ftest ~]# for u in 1024 2048 3072 4096 5120 6144 7168; do ulimit -u $u; ulimit -u; /usr/bin/time ./new3.sh; done
1024
14.53user 16.75system 0:19.90elapsed 157%CPU (0avgtext+0avgdata 0maxresident)k
16inputs+0outputs (0major+8618544minor)pagefaults 0swaps
2048
14.77user 17.40system 0:22.21elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8996006minor)pagefaults 0swaps
3072
19.52user 16.66system 0:25.91elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9369530minor)pagefaults 0swaps
4096
19.82user 19.43system 0:28.55elapsed 137%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9591477minor)pagefaults 0swaps
5120
21.35user 21.17system 0:31.10elapsed 136%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9786261minor)pagefaults 0swaps
6144
20.94user 19.87system 0:30.58elapsed 133%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9978023minor)pagefaults 0swaps
7168
22.24user 19.91system 0:31.51elapsed 133%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+10119779minor)pagefaults 0swaps

Comment 50 Milos Vyletel 2015-10-06 14:00:45 UTC
Looks like we went from glibc 2.6-4 straight to 2.6.90-1

* Tue Jul 31 2007 Jakub Jelinek <jakub> 2.6.90-1
- update to trunk
  - private futex optimizations
  - open{,at}{,64} argument checking
- ldconfig speedups

* Sun Jul 08 2007 Jakub Jelinek <jakub> 2.6-4
...

I'll poke around the git repo tomorrow and see how many things changed. My
guess is that quite a lot did...

Comment 51 Milos Vyletel 2015-10-08 15:20:54 UTC
I did not have much luck trying to narrow this down. I've tried to bisect glibc
directly, but it does not work very well since most versions between releases
do not compile, and if they do they tend to crash.

What's even weirder is that while glibc-2.6-4.x86_64 gives me the expected
performance, when I manually rebuild the srpm it does not... I'm really not
sure why.
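
The attempt looked roughly like this (assuming the upstream glibc-2.6/glibc-2.7 release tags; build-and-test.sh stands for a hypothetical helper that builds glibc, runs new3.sh against it via LD_LIBRARY_PATH, and exits 0 when the run is fast):

git bisect start glibc-2.7 glibc-2.6    # bad first, then good
git bisect run ./build-and-test.sh      # hypothetical helper, see above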

Comment 52 Carlos O'Donell 2015-10-10 05:28:40 UTC
(In reply to Milos Vyletel from comment #51)
> I did not have much luck trying to narrow this down. I've tried to bisect glibc
> directly, but it does not work very well since most versions between releases
> do not compile, and if they do they tend to crash.
> 
> What's even weirder is that while glibc-2.6-4.x86_64 gives me the expected
> performance, when I manually rebuild the srpm it does not... I'm really not
> sure why.

I suggest:

(1) With the old glibc that runs fast:

strace -tt -fff -o fast.log ./reproducer.sh

(2) With the new glibc that runs slow:

strace -tt -fff -o slow.log ./reproducer.sh

Then look at the logs for any serious differences in number of syscalls, syscall durations or syscall timings.

You're trying to see a pattern that is causing the slowdown.

Comment 53 Milos Vyletel 2015-10-12 11:11:29 UTC
Thanks for the suggestion, Carlos. However, since the reproducer spawns tens
of thousands of processes, it produces quite a few files to inspect:

[root@ftest ~]# ls -1 fast/ | wc -l
19999

(two child processes per iteration for 9999 iterations, plus the script itself).

Instead I ran strace with the -c option to get an overall look at the syscall
stats. To simplify things I unpacked both glibc 2.6-4 and 2.7-2 locally, so I
don't have to update RPMs, used LD_LIBRARY_PATH to point to the one I wanted,
and ran it all in a simple loop:

for g in good bad; do
        echo $g;
        for u in 1024 7168; do
                ulimit -u $u;
                ulimit -u;
                LD_LIBRARY_PATH=/glibc/$g/lib64/ /usr/bin/time /root/new3.sh;
                LD_LIBRARY_PATH=/glibc/$g/lib64/ strace -tt -c /root/new3.sh;
        done;
done

Anyway, here are the results.

glibc 2.6-4

good
1024
15.73user 21.45system 0:22.21elapsed 167%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8384029minor)pagefaults 0swaps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 35.32    0.639919          32     19998           clone
 32.05    0.580559          15     39691     13706 wait4
 16.02    0.290260           1    229293           rt_sigprocmask
 12.43    0.225119          11     20576       571 read
  1.63    0.029522           1     30178       173 rt_sigaction
  1.01    0.018340           1     20007           close
  0.80    0.014534           1     19693       609 rt_sigreturn
  0.73    0.013135           1      9999           pipe
  0.01    0.000236         236         1           execve
  0.00    0.000018           3         6         3 stat
  0.00    0.000000           0        13         4 open
  0.00    0.000000           0         8           fstat
  0.00    0.000000           0         3           lseek
  0.00    0.000000           0        15           mmap
  0.00    0.000000           0         6           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         7           brk
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         1           getpid
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           getgid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    1.811642                389511     15068 total
7168
15.99user 21.24system 0:22.17elapsed 167%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8428969minor)pagefaults 0swaps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.99    0.631086          32     19998           clone
 32.27    0.582129          15     39628     13709 wait4
 16.37    0.295356           1    229219           rt_sigprocmask
 12.01    0.216624          11     20597       592 read
  1.35    0.024409           1     30192       187 rt_sigaction
  1.19    0.021405           1     19631       636 rt_sigreturn
  1.01    0.018223           1     20007           close
  0.79    0.014183           1      9999           pipe
  0.01    0.000249         249         1           execve
  0.00    0.000038           3        13         4 open
  0.00    0.000000           0         6         3 stat
  0.00    0.000000           0         8           fstat
  0.00    0.000000           0         3           lseek
  0.00    0.000000           0        15           mmap
  0.00    0.000000           0         6           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         7           brk
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         1           getpid
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           getgid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    1.803702                389347     15133 total

and glibc 2.7-2
bad
1024
15.65user 20.61system 0:21.59elapsed 167%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8476290minor)pagefaults 0swaps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.54    0.618362          31     19998           clone
 32.36    0.579311          15     39690     13627 wait4
 16.73    0.299493           1    229311           rt_sigprocmask
 11.85    0.212170          10     20507       502 read
  1.40    0.025021           1     20007           close
  1.15    0.020631           1     19692       528 rt_sigreturn
  1.15    0.020502           1     30203       198 rt_sigaction
  0.82    0.014665           1      9999           pipe
  0.00    0.000042           3        13         4 open
  0.00    0.000000           0         6         3 stat
  0.00    0.000000           0         8           fstat
  0.00    0.000000           0         3           lseek
  0.00    0.000000           0        15           mmap
  0.00    0.000000           0         6           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         6           brk
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         1           getpid
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           getgid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    1.790197                389483     14864 total
7168
25.41user 26.52system 0:37.14elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+10012168minor)pagefaults 0swaps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 35.02    0.736181          37     19998           clone
 29.23    0.614465          15     39805     14056 wait4
 16.87    0.354757          17     20471       466 read
 15.16    0.318714           1    232595           rt_sigprocmask
  1.28    0.026908           1     30133       128 rt_sigaction
  1.15    0.024134           1     20007           close
  0.71    0.014986           1     19807       499 rt_sigreturn
  0.58    0.012125           1      9999           pipe
  0.00    0.000034           3        13         4 open
  0.00    0.000000           0         6         3 stat
  0.00    0.000000           0         8           fstat
  0.00    0.000000           0         3           lseek
  0.00    0.000000           0        15           mmap
  0.00    0.000000           0         6           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0        31           brk
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         1           getpid
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           getgid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    2.102304                392916     15158 total

Comment 54 Florian Weimer 2015-10-14 09:32:57 UTC
POSIX requires that the shell keep the exit status of at least CHILD_MAX subshells.  CHILD_MAX is the value returned by sysconf (_SC_CHILD_MAX), and glibc returns the resource limit configured with “ulimit -u”.  (This means that the returned value can change due to a setrlimit call, which is a minor POSIX incompatibility, but this is a POSIX defect; there already is an exception for “ulimit -n”.)
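
This is easy to observe from a shell on RHEL 6, since getconf CHILD_MAX should report sysconf (_SC_CHILD_MAX); a minimal sketch (run as root so the limit can be set freely, and the exact values are arbitrary):

for u in 1024 4096 7810; do
        # Use a subshell so each changed limit is dropped again afterwards.
        (ulimit -u $u; echo "ulimit -u $(ulimit -u) -> CHILD_MAX $(getconf CHILD_MAX)")
done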

Anyway, bash needs to keep track of CHILD_MAX processes, and if the script never collects the exit status of those subshells, it will keep accumulating them.  After “debuginfo-install bash”, you can verify this with:

$ gdb -p $(pgrep -u 500 -f ^sh.*new3) -ex "print js" -ex quit -batch

(assuming that the reproducer is invoked as “sh new3.sh” or something similar).  On my RHEL 6.7 test system, GDB prints this after the reproducer has been running for a while:

$1 = {c_childmax = 7810, c_living = 1, c_reaped = 6850, c_injobs = 6850, c_totforked = 13702, c_totreaped = 13701, j_jobslots = 6856, j_lastj = 6849, j_firstj = 0, j_njobs = 6850, j_ndead = 6850, j_current = 6849, j_previous = 6849, j_lastmade = 0x2071fe0, j_lastasync = 0x2071fe0}

The performance degradation happens because bash uses a linear array, and scans it for the child PID on each SIGCHLD.  This suggests that the amount of work is roughly proportional to I * min(I, CHILD_MAX), where I is the initial value of i in the reproducer (10000 in the attached version).

To verify this theory, I applied this patch:

--- bash-4.1/lib/sh/oslib.c.maxchild	2015-10-14 11:08:54.584433383 +0200
+++ bash-4.1/lib/sh/oslib.c	2015-10-14 11:09:13.063521809 +0200
@@ -275,22 +275,5 @@ getmaxgroups ()
 long
 getmaxchild ()
 {
-  static long maxchild = -1L;
-
-  if (maxchild > 0)
-    return maxchild;
-
-#if defined (HAVE_SYSCONF) && defined (_SC_CHILD_MAX)
-  maxchild = sysconf (_SC_CHILD_MAX);
-#else
-#  if defined (CHILD_MAX)
-  maxchild = CHILD_MAX;
-#  else
-#    if defined (MAXUPRC)
-  maxchild = MAXUPRC;
-#    endif /* MAXUPRC */
-#  endif /* CHILD_MAX */
-#endif /* !HAVE_SYSCONF || !_SC_CHILD_MAX */
-
-  return (maxchild);
+  return 500;
 }

This artificially caps the number of tracked subshells at 500.  As a result, the accumulation of subshells is limited:

$1 = {c_childmax = 500, c_living = 2, c_reaped = 501, c_injobs = 502, c_totforked = 2063, c_totreaped = 2061, j_jobslots = 1032, j_lastj = 1030, j_firstj = 529, j_njobs = 502, j_ndead = 501, j_current = 1030, j_previous = 1030, j_lastmade = 0x941380, j_lastasync = 0x941380}

and performance is independent of “ulimit -u”.

A proper fix would have to use a different data structure to perform PID lookups (a balanced tree or a hash table).  Without PID randomization, it might be possible to exploit their non-random nature and keep using the array-based data structure, but this will obviously explode again if we ever turn on PID randomization.  There will always be a slowdown due to the increased bookkeeping overhead, but as long as the number of PIDs that need to be tracked is not in the range of millions, I expect that it will be hardly measurable with the proper data structure.

We could add an override to glibc to change the value returned by sysconf (_SC_CHILD_MAX), but if you want to go that route (which I don't recommend because it might break other scripts), please put that into bash.

Comment 57 Florian Weimer 2015-10-14 10:51:25 UTC
(In reply to Christian Horn from comment #30)
> On rhel5, I have identical execution times for the code.  For "ulimit -u"
> ranging from 16 to 16284, I get execution times between 12.36 and 12.62
> seconds.

On RHEL 5, glibc is compiled against kernel headers, which contain this:

/usr/include/linux/limits.h:#define CHILD_MAX        999    /* no limit :-) */

The glibc sysconf implementation assumes this is an actual hard limit.  As a result, sysconf (_SC_CHILD_MAX) will always return 999, and the “ulimit -u” value does not matter: bash will always track the most recent 999 subshells, and there is no performance impact from increasing “ulimit -u”.
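
As the RHEL 5 counterpart of the check above, a small sketch (again assuming getconf CHILD_MAX reflects sysconf (_SC_CHILD_MAX)):

ulimit -u 16384        # as root; pick any large value
getconf CHILD_MAX      # expected to print 999 here regardless of the limit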

Comment 70 Christian Horn 2015-10-16 11:23:09 UTC
Trying to sum up my thoughts on the options for going forward, with pluses/minuses as I see them (please correct!):

1) follow up with upstream glibc
    pro: I think this should be brought up in upstream glibc in any case,
         as an upstream change triggered this.  I think upstream also
         had "CHILD_MAX 999", so other software might be affected too.
    pro: Maybe there is a chance of fixing this without touching
         the scripts
    con: Chance that reporting this is seen as noise

2) follow up with upstream bash
    pro: Should be brought up in bash upstream as well, as not only our
         builds of bash had the issue.  Others might still use affected
         builds (like we did until today).
    pro: Maybe there is a chance of fixing this without touching
         the scripts

3) follow up with the code, I think this has 2 parts:
    a) detecting affected code:
       Describing which situation is triggering the issue.
       Is this a code structure which can be detected with
       egrep and a regex?
       Or can it be spotted in a reliable way while the script is running?
    b) If a script is affected, recommending a fix for the code,
       and describing the workaround with "ulimit -u".
       We are only aware of the OSWatcher script so far, but it is
       widely deployed.

With 1) and 2) we have a chance to solve the issue most conveniently for customers.  Applying the "ulimit -u" workaround (sketched below) still requires that one first detects that there is an issue.
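
As a concrete sketch of that workaround (oswatcher.sh is only a placeholder for whatever affected script is being run, and the chosen limit has to stay well above the number of processes the user actually needs):

# Run the affected script with a reduced per-user process limit, in a
# subshell so the rest of the session keeps its original limit.
(ulimit -u 1024; exec ./oswatcher.sh)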

Comment 71 Tomas Mraz 2015-10-16 11:38:26 UTC
I do not think 1) is a real solution, and although I am not a glibc maintainer/developer, I would reject any attempts to "fix" the problem this way.

The truth is that the script is simply broken: spawning thousands of background processes and never collecting their exit status is a bug.

2) is not a real fix either, but it can at least be seen as an optimization enhancement, and the patch could be acceptable to bash upstream.

Comment 72 Thomas Gardner 2015-10-17 19:33:10 UTC
I will go ahead and call this situation to Chet's attention, and see if maybe he wants to try to make the searches more efficient.  In fact, I'll just point him to this bug report.  Please don't let anyone make this private, or he won't be able to read it....

Comment 73 Christian Horn 2015-10-20 13:18:26 UTC
Thanks for bringing this to the attention of upstream glibc.  Is someone doing the same for bash, if that makes sense?

A review of https://access.redhat.com/solutions/1444773, which sums up the issue, would be appreciated.
We are starting to discuss fixes to the code where this issue was observed.

Comment 74 Thomas Gardner 2015-10-20 15:43:49 UTC
OK, so I wrote to Chet.  He very graciously wrote back.  As I should
have expected, he was way ahead of us on this.  He already has plans
to fix it, and is just deciding on the best solution.  Here, I'll
just let him tell you in his own words (I've snipped out my words
that he quoted from my original email to him, in case this seems a
little choppy to you):

> [...drivel from me snipped...]
> I will take a look at this.  I have one proposed patch and an idea for
> a simpler one in mind.
> 
> This was first discussed back in April:
> 
> http://lists.gnu.org/archive/html/bug-bash/2015-04/msg00069.html
> http://lists.gnu.org/archive/html/bug-bash/2015-04/msg00075.html
> 
> I applied the portion of the patch that only stores background pids
> in the bgpids list, though that won't help in the case where some
> process just forks off thousands of background pids and doesn't do
> anything with them.
> 
> [...drivel from me snipped...]
> 
> Sure.  A script can run `wait' without arguments, though (or `disown -a'
> if it doesn't want to possibly wait for a process to terminate), and
> discard all of those saved statuses.  Doing that seems just as reasonable,
> in the short term, as forking off and ignoring thousands of background
> processes, until a change gets into a bash release.
> 
> Chet

I love that disown suggestion he mentions.  I don't think any of us
noticed it before (at least I don't recall seeing it mentioned here
yet).
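
Applied to the reproducer loop, the workaround Chet describes would look roughly like this (a sketch; flushing every 500 iterations is an arbitrary choice):

i=10000
until test $i -eq 1; do
        MYVAR=$(date)
        /bin/sh &
        # Periodically throw away the saved statuses of background children
        # so bash's saved-status list never accumulates CHILD_MAX entries.
        # "wait" (with no arguments) would also work, but may block until
        # the children have exited.
        if [ $((i % 500)) -eq 0 ]; then
                disown -a
        fi
        i=$((i-1))
done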

So, I haven't yet, but I will check those mailing list references
and see if there is a BZ or something in there that we can watch.
If there isn't one, I'll ask him so that we can have something to
watch.  From there, once he has made his decision and his patch,
we can then just port that into our version.  Does that seem like
a reasonable action plan to everyone?

BTW, should this now be considered an RFE, or still a bug?

Comment 75 Thomas Gardner 2015-10-20 17:05:26 UTC
(In reply to Christian Horn from comment #73)
> Thanks for bringing this to the attention of upstream glibc.  Is someone
> doing the same for bash, if that makes sense?
> 
> A review of https://access.redhat.com/solutions/1444773, which sums up the
> issue, would be appreciated.
> We are starting to discuss fixes to the code where this issue was observed.

Sorry about that.  I guess I wasn't clear.  Chet Ramey is one of the original authors, and still the maintainer, of Bash.  I don't know who to contact about glibc.  I'm not sure if this is something to be addressed in glibc or not, really.  I have no idea what glibc _should_ be doing, or even whether it is doing the right thing or not.  One thing that is clear, though, is that the array bash is maintaining _could_ be made to work more efficiently, and Chet is already looking into it and seems to want to do it, so that will certainly fix all bash scripts.  The way I see it, the glibc folks can (and probably should) hash out what, if anything, should be done on their side independently of whatever the bash folks decide.  I honestly have no opinion on that.  I'll look at your KCS as soon as I can, but I've already promised to do other stuff, so I'll get that done first.

Comment 76 Thomas Gardner 2015-10-21 16:21:27 UTC
Turns out there's a second problem, guys:  One of the things Chet points out above (though it was a little subtle for me to pick up on at first) is that this array is not only an inefficient way to keep and search for data, it is also keeping data that it does not need to keep (making the array bigger than it has to be, and adding to the problem).  Apparently, this array has been storing information not only on background processes but on all foreground processes as well.  The information kept about foreground processes when they exit is completely unnecessary.

Just to make things delicious, the patch to fix it is apparently a simple one-liner that he already intends to include in a future release, so it's all upstream already.  Are we interested in at least putting that in?  I've explained this to the customer of case #01443177, to see if they think it will help their specific Oracle process.  My guess (having not seen the script, but having heard about its characteristics in this BZ and the case) is that it just might fix their problem, and also fix other scripts that simply run for very long periods of time, spawning reasonable numbers of background processes (instead of the greatly exaggerated numbers we've been looking at in here).

Lemme go dig up the patch, hang on.

Comment 77 Thomas Gardner 2015-10-21 17:57:33 UTC
Quick patch Chet sent to me which just stops saving irrelevant data (data about foreground processes) in that bgpids array:

***************
*** 1015,1022 ****
      return;

!   if ((dflags & DEL_NOBGPID) == 0)
      {
        proc = find_last_proc (job_index, 0);
-       /* Could do this just for J_ASYNC jobs, but we save all. */
        if (proc)
        bgp_add (proc->pid, process_exit_status (proc->status));
--- 1066,1072 ----
      return;

!   if ((dflags & DEL_NOBGPID) == 0 && (temp->flags & (J_ASYNC|J_FOREGROUND)) == J_ASYNC)
      {
        proc = find_last_proc (job_index, 0);
        if (proc)
        bgp_add (proc->pid, process_exit_status (proc->status));

Comment 78 Christian Horn 2015-10-22 08:10:51 UTC
[actually 2 issues]

I think this fits what we have seen.  We initially had trouble describing this here in the best way, as we were actually seeing multiple things:
- the stripped-down code was slower on rhel6.2+ than on rhel5
- the script, after some time, was occupying a CPU core by itself
- in "top", the script was accumulating CPU cycles much faster than on rhel5

The customer will probably agree to give a test package a try.  Clear results would take 1 or 2 weeks after installing the test package, I think.

Comment 79 Ilan Green 2015-10-22 08:37:43 UTC
Counting exec or clone calls in the straces made on site:
- sample 1: 2860 exec (2802 clone) calls in 920 seconds
- sample 2: 1287 exec (898 clone) calls in 364 seconds
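
For reference, counts like these can be pulled out of an strace capture with something along these lines (a sketch; trace.log is only a placeholder for the on-site log):

# Crude per-syscall counts; with -f traces, "resumed" lines may need to
# be filtered out first.
grep -c 'execve(' trace.log
grep -c 'clone('  trace.log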

Comment 82 Christian Horn 2015-12-16 13:13:16 UTC
Has this discussion been taken up upstream?  Any news?

Comment 83 Thomas Gardner 2015-12-17 11:21:05 UTC
(In reply to Christian Horn from comment #82)
> Has this discussion been taken up in upstream, any news?

I thought I reported back on that long ago:  It has been at least partially fixed upstream (the bit about not saving unnecessary data in that structure).  The last time I talked to him, he still wasn't sure about which way he was going to try to approach speeding up searches, but he certainly intended to do it.  That's all I know.

Comment 88 Siteshwar Vashisht 2016-07-26 10:56:32 UTC
Discussed at http://lists.gnu.org/archive/html/bug-bash/2015-04/msg00069.html

Fixed in upstream through http://git.savannah.gnu.org/cgit/bash.git/commit/?id=85ec0778f9d778e1820fb8c0e3e996f2d1103b45

jobs.c
       - delete_job: only add J_ASYNC (and not J_FOREGROUND) jobs to the
         bgpids array, since Posix says shells only need to save the status
         of async jobs.  Report and fix from John Fremlin <john>

Related change:

85ec0778 (Chet Ramey 2015-05-15 11:03:36 -0400 1155)   if ((dflags & DEL_NOBGPID) == 0 && (temp->flags & (J_ASYNC|J_FOREGROUND)) == J_ASYNC)

