| Summary: | AIM7 runs 2X faster on 2.6.32.y than RHEL6.1 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Russ Anderson <randerso> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.2 | CC: | bmarson, czhang, dwa, eguan, esandeen, gbeshers, jmoyer, kchin, kzhang, lczerner, lwoodman, martinez, perfbz, peterm, rja, rwheeler, swells, vincent |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-09-30 19:56:34 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 688933, 697639 | | |
| Attachments: | | | |
So is this a 6.1 regression compared to 6.0, or is 6.x slower than RHEL5 in general???

Larry Woodman

I'm not sure offhand of the performance on RHEL 6.0 or RHEL 5. I'll see if I can get those numbers.

Created attachment 512274 [details]
NMI stack trace for AIM7 on a medium-size UV system

Partial all-CPU stack trace while running AIM7 at 2000 tasks on 512 CPUs.
This list is representative; I eliminated idle CPUs and didn't include a
significant number of duplicates.

George
This patch looks like it's related to the NMI stack traces George posted in comment #4:

commit c35a56a090eacefca07afeb994029b57d8dd8025
Author: Theodore Ts'o <tytso>
Date: Sun May 16 05:00:00 2010 -0400

jbd2: Improve scalability by not taking j_state_lock in jbd2_journal_stop()

One of the most contended locks in the jbd2 layer is j_state_lock when
running dbench. This is especially true if using the real-time kernel
with its "sleeping spinlocks" patch that replaces spinlocks with
priority-inheriting mutexes --- but it also shows up on large SMP
benchmarks.

Thanks to John Stultz for pointing this out.

Reviewed by Mingming Cao and Jan Kara.

Signed-off-by: "Theodore Ts'o" <tytso>

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index bfc70f5..e214d68 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1311,7 +1311,6 @@ int jbd2_journal_stop(handle_t *handle)
 	if (handle->h_sync)
 		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
-	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
@@ -1340,8 +1339,7 @@ int jbd2_journal_stop(handle_t *handle)
 		jbd_debug(2, "transaction too old, requesting commit for "
 			  "handle %p\n", handle);
 		/* This is non-blocking */
-		__jbd2_log_start_commit(journal, transaction->t_tid);
-		spin_unlock(&journal->j_state_lock);
+		jbd2_log_start_commit(journal, transaction->t_tid);

 		/*
 		 * Special case: JBD2_SYNC synchronous updates require us
@@ -1351,7 +1349,6 @@ int jbd2_journal_stop(handle_t *handle)
 			err = jbd2_log_wait_commit(journal, tid);
 	} else {
 		spin_unlock(&transaction->t_handle_lock);
-		spin_unlock(&journal->j_state_lock);
 	}

 	lock_map_release(&handle->h_lockdep_map);

The commit in the previous comment is not in 2.6.32.y, so I'm not sure it explains the performance difference. It'd be worth testing, though.
The patch actually seems to have made things significantly worse.
(NOTE: the -165 kernel is the base because there is a UV regression in -166.)
[root@uvsw-sys aim7]# /usr/bin/time --verbose ./runt "2.6.32-165 unpatched"
---------------------------------------------------------------------------
Linux version 2.6.32-165.el6.x86_64 (mockbuild.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Fri Jul 1 13:16:59 EDT 2011
DATE = Mon Jul 18 08:43:21 CDT 2011
ARGS = -f -nl -y -D3600 2.6.32-165 unpatched
HOST = uvsw-sys
CPUS = 256
DIRS = 2
DISKS= 0
FS = ext4
CMDLINE = ro root=LABEL=uvsw-sysR14 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=512M virtefi selinux=0 nmi_watchdog=0 add_efi_memmap nortsched processor.max_cstate=1 log_buf_len=8M pci=hpiosize=0,hpmemsize=0,nobar nohz=off cgroup_disable=memory earlyprintk=ttyS0,115200n8 pcie_aspm=on nosoftlockup console=ttyS0,115200n8
ID = 2.6.32-165 unpatched
Run 1 of 1
AIM Multiuser Benchmark - Suite VII v1.1, January 22, 1996
Copyright (C) 1996 AIM Technology
All Rights Reserved
Datapoint file :
HZ is <100>
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 510.08 100 510.0789 11.41 1.39 Mon Jul 18 08:43:33 2011
2 1030.09 99 515.0442 11.30 2.53 Mon Jul 18 08:43:45 2011
3 1526.22 99 508.7413 11.44 4.18 Mon Jul 18 08:43:57 2011
4 2034.97 99 508.7413 11.44 5.58 Mon Jul 18 08:44:09 2011
5 2528.24 99 505.6473 11.51 7.17 Mon Jul 18 08:44:20 2011
10 4511.63 98 451.1628 12.90 25.16 Mon Jul 18 08:44:33 2011
20 7320.75 96 366.0377 15.90 89.20 Mon Jul 18 08:44:50 2011
50 8749.25 92 174.9850 33.26 1055.27 Mon Jul 18 08:45:23 2011
100 10878.50 92 108.7850 53.50 3985.79 Mon Jul 18 08:46:17 2011
150 11308.29 92 75.3886 77.20 8875.31 Mon Jul 18 08:47:34 2011
200 11808.87 91 59.0443 98.57 14908.60 Mon Jul 18 08:49:13 2011
500 12690.24 81 25.3805 229.31 47362.00 Mon Jul 18 08:53:03 2011
1000 13056.65 78 13.0566 445.75 101236.98 Mon Jul 18 09:00:29 2011
2000 13244.13 75 6.6221 878.88 197502.93 Mon Jul 18 09:15:09 2011
============================================================================
[root@uvsw-sys aim7]# /usr/bin/time --verbose ./runt "2.6.32-165.bz713953"
---------------------------------------------------------------------------
Linux version 2.6.32-165.el6.bz713953.x86_64 (root.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Mon Jul 18 08:49:11 EDT 2011
DATE = Mon Jul 18 11:26:03 CDT 2011
ARGS = -f -nl -y -D3600 2.6.32-165.bz713953
HOST = uvsw-sys
CPUS = 256
DIRS = 2
DISKS= 0
FS = ext4
CMDLINE = ro root=LABEL=uvsw-sysR14 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=512M virtefi selinux=0 nmi_watchdog=0 add_efi_memmap nortsched processor.max_cstate=1 log_buf_len=8M pci=hpiosize=0,hpmemsize=0,nobar nohz=off cgroup_disable=memory earlyprintk=ttyS0,115200n8 pcie_aspm=on nosoftlockup console=ttyS0,115200n8
ID = 2.6.32-165.bz713953
Run 1 of 1
AIM Multiuser Benchmark - Suite VII v1.1, January 22, 1996
Copyright (C) 1996 AIM Technology
All Rights Reserved
Datapoint file :
HZ is <100>
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 512.78 100 512.7753 11.35 1.35 Mon Jul 18 11:26:16 2011
2 1038.36 99 519.1793 11.21 2.37 Mon Jul 18 11:26:27 2011
3 1511.69 99 503.8961 11.55 4.49 Mon Jul 18 11:26:39 2011
4 1943.24 99 485.8097 11.98 7.49 Mon Jul 18 11:26:52 2011
5 2333.60 99 466.7201 12.47 12.13 Mon Jul 18 11:27:04 2011
10 3798.96 99 379.8956 15.32 50.38 Mon Jul 18 11:27:20 2011
20 4752.96 99 237.6480 24.49 287.17 Mon Jul 18 11:27:45 2011
50 4990.57 99 99.8114 58.31 2397.91 Mon Jul 18 11:28:43 2011
100 5860.44 99 58.6044 99.31 8775.86 Mon Jul 18 11:30:23 2011
150 6158.30 99 41.0553 141.76 19220.22 Mon Jul 18 11:32:45 2011
200 6293.59 98 31.4680 184.95 33049.52 Mon Jul 18 11:35:50 2011
500 6602.38 84 13.2048 440.75 100128.64 Mon Jul 18 11:43:12 2011
1000 6702.52 81 6.7025 868.33 207472.07 Mon Jul 18 11:57:41 2011
Created attachment 513655 [details]
Aim7 used in testing
I used a simple ./runt as shown in the posted results.
Created attachment 513656 [details]
NMI stack trace of 2.6.32-165 with proposed patch applied.
I posted the two patches that Shak verified fix this problem:
1.)
commit c35a56a090eacefca07afeb994029b57d8dd8025
Author: Theodore Ts'o <tytso>
Date: Sun May 16 05:00:00 2010 -0400
jbd2: Improve scalability by not taking j_state_lock in jbd2_journal_stop()
One of the most contended locks in the jbd2 layer is j_state_lock when
running dbench. This is especially true if using the real-time kernel
with its "sleeping spinlocks" patch that replaces spinlocks with
priority inheriting mutexes --- but it also shows up on large SMP
benchmarks.
Thanks to John Stultz for pointing this out.
Reviewed by Mingming Cao and Jan Kara.
Signed-off-by: "Theodore Ts'o" <tytso>
2.)
commit 965f55dea0e331152fa53941a51e4e16f9f06fae
Author: Shaohua Li <shaohua.li>
Date: Tue May 24 17:11:20 2011 -0700
mmap: avoid merging cloned VMAs
Avoid merging a VMA with another VMA which is cloned from the parent process.
The cloned VMA shares the anon_vma lock with the parent process's VMA. If
we do the merge, more VMAs (even when the new range is only for the current
process) use the parent process's anon_vma lock. This introduces
scalability issues. find_mergeable_anon_vma() already considers this
case.
Signed-off-by: Shaohua Li <shaohua.li>
Cc: Rik van Riel <riel>
Cc: Hugh Dickins <hughd>
Cc: Andi Kleen <andi>
Signed-off-by: Andrew Morton <akpm>
Signed-off-by: Linus Torvalds <torvalds>
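For reference, the heart of patch 2 is a small guard in the VMA-merge path of mm/mmap.c. The sketch below is a close paraphrase of the upstream change (kernel-side code, shown for illustration only; the exact RHEL backport may differ):

static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
					struct anon_vma *anon_vma2,
					struct vm_area_struct *vma)
{
	/*
	 * The list_is_singular() test avoids merging a VMA cloned from a
	 * parent: such a VMA's anon_vma_chain has more than one entry, so
	 * merging it would spread the parent's anon_vma lock across even
	 * more VMAs.
	 */
	if ((!anon_vma1 || !anon_vma2) && (!vma ||
	    list_is_singular(&vma->anon_vma_chain)))
		return 1;
	return anon_vma1 == anon_vma2;
}

With this check, a freshly created VMA in a forked child keeps its own anon_vma instead of piling onto the parent's, which is exactly the brk()/munmap() contention visible in the NMI backtraces in the original report.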
AIM7 runs 2X faster on 2.6.32.y than RHEL6.1
https://bugzilla.redhat.com/show_bug.cgi?id=713953
Summary: 1.9x speedup with AIM7 on FusionIO
Tasks jobs/min jti jobs/min/task real cpu
RHEL6.1 10000 265289.15 64 26.5289 228.43 12142.19 Sat Jun 12 22:14:34 2010
Larry6.2 10000 504117.79 69 50.4118 120.21 7697.32 Thu Jul 21 14:28:36 2011
Larry6.2 == 2.6.32-169.el6.andi.x86_64
I will post to the BZ but wanted to give quick feedback to Larry/all.
Details:
RHEL6.1 2.6.32-131.0.15.el6.x86_64
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 397.64 100 397.6378 15.24 5.24 Sat Jun 12 21:52:06 2010
101 21355.90 94 211.4445 28.66 1131.42 Sat Jun 12 21:52:35 2010
201 42530.03 94 211.5922 28.64 1132.11 Sat Jun 12 21:53:04 2010
301 65874.32 91 218.8516 27.69 1019.74 Sat Jun 12 21:53:32 2010
401 85205.47 90 212.4825 28.52 1084.43 Sat Jun 12 21:54:01 2010
501 111907.85 88 223.3690 27.13 981.28 Sat Jun 12 21:54:29 2010
601 119844.03 87 199.4077 30.39 1155.42 Sat Jun 12 21:54:59 2010
701 133460.89 85 190.3864 31.83 1211.88 Sat Jun 12 21:55:31 2010
905 149885.21 84 165.6190 36.59 1473.14 Sat Jun 12 21:56:08 2010
1343 191360.92 77 142.4877 42.53 1797.61 Sat Jun 12 21:56:51 2010
2290 218472.92 68 95.4030 63.52 2927.11 Sat Jun 12 21:57:55 2010
2669 222968.57 70 83.5401 72.54 3360.75 Sat Jun 12 21:59:09 2010
3480 237406.28 67 68.2202 88.83 4342.04 Sat Jun 12 22:00:39 2010
4291 243455.29 66 56.7363 106.81 5299.78 Sat Jun 12 22:02:27 2010
5102 249782.84 66 48.9578 123.78 6263.25 Sat Jun 12 22:04:32 2010
6852 258582.14 65 37.7382 160.58 8388.74 Sat Jun 12 22:07:14 2010
8602 252509.78 64 29.3548 206.44 10774.35 Sat Jun 12 22:10:43 2010
10000 265289.15 64 26.5289 228.43 12142.19 Sat Jun 12 22:14:34 2010
AIM Multiuser Benchmark - Suite VII
With 2.6.32-169.el6.andi.x86_64 (4 of the 5 patches)
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 483.25 100 483.2536 12.54 2.53 Thu Jul 21 14:15:22 2011
101 53548.56 98 530.1837 11.43 78.82 Thu Jul 21 14:15:34 2011
201 97056.57 96 482.8685 12.55 152.97 Thu Jul 21 14:15:47 2011
301 134716.40 94 447.5628 13.54 233.41 Thu Jul 21 14:16:00 2011
401 164082.38 92 409.1830 14.81 304.59 Thu Jul 21 14:16:15 2011
501 193256.52 91 385.7416 15.71 384.27 Thu Jul 21 14:16:31 2011
601 217436.42 90 361.7910 16.75 463.08 Thu Jul 21 14:16:48 2011
701 237853.30 88 339.3057 17.86 539.16 Thu Jul 21 14:17:06 2011
907 272910.63 86 300.8937 20.14 695.99 Thu Jul 21 14:17:26 2011
1345 327864.04 83 243.7651 24.86 1032.19 Thu Jul 21 14:17:51 2011
1783 362704.93 80 203.4240 29.79 1366.86 Thu Jul 21 14:18:21 2011
2745 412976.66 77 150.4469 40.28 2104.72 Thu Jul 21 14:19:02 2011
3153 427358.09 76 135.5401 44.71 2417.29 Thu Jul 21 14:19:47 2011
4013 448685.98 74 111.8081 54.20 3080.88 Thu Jul 21 14:20:41 2011
4873 464972.13 73 95.4180 63.51 3735.04 Thu Jul 21 14:21:45 2011
6740 484110.47 71 71.8265 84.37 5165.47 Thu Jul 21 14:23:10 2011
7541 492652.65 70 65.3299 92.76 5801.03 Thu Jul 21 14:24:44 2011
9222 503471.35 69 54.5946 111.00 7072.96 Thu Jul 21 14:26:35 2011
10000 504117.79 69 50.4118 120.21 7697.32 Thu Jul 21 14:28:36 2011
(Thanks a lot, Larry!)

Russ -- Could you please test and verify those patches as well? Thanks!

The results look good.
---------------------------------------------------------------------------
Linux version 2.6.32-131.0.15.el6.x86_64 (mockbuild.bos.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Tue May 10 15:42:40 EDT 2011
DATE = Mon Sep 26 10:21:57 CDT 2011
ARGS = -f -nl -y -D3600 2.6.32-131.0.15.el6.x86_64
HOST = uvmid5-sys
CPUS = 48
DIRS = 1
DISKS= 0
FS = ext4
CMDLINE = ro root=LABEL=mid5-sysR14 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=512M virtefi selinux=0 nmi_watchdog=0 add_efi_memmap nortsched processor.max_cstate=1 log_buf_len=8M pci=hpiosize=0,hpmemsize=0,nobar nohz=off cgroup_disable=memory earlyprintk=ttyS0,115200n8 pcie_aspm=on console=ttyS0,115200n8
ID = 2.6.32-131.0.15.el6.x86_64
Run 1 of 1
AIM Multiuser Benchmark - Suite VII v1.1, January 22, 1996
Copyright (C) 1996 AIM Technology
All Rights Reserved
Datapoint file :
HZ is <100>
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 523.85 100 523.8524 11.11 1.12 Mon Sep 26 10:22:08 2011
2 1057.22 99 528.6104 11.01 2.02 Mon Sep 26 10:22:20 2011
3 1516.94 99 505.6473 11.51 4.49 Mon Sep 26 10:22:31 2011
4 1989.74 99 497.4359 11.70 6.49 Mon Sep 26 10:22:43 2011
5 2393.09 99 478.6184 12.16 10.51 Mon Sep 26 10:22:55 2011
10 3801.44 99 380.1437 15.31 37.79 Mon Sep 26 10:23:11 2011
20 6388.58 99 319.4292 18.22 159.16 Mon Sep 26 10:23:30 2011
50 7987.92 99 159.7584 36.43 1251.69 Mon Sep 26 10:24:06 2011
100 9953.82 97 99.5382 58.47 2308.86 Mon Sep 26 10:25:05 2011
150 10846.07 96 72.3071 80.49 3360.18 Mon Sep 26 10:26:26 2011
200 11373.85 94 56.8693 102.34 4416.56 Mon Sep 26 10:28:08 2011
500 12459.86 90 24.9197 233.55 10707.00 Mon Sep 26 10:32:02 2011
1000 12882.38 87 12.8824 451.78 21166.39 Mon Sep 26 10:39:34 2011
2000 13070.13 82 6.5351 890.58 42207.51 Mon Sep 26 10:54:25 2011
4000 13097.86 79 3.2745 1777.39 84693.22 Mon Sep 26 11:24:04 2011
8000 13153.32 76 1.6442 3539.79 169012.14 Mon Sep 26 12:23:06 2011
RHEL6.2 (development) kernel
---------------------------------------------------------------------------
[root@uvmid5-sys aim7]# /usr/bin/time --verbose ./runt "2.6.32-71.el6.x86_64.uv"
---------------------------------------------------------------------------
Linux version 2.6.32-71.el6.x86_64.uv (abuild@alcatraz) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Fri Sep 23 01:12:00 EDT 2011
DATE = Mon Sep 26 17:47:51 CDT 2011
ARGS = -f -nl -y -D3600 2.6.32-71.el6.x86_64.uv
HOST = uvmid5-sys
CPUS = 48
DIRS = 1
DISKS= 0
FS = ext4
CMDLINE = ro root=LABEL=mid5-sysR14 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=512M virtefi selinux=0 nmi_watchdog=0 add_efi_memmap nortsched processor.max_cstate=1 log_buf_len=8M pci=hpiosize=0,hpmemsize=0,nobar nohz=off cgroup_disable=memory earlyprintk=ttyS0,115200n8 pcie_aspm=on console=ttyS0,115200n8
ID = 2.6.32-71.el6.x86_64.uv
Run 1 of 1
AIM Multiuser Benchmark - Suite VII v1.1, January 22, 1996
Copyright (C) 1996 AIM Technology
All Rights Reserved
Datapoint file :
HZ is <100>
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
1 522.44 100 522.4417 11.14 1.12 Mon Sep 26 17:48:02 2011
2 1058.18 99 529.0909 11.00 1.98 Mon Sep 26 17:48:13 2011
3 1567.32 99 522.4417 11.14 3.37 Mon Sep 26 17:48:25 2011
4 2056.54 99 514.1343 11.32 4.88 Mon Sep 26 17:48:36 2011
5 2554.87 99 510.9745 11.39 6.78 Mon Sep 26 17:48:48 2011
10 4667.20 97 466.7201 12.47 19.21 Mon Sep 26 17:49:00 2011
20 8214.54 96 410.7269 14.17 40.80 Mon Sep 26 17:49:14 2011
50 13049.33 99 260.9865 22.30 574.54 Mon Sep 26 17:49:37 2011
100 17946.35 97 179.4635 32.43 1063.28 Mon Sep 26 17:50:09 2011
150 19800.41 94 132.0027 44.09 1620.03 Mon Sep 26 17:50:54 2011
200 21393.13 90 106.9656 54.41 2111.85 Mon Sep 26 17:51:48 2011
500 24982.83 86 49.9657 116.48 5087.39 Mon Sep 26 17:53:44 2011
1000 26419.72 80 26.4197 220.29 10049.33 Mon Sep 26 17:57:25 2011
2000 27218.52 77 13.6093 427.65 19991.90 Mon Sep 26 18:04:33 2011
---------------------------------------------------------------------------
Great!

Hey Larry -- I believe you can take it from here. Thanks everyone!

What else do I need to do? The patches are in 6.2.

Larry

I guess this is a Q for Aris then. This BZ is still on POST, which would indicate the patches have not been committed to the RHEL 6.2 tree yet. I'll send him a note. Thanks!

Sorry for the confusion here. I opened 2 separate BZs (721044 & 725855) for these 2 separate patches, since they also fixed other problems reported by Intel on 2 separate occasions. Together they also fixed BZ713953. These are the 2 commits:

commit b562ced54d5e23f86ef8523f2e6e87d8e4a8e5d7
Author: Larry Woodman <lwoodman>
Date: Tue Jul 26 18:37:51 2011 -0400

[fs] jbd2: Improve scalability by not taking j_state_lock in jbd2_journal_stop()

Message-id: <4E2F097F.3070700>
Patchwork-id: 39103
O-Subject: [RHEL6.2 V2 Patch] jbd2: Improve scalability by not taking j_state_lock in jbd2_journal_stop()
Bugzilla: 721044
RH-Acked-by: Eric Sandeen <sandeen>
RH-Acked-by: Rik van Riel <riel>

fixes BZ721044

commit c35a56a090eacefca07afeb994029b57d8dd8025
Author: Theodore Ts'o <tytso>
Date: Sun May 16 05:00:00 2010 -0400

commit 347d4e7a137ff704bb8fa0ec66155f2ef068e869
Author: Larry Woodman <lwoodman>
Date: Tue Jul 12 20:38:26 2011 -0400

[mm] Avoid merging a VMA with another VMA which is cloned from the parent process.

Message-id: <4E1CB0C2.5050004>
Patchwork-id: 37430
O-Subject: [RHEL6.2 Patch] Avoid merging a VMA with another VMA which is cloned from the parent process.
Bugzilla: 725855
RH-Acked-by: Rik van Riel <riel>
RH-Acked-by: Johannes Weiner <jweiner>

During weekly partner performance meetings it was brought to our attention that RHEL6 is missing this upstream performance optimization that avoids merging some VMAs which are cloned from the parent process:

commit 965f55dea0e331152fa53941a51e4e16f9f06fae
Author: Shaohua Li <shaohua.li>
Date: Tue May 24 17:11:20 2011 -0700

mmap: avoid merging cloned VMAs

No worries! Closing this item as a DUP then.

*** This bug has been marked as a duplicate of bug 725855 ***
Description of problem:

AIM7 runs 2X faster on community 2.6.32.y than RHEL6.1. Testing was done on a 48p system using 8 RAM-backed tmpfs file systems.

The performance degradation in RHEL6.1 appears to be related to handling of anon_vma locks. The problem affects workloads that do frequent forks without execs, i.e. they build large trees of shared ANON space. AIM7 is a prime example of this type of workload. Other similar workloads are some file servers, mail servers, web servers, etc.

The problem is not unique to SGI UV. Any multi-socket system will likely show a performance regression on RHEL6.1 running fork-intensive workloads. SuSE SLES11SP1 performs similarly to community 2.6.33.

Version-Release number of selected component (if applicable):
RHEL 6.1.

How reproducible:
100%.

Steps to Reproduce:
1. Run AIM7.

Actual results:

AIM7 was run on a 48p system on both 2.6.32.y & RHEL6.1. Results show that 2.6.32.y is ~2X faster for jobs/sec at higher loads.

Tasks    Jobs/Min    Jobs/Min
         2.6.32.y    RHEL6.1
-----    --------    --------
    1         526         521
    5        2678        2658
   10        5352        5332
   20       10704       10644
   50       26340       24338
  100       48387       36655
  200       86429       61990
  500      162393       98992
 1000      227091      123617
 2000      279274      153225
 4000      321625      164978
 8000      340756      171081
16000      354395      185592
32000      356535      192537

* Higher is better

Expected results:
Performance comparable to community 2.6.32.y.

Additional info:

AFAICT, the performance degradation in RHEL6.1 is caused by a very hot anon_vma spinlock. Running with a user load of 1000, kernel profiling (perf top) shows long periods of 50%-70% of the time in _spin_lock. The same run on 2.6.33 rarely shows more than ~10% in _spin_lock.

An NMI on RHEL6 shows numerous tasks in _spin_lock with the following backtraces:

 [<ffffffff814ddae1>] ? _spin_lock+0x21/0x30
 [<ffffffff811407f4>] ? unlink_anon_vmas+0x94/0xd0
 [<ffffffff81133f94>] ? free_pgtables+0x44/0x120
 [<ffffffff8113cc3d>] ? unmap_region+0xcd/0x130
 [<ffffffff8113d2c6>] ? do_munmap+0x2b6/0x3a0
 [<ffffffff8113d8a1>] ? sys_brk+0x121/0x130
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b

 [<ffffffff814ddae1>] ? _spin_lock+0x21/0x30
 [<ffffffff811402d5>] ? anon_vma_chain_link+0x35/0x60
 [<ffffffff8114087d>] ? anon_vma_clone+0x4d/0x90
 [<ffffffff8113be7d>] ? __split_vma+0xcd/0x280
 [<ffffffff8113d197>] ? do_munmap+0x187/0x3a0
 [<ffffffff810d1b62>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8113d8a1>] ? sys_brk+0x121/0x130
 [<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b

A google search for "linux aim7 anon:" shows numerous references to what appears to be this same problem:

http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.27.36
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-09/msg06653.html
http://linuxkernelpanic.blogspot.com/2010/05/while-getting-in-touch-recently-with-ex.html
http://comments.gmane.org/gmane.linux.kernel.mm/62645

The code in 2.6.32.y & RHEL6.1 for handling anon VMAs is different; neither matches recent upstream kernels. The above links include a patch that is supposed to address a scaling issue with anon locks. Both SLES & RH appear to have a variant of the patch, but neither matches exactly. However, more analysis is required.

To get further data on the difference, I ran several different kernels. The runs were on a different hardware system (uvmid1) & the results are close but cannot be directly compared to the results above. The following is a single datapoint at 1000 users. Full AIM7 curves were not run.
OS-VERSION    Task/sec    Wall      CPU
----------    --------    ------    ------
RHEL6.1         134751      42.3    1535.8
SLES11SP1       210099      27.1     712.6
2.6.32.y        212845      26.8     772.0
2.6.39          155398      36.6    1101.0
3.0.0-rc2       127831      44.6    1103.4

Looks like a definite regression in RHEL6.1. Both SLES11SP1 & RHEL6.1 are based on the 2.6.32 kernel. The community 2.6.32.y kernel (the base for both distros) has performance similar to SLES11SP1. More recent upstream kernels have changed the code related to ANON VMAs & show regressions. In fact, there was mail yesterday about a very recent regression in this area for 3.0.0:

Subject: REGRESSION: Performance regressions from switching anon_vma->lock to mutex

It seems that the recent change to make the anon_vma->lock into a mutex (commit 2b575eb6) causes a 52% regression in throughput (2.6.39 vs 3.0-rc2) on the exim mail server workload in the MOSBENCH test suite. Our test setup is a 4-socket Westmere EX system with 10 cores per socket. 40 clients are created on the test machine, which send email to the exim server residing on the same test machine.

There is an ongoing community discussion about this 3.0.0 regression. FWIW, upstream folks claim that the locking issues are seen on 4-socket whiteboxes, but not 2-socket whiteboxes.
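To make the failure mode easy to poke at, here is a minimal, hypothetical userspace reproducer sketch (plain C; it is not AIM7 and not from this BZ, just an illustration of the fork-without-exec pattern described above). A parent creates an anonymous mapping, forks a tree of children that inherit it, and each child then churns anonymous map/touch/unmap cycles; on the affected kernels those paths funnel through the shared anon_vma lock, so time in _spin_lock under perf top should grow with the child count:

/* anonstress.c -- hypothetical reproducer sketch, not from this BZ.
 * Build: gcc -O2 -o anonstress anonstress.c
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILDREN   48       /* roughly one per CPU on the 48p test system */
#define ITERATIONS 10000

int main(void)
{
	/* Anonymous region created before the forks, so every child's
	 * cloned VMAs are linked to the parent's anon_vma -- the lock
	 * the NMI backtraces show contended. */
	size_t len = 64UL * 1024 * 1024;
	char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (shared == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(shared, 1, len);

	for (int i = 0; i < CHILDREN; i++) {
		if (fork() == 0) {
			/* Each child repeatedly creates, touches, and tears
			 * down anonymous mappings; the unmap path takes the
			 * anon_vma lock (see unlink_anon_vmas in the
			 * backtraces above). */
			for (int j = 0; j < ITERATIONS; j++) {
				char *p = mmap(NULL, 1 << 20,
					       PROT_READ | PROT_WRITE,
					       MAP_PRIVATE | MAP_ANONYMOUS,
					       -1, 0);
				if (p == MAP_FAILED)
					_exit(1);
				p[0] = 1;
				munmap(p, 1 << 20);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	munmap(shared, len);
	return 0;
}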