From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050921 Red Hat/1.7.12-1.4.1

Description of problem:
Initial data show a regression of about -15% from U2 to U3 beta for the AIM7 fileserver workload. The delta correlates with the amount of I/O performed in the workload. The system is a 16-cpu IPF HP rx8620 with 64 GB of memory, 12 Fibre Channel controllers, and 144 ext3 filesystems spread over 144 LUNs on 6 MSA1000 FC storage arrays.

Version-Release number of selected component (if applicable):
kernel-2.6.9-27.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Configure 16 cpus, 64 GB, 144 LUNs on 12 FC controllers with 6 MSA1000s.
2. Run the AIM7 fileserver mix.

Actual Results: HP has done regression testing since GA on RHEL3 and RHEL4; the threshold for reporting is -3%.

Expected Results: Within 3-5%, unless explained by known kernel changes.

Additional info: See the attached performance graph with many RHEL4 regression data points.
Created attachment 123124 [details] AIM7 regression tests
I marked this bug as a regression.
Created attachment 123175 [details] Oprofile data on U3 - AIM7 fserver point run @ 2000
Created attachment 123176 [details] oprofile data on U2 - AIM7 fserver point run @2000
Created attachment 123177 [details] lockstat data on U3 - AIM7 fserver point run @2000
Created attachment 123178 [details] lockstat data on U2 - AIM7 fserver point run @ 2000
We did point runs of the AIM7 fserver workload at 2000 jobs both on RHEL4U2 and RHEL4U3 and ran oprofile. We also built lockstat-enabled kernels and gathered lockstats on point runs - we see no major changes in lock contention. See attachments. We also tried a run with the U2 qla2xxx driver in place of the U3 one. Preliminary results show no major difference (maybe 1.5%).
Also, any idea if this regression is ia64 specific?
No idea. The testing so far has only been on an ia64, but we believe that the regression is not ia64-specific. The qla2xxx driver is still our #1 suspect (notwithstanding the preliminary results I mentioned in comment #9 - we are planning to test the driver more carefully in the next day or two).
Yes, I am back. Attached are some IOzone and lmbench results measured on HP gear, RHEL4 U3 vs U2. Clearly we are not able to reproduce your regressions on 1, 2 or 4 filesystems.

IOzone

                  R4_U2 EXT3  R4_U3_EXT3   %Diff
Writer                 70347       70234   99.8%
Re-writer              88214       88213  100.0%
Reader                 85973       85506   99.5%
Re-reader              89678       90016  100.4%
Random Read            87647       87740  100.1%
Random Write           99174       98984   99.8%
Backward Read          83299       83587  100.3%
Record Rewrite        103825      104428  100.6%
Stride Read            87594       87792  100.2%
Overall GeoMean        87956       88028  100.1%

                  R4_U2 EXT3  R4_U3_EXT3   %Diff
Fwrite                155545      155225   99.8%
Re-fwrite             368702      368955  100.1%
Fread                 859550      877280  102.1%
Re-fread              946345      965794  102.1%
Overall GeoMean       464743      469342  101.0%

                L M B E N C H  3 . 0   S U M M A R Y
                ------------------------------------
                (Alpha software, do not distribute)

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host      OS            Mhz  null null      open slct sig  sig  fork exec sh
                             call  I/O stat clos TCP  inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
perf4.lab Linux 2.6.9-2 1831 0.36 0.49 5.38 6.68 26.2 0.56 5.72 300. 994. 2912
r4_u2     Linux 2.6.9-2 1831 0.36 0.49 5.35 6.69 26.1 0.56 5.72 303. 994. 2995
r4_u2_hug Linux 2.6.9-2 1831 0.36 0.49 5.41 6.64 26.1 0.56 5.70 299. 985. 2986

Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host      OS            intgr  intgr  intgr  intgr  intgr
                         bit    add    mul    div    mod
--------- ------------- ------ ------ ------ ------ ------
perf4.lab Linux 2.6.9-2 0.5400 0.5400 5.4000   33.6   46.0
r4_u2     Linux 2.6.9-2 0.5400 0.5500 5.4100   34.0   45.4
r4_u2_hug Linux 2.6.9-2 0.5400 0.5400 5.4000   33.5   45.3

Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
Host      OS            float  float  float  float
                         add    mul    div   bogo
--------- ------------- ------ ------ ------ ------
perf4.lab Linux 2.6.9-2 2.7400 3.7900   24.8   17.6
r4_u2     Linux 2.6.9-2 2.7100 3.8300   24.5   17.4
r4_u2_hug Linux 2.6.9-2 2.7000 3.7800   24.5   17.3

Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host      OS            double double double double
                         add    mul    div   bogo
--------- ------------- ------ ------ ------ ------
perf4.lab Linux 2.6.9-2 2.7000 3.8300   28.9   31.4
r4_u2     Linux 2.6.9-2 2.7400 3.7900   29.2   31.8
r4_u2_hug Linux 2.6.9-2 2.7000 3.7900   28.8   31.4

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host      OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
perf4.lab Linux 2.6.9-2 8.6700 8.8500 9.1300 8.8900 9.9100 8.87000    10.0
r4_u2     Linux 2.6.9-2 9.4500   11.1 9.7000 9.3100   10.5    10.4    10.9
r4_u2_hug Linux 2.6.9-2 9.0700   11.3   10.3   10.2   10.6 9.61000    13.6

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host      OS            2p/0K  Pipe AF    UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX        UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
perf4.lab Linux 2.6.9-2 8.670  24.5 27.7  38.3  44.5  44.2  58.2  65.
r4_u2     Linux 2.6.9-2 9.450  25.4 29.4  39.4  50.4  45.3  59.8  65.
r4_u2_hug Linux 2.6.9-2 9.070  25.0 28.2  34.3  50.5  45.3  52.7  69.
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host      OS              0K File       10K File     Mmap    Prot  Page   100fd
                        Create Delete Create Delete Latency Fault  Fault  selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
perf4.lab Linux 2.6.9-2   24.9   24.8   79.0   50.0   42.6K       1.666    21.2
r4_u2     Linux 2.6.9-2   25.5   25.2   82.1   50.7   45.2K       1.689    21.1
r4_u2_hug Linux 2.6.9-2   24.7   25.1   80.8   50.4   42.6K       1.663    21.2

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host      OS            Pipe AF   TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read  write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
perf4.lab Linux 2.6.9-2 213. 876. 476. 1301.5 1849.0  730.9  716.3 1781 1027.
r4_u2     Linux 2.6.9-2 203. 877. 486. 1293.5 1847.0  737.8  716.1 1780 1028.
r4_u2_hug Linux 2.6.9-2 203. 873. 480. 1307.7 1849.2  738.5  723.6 1783 1026.

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host      OS             Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
--------- -------------  ---   ----   ----    --------    --------    -------
perf4.lab Linux 2.6.9-2 1831 2.1610   15.1        95.4       409.4
r4_u2     Linux 2.6.9-2 1831 2.1910   15.1        95.4       409.9
r4_u2_hug Linux 2.6.9-2 1831 2.1670   15.1        95.5       408.9
A couple of things:

o We did a much more careful test of replacing the QLA driver with the U2 version and came up empty: there is less than 1% difference. So the preliminary result held up, and our primary suspect seems to have a good alibi.

o Shak suggested checking whether audit is turned on: it is not.

o We checked whether the elevator was switched somehow: it was not - both use CFQ.

o In trying to characterize the differences, we ran "iostat -x 30" during a point run of AIM7 fserver at a load of 2000. There are significant differences between U2 and U3. The following table shows the calculated means over the U2 and U3 runs for each iostat variable shown:

   Mean         U2     U3
   wrqm/s      375    483
   avgrq-sz     63    104
   avgqu-sz    0.8   1.06
   await       8.9   11.3
   svctm      0.68   0.82

So U3 is merging more write requests per second, it is doing larger I/Os, the queues are longer, it is taking longer to get them out of the queue, and it is taking longer to service them (the latter presumably because they are larger?)
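To put the request-size difference in byte terms (assuming iostat's avgrq-sz is reported in 512-byte sectors, as in sysstat tools of this era):

   U2: 63 sectors  x 512 B  ~ 31.5 KB average request
   U3: 104 sectors x 512 B  ~ 52 KB average request

Roughly 65% larger requests on U3, which is consistent with the longer per-request service times above.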
Does this 16-way have 16 physical cpus? Or is it dual-core and/or hyperthreaded? An output of /proc/cpuinfo would resolve the question. Thanks.
This is a Madison IPF system, so it's 16 single-core physical CPUs running the IA64 kernel.
OK. Can you try backing out linux-2.6.13-ia64-multi-core.patch? This is just a guess, if you're looking for things to try. This will also disable the CONFIG_SCHED_SMT .config option, which was added during U3. All you need is something like:

--- kernel-2.6.spec	13 Jan 2006 21:47:45 -0000	1.1357
+++ kernel-2.6.spec	23 Jan 2006 22:17:08 -0000
@@ -1624,7 +1624,7 @@
 %patch429 -p1
 %patch430 -p1
 %patch431 -p1
-%patch432 -p1
+#%patch432 -p1
 %patch433 -p1
 %patch434 -p1
 %patch435 -p1
That was a (small) step in the right direction: the point run @ 2000 attained a throughput of 17235, roughly a 3% increase.
Hmmm. Well, the ext3 changes, qlogic changes, and scheduler changes give us back about 7%, which would explain about half of this performance issue. But certainly there could be interactions between those patches, or with other patches, that would increase or decrease that figure.
We systematically applied all the patches that were new with U3. The ones that seem to account for all of the regression are patch 1997, the sched-pin-inline patch, and patch 1458, the kprobes scalability patch. Before those two were applied, the throughput @ 2000 was 18654 (and had stayed at 18700 +/- 150 with *all* the other patches). Adding patch 1997 brought the throughput down to 17240. Adding patch 1458 on top of that brought it down to 16953. Applying just patch 1458 on top of everything except patch 1997 brought the throughput down to 18349.

We are now doing experiments at the other end: starting with *no* patches applied (except 249 and 2554, otherwise rpmbuild barfs), we are applying just the two above to see if there are interactions with other patches. Also, patch 1997 contains three patches that seem independent of each other, so we'll try splitting it apart and applying the pieces.
Not sure if this patch will improve performance, but it packs the task_struct_aux structure better and could potentially help with bouncing cachelines.

--- linux-2.6.9/include/linux/sched.h.bak	2006-02-01 13:10:29.000000000 -0500
+++ linux-2.6.9/include/linux/sched.h	2006-02-01 13:11:29.000000000 -0500
@@ -468,10 +468,10 @@ struct task_struct_aux {
 	struct key *thread_keyring;	/* keyring private to this thread */
 	unsigned char jit_keyring;	/* default keyring to attach requested keys to */
 #ifndef __GENKSYMS__
-	struct key *request_key_auth;	/* assumed request_key authority */
 #if defined(CONFIG_SMP)
 	int last_waker_cpu;		/* CPU that last woke this task up */
 #endif
+	struct key *request_key_auth;	/* assumed request_key authority */
 #endif
 };
The problem seems to be caused by the third part of the patch:

--- linux-2.6.9/kernel/sched.c.orig	2005-11-17 01:56:18.000000000 -0500
+++ linux-2.6.9/kernel/sched.c	2005-11-17 02:28:16.000000000 -0500
@@ -1150,6 +1150,9 @@ static int try_to_wake_up(task_t * p, un
 
 	new_cpu = cpu;
 
+	if (task_aux(p)->last_waker_cpu != this_cpu)
+		goto out_set_cpu;
+
 	if (cpu == this_cpu || unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
 		goto out_set_cpu;
 
@@ -1224,6 +1227,8 @@ out_set_cpu:
 		cpu = task_cpu(p);
 	}
 
+	task_aux(p)->last_waker_cpu = this_cpu;
+
 out_activate:
 #endif /* CONFIG_SMP */
 	if (old_state == TASK_UNINTERRUPTIBLE) {
@@ -1295,6 +1300,9 @@ void fastcall sched_fork(task_t *p)
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->sched_info, 0, sizeof(p->sched_info));
 #endif
+#if defined(CONFIG_SMP)
+	task_aux(p)->last_waker_cpu = smp_processor_id();
+#endif
 #ifdef CONFIG_PREEMPT
 	/*
 	 * During context-switch we hold precisely one spinlock, which

I applied just this patch with no other U3 patches (except the two that are necessary for rpmbuild not to barf), and I get the bulk of the regression accounted for:

Tasks   jobs/min  jti  jobs/min/task     real       cpu
 2000   17189.90   88         8.5950   705.07  10153.04   Thu Feb  2 00:28:31 2006
 2300   16940.32   88         7.3654   822.77  11923.56   Thu Feb  2 00:42:26 2006
 2500   16801.41   89         6.7206   901.71  13097.94   Thu Feb  2 00:57:40 2006

The throughput @ 2000 without the patch is around 18740.

I didn't have access to BZ#164444 to see if this part of the patch has anything to do with it: from the description, I get the (quite possibly incorrect) impression that 164444 was solved by the first part of the sched-pin-inline patch. The question is: are the two parts independent? Can we roll back just part 3?

One more point: I have not tried rearranging the task_struct_aux fields as Jason suggested, because there is no request_key_auth field at this point - it gets added by another U3 patch. I'll go back to the full U3 and do the rearrangement, but given the data above, I don't have high expectations of it doing much to improve the situation.
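For anyone reading along without the full source handy, here is an illustrative C sketch of the control-flow change the first hunk makes in try_to_wake_up(). The struct and helper names below are invented for the sketch; only last_waker_cpu comes from the patch itself:

/*
 * Illustrative sketch only - not the real kernel/sched.c code.
 * Everything except last_waker_cpu is a hypothetical stand-in.
 */
struct task_aux_sketch {
	int last_waker_cpu;	/* CPU that last woke this task up */
};

/* hypothetical predicate mirroring the new early-out in try_to_wake_up() */
static int wake_balance_considered(const struct task_aux_sketch *aux,
				   int this_cpu)
{
	/*
	 * With the U3 patch: wake balancing (pulling the woken task
	 * toward the waking CPU) is only considered when the same CPU
	 * also performed the previous wakeup; otherwise the code jumps
	 * straight to out_set_cpu.  Under AIM7 the waking CPU is close
	 * to random on each wakeup, so this test fails almost every
	 * time and wake balancing is effectively switched off.
	 */
	return aux->last_waker_cpu == this_cpu;
}

That would explain why a workload like AIM7, which benefits from balancing at wakeup, regresses while cache-affinity-bound workloads do not.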
There was no improvement with the rearranged task_struct_aux patch on top of all the patches:

Tasks   jobs/min  jti  jobs/min/task     real       cpu
 2000   16772.83   90         8.3864   722.60  10457.63   Thu Feb  2 14:30:35 2006
 2300   16509.95   89         7.1782   844.22  12270.38   Thu Feb  2 14:44:51 2006
 2500   16422.41   89         6.5690   922.52  13460.91   Thu Feb  2 15:00:26 2006
Created attachment 124111 [details] Removing handling of DIE_PAGE_FAULT in Kprobes

For every page fault, I saw that the Kprobes exception notifier is getting called and is doing a preempt disable and then re-enable. For now I am commenting this out - can you see if this brings performance back?
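The attachment itself is not inlined here; based on the description, the change is presumably along the lines of the sketch below for the ia64 kprobes die-notifier. The function body and context lines are an assumption reconstructed from 2.6.9-era code, not a copy of attachment 124111:

/*
 * Hedged sketch of removing DIE_PAGE_FAULT handling from the ia64
 * kprobes notifier (arch/ia64/kernel/kprobes.c).  Assumption: the
 * stock notifier looked roughly like the #if 0 block, paying a
 * preempt_disable()/preempt_enable() pair on *every* page fault in
 * the system just to ask whether a kprobe was active.
 */
static int sketch_kprobe_exceptions_notify(struct notifier_block *self,
					   unsigned long val, void *data)
{
	struct die_args *args = (struct die_args *)data;
	int ret = NOTIFY_DONE;

	switch (val) {
	/* ... DIE_BREAK and friends handled as before ... */
#if 0	/* removed by the patch: ran on every page fault */
	case DIE_PAGE_FAULT:
		preempt_disable();	/* kprobe_running() needs smp_processor_id() */
		if (kprobe_running() &&
		    kprobes_fault_handler(args->regs, args->trapnr))
			ret = NOTIFY_STOP;
		preempt_enable();
		break;
#endif
	default:
		break;
	}
	return ret;
}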
Created attachment 124230 [details] fserver results with U3 beta1 *except* for patch1997-part3

The graph shows that eliminating part 3 of patch 1997 brings AIM7 fileserver performance to within 3-4% of U2 performance (with the patch, the performance drop is > 10%).

I applied Anil's patch (commenting out the DIE_PAGE_FAULT case in the kprobes code) on top of the above configuration (U3 except 1997-part3). I have not done a full run yet, but a few spot checks are very encouraging: I get 18640 @ 2000, which puts the performance drop (if any) in the noise region.
We are trying to reproduce the aim7 regression in our lab, and I can't seem to reproduce the result. How many disks does HP have in their setup? I'm using OSDL's aim7. Is that what the HP folks are using?
We are using the version from sourceforge: http://sourceforge.net/project/showfiles.php?group_id=38012&package_id=47826&release_id=89399

The configuration is a 16-cpu machine with 64 GB of memory and 12 MSA1000s with 144 disks. Each disk is 72 GB with a single ext3 filesystem.
Created attachment 124396 [details] original wake balance patch by Ingo
Nick, please experiment with patch id=124396. It is relative to kernel-2.6.9-29.EL.
Ken, I tried the patch relative to a beta1 (2.6.9-27.EL) kernel. The result is much worse: at 2000, the throughput was 14700, vs. 16900 without the patch applied and 18700 with Update 2.
Created attachment 124421 [details] dynamically control wake balancing behavior
Thank you, Nick. This experiment and your earlier experiments all point to the fact that, for aim7, the best performance is achieved with load balancing in the wakeup path. The sequence of CPUs that execute try_to_wake_up() is fairly random with the aim7 workload, and on U3 beta the last_waker_cpu check basically short-circuits some of that load-balancing action - and it hurts aim7. With patch id=124396 the effect is amplified further, since we only do load balancing when the waker CPU is idle; that sort of represents the worst-case scenario.

However, for other workloads, like TPC, the requirement is exactly the opposite: in the wakeup path, best performance is achieved with absolutely zero load balancing. We simply wake up the process on the CPU it previously ran on. Worst performance is obtained when we do load balancing at wakeup.

There isn't an easy way to auto-detect the workload characteristics. Ingo's earlier patch, which detects an idle CPU and decides whether to load balance or not, doesn't perform with aim7 since all CPUs are busy. What I'm proposing here is to add a sysctl variable to control the behavior of load balancing in the wakeup path, so the user can dynamically select the mode that best fits the workload environment, and the kernel can achieve best performance at the two extreme ends of incompatible workload environments. Patch attached (id=124421).
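Attachment 124421 is not inlined in this comment. As a rough illustration of the idea only - the sysctl name and the plumbing below are invented, not Ken's actual patch - such a knob would gate the last_waker_cpu shortcut in try_to_wake_up():

/*
 * Illustrative sketch of a sysctl-controlled wake-balancing mode, in
 * the spirit of attachment 124421.  All names are hypothetical.
 *
 *   sched_wake_balance = 1: keep considering load balancing at wakeup
 *       (the behavior that suits aim7).
 *   sched_wake_balance = 0: skip it and favor waking the task where it
 *       last ran (the behavior that suits TPC-style workloads).
 */
int sysctl_sched_wake_balance = 1;	/* would be exported via a ctl_table entry */

/* inside try_to_wake_up(), the early-out would become, roughly: */
	if (!sysctl_sched_wake_balance &&
	    task_aux(p)->last_waker_cpu != this_cpu)
		goto out_set_cpu;

The design trade-off is explicit: rather than guessing the workload in the kernel, the administrator flips one runtime switch per machine/workload.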
Ken, I ran with this latest patch on top of a U3-beta1 kernel and get the expected results: 18250 @ 2000. Anil's kprobes patch (id=124111) should get us an improvement of around 400, so provided that these two patches are adopted, we should be on par with U2 performance in the fserver benchmark.
Created attachment 124517 [details] AIM7 shared workload - U2, U3 beta1 and U3 beta1+patches

This attachment shows a graph of the results of the AIM7 shared workload on Update 2, Update 3 beta1, and Update 3 beta1 with the two patches: Ken's patch (attachment id=124421) and Anil's patch (attachment id=124111). The next attachment shows dbase results (partial: the run is not finished yet, but the region of interest is covered). Both of them show a small regression relative to Update 2 (about 0.5%) but a big improvement over the results with the unpatched Update 3 beta1. We would be happy with a release that incorporates these two patches.
Created attachment 124518 [details] AIM7 dbase workload results for U2, U3 beta1 and U3 beta1+patches See comment in previous attachment 124517 [details].
I spot-checked the 2.6.9-31 kernel (fserver @ loads between 2000 and 2500, and dbase @ loads between 10000 and 15000). The results look good: any regression from Update 2 is certainly minor, less than 0.5% (and quite possibly non-existent):

fserver

AIM Multiuser Benchmark - Suite VII Run Beginning

Tasks   jobs/min  jti  jobs/min/task     real       cpu
 2000   18669.51   87         9.3348   649.19   9544.14   Mon Feb 13 19:55:07 2006
 2300   18251.55   87         7.9355   763.66  11222.36   Mon Feb 13 20:08:03 2006
 2500   18297.63   87         7.3191   827.98  12201.98   Mon Feb 13 20:22:04 2006

AIM Multiuser Benchmark - Suite VII Testing over

dbase

AIM Multiuser Benchmark - Suite VII Run Beginning

Tasks   jobs/min  jti  jobs/min/task     real       cpu
10000   59682.12   93         5.9682   995.27  15561.54   Mon Feb 13 21:19:56 2006
12000   59753.44   94         4.9795  1192.90  18680.08   Mon Feb 13 21:39:56 2006
14000   59133.60   93         4.2238  1406.31  21963.22   Mon Feb 13 22:03:30 2006
15000   59071.80   95         3.9381  1508.33  23647.12   Mon Feb 13 22:28:46 2006

AIM Multiuser Benchmark - Suite VII Testing over
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html