=Comment: #0=================================================
Amrith Anandan <amrith.anandan.com> -
OS: MRG    H/W: (x3650 M2) Victory

--- Running testcase matrix_mult -c 0.75 ---
Fri Jan 9 17:24:33 PST 2009
Logging to /home/cheezo/linux-rt-tests/ltp/testcases/realtime/logs/elm3b120-x86_64-2.6.24.7-94.el5rt-2009-09-01-matrix_mult-c0.75.log
---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------
Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16

Running sequential operations
Min: 58274 us
Max: 58503 us
Avg: 58304.0000 us
StdDev: 49.1937 us

Running concurrent operations
Min: 96340 us
Max: 101415 us
Avg: 6220.0000 us
StdDev: 974.8007 us

Concurrent Multipliers:
Min: 0.6049
Max: 0.5769
Avg: 9.3736

Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL

Fri Jan 9 17:24:41 PST 2009
The matrix_mult test appears to have completed.

=Comment: #1=================================================
Amrith Anandan <amrith.anandan.com> -
16x2.67 GHz (2 sockets, Intel Xeon quad-core Nehalem, 2 threads/core)

=Comment: #3=================================================
Dinakar Guniguntala <dino.com> -
I was going through the Intel Nehalem manuals today, and it looks like with HT turned on we can expect a performance improvement in the range of 20%-30% given the right application. The results indicate a ~5% improvement in performance, which is expected, as matrix_mult is pretty cpu intensive. So I think we can do one of two things:

1. Turn off HT in the BIOS and rerun the tests. matrix_mult should pass in this case.
2. Update the matrix_mult test case to check for threading and use a lower threshold.

I think we should go with 1 for now, to confirm whether HT is the issue.

=Comment: #5=================================================
Sripathi Kodi <sripathik.com> -
Initial observations: this machine has 16 virtual cpus:
2 threads per core
4 cores per socket
2 sockets
Test run as-is: FAIL.
Then I changed the test case to make numcpus=8 (it is sysconf(_SC_NPROCESSORS_ONLN) by default), so it runs on only 8 cpus. I then ran it on different combinations of 8 cpus:

taskset -c 1,3,5,7,9,11,13,15 ./matrix_mult : PASS (on 1st thread of every cpu core)
taskset -c 0,2,4,6,8,10,12,14 ./matrix_mult : PASS (on 2nd thread of every cpu core)
taskset -c 0,1,2,3,4,5,6,7 ./matrix_mult : PASS (on first 8 cpus, which are all on the same socket)

So it looks like this has to do with the number of cpus being increased to 16. I will do the following:
1) Compare the results of the above runs.
2) Look closely at the test case to see whether it needs changes to run on 16 cpus.

=Comment: #6=================================================
Sripathi Kodi <sripathik.com> -
OK, here is some more analysis. With the modified test case that spawns only 8 threads:

taskset -c 1,3,5,7,9,11,13,15 ./matrix_mult
Concurrent Multipliers: Avg: 8.4955 (it was always very close to 8.5)

taskset -c 0,2,4,6,8,10,12,14 ./matrix_mult
Concurrent Multipliers: Avg: 8.4969 (again, very close to 8.5)

taskset -c 0,1,2,3,4,5,6,7 ./matrix_mult
Concurrent Multipliers: Avg: 8.4936 (same as the above two)

So the above suggests that HT does not have much effect in this configuration/testcase. The last run also suggests that the problem may not be due to cache line bouncing between the two cpus, but due to contention for some locks as the number of cpus increases. Also, with the default 16 threads, increasing the load (-l) up to 128 did not seem to make any difference to the numbers. I wonder if that option works at all.

=Comment: #7=================================================
Sripathi Kodi <sripathik.com> -
Yet another interesting data point: I deployed another 16-cpu x86_64 machine (elm3a176). This machine has 4 sockets, dual-core, 2 threads per core. Even on this machine, matrix_mult failed. The concurrent multiplier average was around 9.9 on this machine.
Interestingly, running with 8 threads on 8 cpus also fails on this machine! Ideally I should have tried it on a 16-cpu machine with no HT. However, all data so far points towards some lock contention at high cpu counts. The next step is to run oprofile and see what it shows. Before that, I want to try a few things on bug 51071 (calibrate problem) to see if that too exhibits similar problems.

=Comment: #8=================================================
Dinakar Guniguntala <dino.com> -
> So the above suggests that HT does not have much effect in this
> configuration/testcase. Also, the last test suggests that the problem may not
> be due to cache line bouncing between the two cpus, but due to contention for
> some locks as the number of cpus increases.

I would suggest turning off HT in the BIOS and then running your tests before you draw any conclusions here. That is definitely a data point we need in order to figure out what exactly is going on.

=Comment: #9=================================================
Sripathi Kodi <sripathik.com> -
I booted elm3a112 (Victory) with maxcpus=8. This essentially shuts off the 8 virtual cpus in the second socket. With this, I have seen the test always pass.

=Comment: #10=================================================
Darren V. Hart <dvhltc.com> -
> I booted elm3a112 (Victory) with maxcpus=8. This essentially shuts off the 8
> virtual cpus in the second socket. With this, I have seen the test always pass.

That is weird; you also said:

> taskset -c 0,1,2,3,4,5,6,7 ./matrix_mult : PASS (on first 8 cpus, which are
> all on the same socket)

Both of those statements actually argue against HT being the problem, since even at 8-way we'd expect problems if we're counting hardware threads as CPUs. And the concurrent multipliers didn't seem to change between runs on completely separate cores vs. those using the hardware threads.
=Comment: #11=================================================
Sripathi Kodi <sripathik.com> -
Yes, all my testing so far has suggested that the problem is more likely due to the increased number of cpus rather than HT.

=Comment: #12=================================================
Sripathi Kodi <sripathik.com> -
I tried to use oprofile on this machine to understand where all the time is being spent. However, oprofile doesn't seem to work on this machine; it fails with the error "opreport error: No sample file found: try running opcontrol --dump or specify a session containing sample files". We have seen such bugs in the past (bug 41292, bug 43249, bug 45219). The error probably means the new Nehalem cpu isn't supported by the MRG kernel. I then passed nmi_watchdog=1 to the kernel and this got oprofile working. However, IIRC this means oprofile is using a low sampling rate, which makes it not very useful. I searched Linus' kernel git tree for the string "Nehalem" and found some patches. We might need to backport some of these to get oprofile working on this machine. I will raise a new bug for this.

=Comment: #14=================================================
Sripathi Kodi <sripathik.com> -
Going through the matrix_mult test case, I feel it should be a reasonable test even on a non-rt system. I ran the test on vanilla kernels (a RHEL5.2 kernel and a 2.6.28-based kernel) and, sure enough, it passes, though with slightly lower performance compared to real-time kernels. I then ran a 2.6.28.2 kernel on Victory (elm3a112) and saw that the test fails, producing numbers very close to the MRG kernels. This is another indication that the problem may be the inability of the test to perform well on a machine with a high number of cpus. The problem could be with the test case too! I have gotten oprofile to work on the 2.6.28.2 kernel, but not yet on the MRG kernel. Please see bug 51340 for details about that.
I will run the test under oprofile and see if I can gather more information.

=Comment: #15=================================================
Dinakar Guniguntala <dino.com> -
These are Destiny systems (x3550 M2) and not Victory (x3650 M2).

=Comment: #16=================================================
Sripathi Kodi <sripathik.com> -
Using Kiran's work in bug 51340, we got a reasonably working oprofile on MRG kernels. I say reasonably because oprofile only seems to work once; the second time I try to start it, the system crashes/hangs. Highlights from the profile look like this:

opreport --long-filenames
------------
CPU: Intel Architectural Perfmon, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
 samples|      %|
------------------
  636645 88.3410 /usr/src/redhat/BUILD/kernel-rt-2.6.24.7-100/linux-2.6.24.7.x86_64_kiran/vmlinux
   72667 10.0833 /home/rtuser/ltp-full-20081231/testcases/realtime/func/matrix_mult/matrix_mult
    2412  0.3347 /lib64/libc-2.5.so
    2220  0.3080 /usr/local/bin/oprofiled
    1566  0.2173 /ehci_hcd
<snip>

opreport of vmlinux
--------------------------
CPU: Intel Architectural Perfmon, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
160248   24.5952  read_hpet
45880     7.0418  apic_timer_interrupt
42684     6.5512  find_busiest_group
35257     5.4113  __schedule
34376     5.2761  acpi_safe_halt
30016     4.6069  cpupri_set
12804     1.9652  source_load
<snip>

opreport of matrix_mult
-------------------------------
CPU: Intel Architectural Perfmon, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
71561    98.4780  matrix_mult
1098      1.5110  matrix_init
7         0.0096  calibrate_busyloop
1         0.0014  concurrent_thread

I will attach the output of "opannotate --assembly
--source <vmlinux>" to this bug.

=Comment: #17=================================================
Darren V. Hart <dvhltc.com> -
If I'm reading the oprofile output correctly, we have an order of magnitude more samples in the kernel than we do in userspace, and 24% of the time in the kernel is spent reading the hpet? The userspace side looks on track, with 98+% spent running the test. How does this compare to when we are running with only 8 CPUs?

=Comment: #18=================================================
Sripathi Kodi <sripathik.com> -
The opannotate output was too big to be attached. I have put it here: http://kernel.beaverton.ibm.com/~sripathi/51189/opannotate_vmlinux.txt

=Comment: #19=================================================
Sripathi Kodi <sripathik.com> -
As another data point, I disabled HT in the BIOS. This means we will have only 8 cpus. With this, matrix_mult passes as expected, with a gain of around 8.5 against an expected 6:

Running sequential operations
Min: 97205 us
Max: 97358 us
Avg: 97222.0000 us
StdDev: 15.4834 us

Running concurrent operations
Min: 91079 us
Max: 93399 us
Avg: 11447.0000 us
StdDev: 518.4039 us

Concurrent Multipliers:
Min: 1.0673
Max: 1.0424
Avg: 8.4932

The numbers are almost identical to the test being run on just 8 cpus using taskset, as mentioned in comment #5. With this, we can safely conclude that the problem has nothing to do with Hyper-Threading, but rather with the increased number of cpus. We will now have to treat this as a generic scalability problem.

=Comment: #20=================================================
Darren V. Hart <dvhltc.com> -
> --- Comment #19 from Sripathi Kodi <sripathik.com> 2009-02-02 11:27:16 EST ---
> As another data point, I disabled HT in the BIOS. This means we will have only
> 8 cpus.
> With this, matrix_mult passes as expected, with a gain of around 8.5
> against expected 6:
>
> Running sequential operations
> Min: 97205 us
> Max: 97358 us
> Avg: 97222.0000 us
> StdDev: 15.4834 us
>
> Min: 91079 us
> Max: 93399 us
> Avg: 11447.0000 us
> StdDev: 518.4039 us
>
> Concurrent Multipliers:
> Min: 1.0673
> Max: 1.0424
> Avg: 8.4932

One wonders how 8 CPUs can do 8.5 times the work of 1 CPU... I suppose the single-CPU run may incur overhead that doesn't scale up with CPUs, so if CPU-intensive tasks are well scheduled, they should be able to attain a slightly higher concurrency factor than a simple NR_CPUS multiply would suggest, under the right conditions.

> The numbers are almost identical to the test being run on just 8 cpus using
> taskset as mentioned in comment #5. With this, we can safely conclude that the
> problem doesn't have anything to do with Hyper Threading, but to do with the
> increased number of cpus.
>
> We will now have to treat this as a generic scalability problem.

Agreed.

=Comment: #21=================================================
Sripathi Kodi <sripathik.com> -
(In reply to comment #17)
> If I'm reading the oprofile output correctly, we have an order of magnitude
> more samples in the kernel than we do in userspace, and 24% of the time in the
> kernel is spent reading the hpet? The userspace side looks on track, with 98+%
> spent running the test. How does this compare to when we are running with only
> 8 CPUs?

I tried this on an HS21XM today (elm3c190) and the results are pretty interesting.
vmlinux doesn't show up as much as it does on the Nehalem machines:

[root@elm3c190 profile]# head opreport_long
CPU: Core 2, speed 2833.42 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
CPU_CLK_UNHALT...|
 samples|      %|
------------------
   53250 62.0572 /home/rtuser/ltp-full-20081231/testcases/realtime/func/matrix_mult/matrix_mult
   25447 29.6557 /usr/lib/debug/lib/modules/2.6.24.7-100.el5rt/vmlinux
    4111  4.7909 /vmlinux-unknown
    1905  2.2201 /ipmi_si
     404  0.4708 /lib64/libc-2.5.so

[root@elm3c190 profile]# head opreport_matrix_mult
CPU: Core 2, speed 2833.42 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
52465    98.5258  matrix_mult
777       1.4592  matrix_init
7         0.0131  calibrate_busyloop
1         0.0019  buffer_init

[root@elm3c190 profile]# head opreport_vmlinux
CPU: Core 2, speed 2833.42 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
5156     20.2617  mwait_idle
1809      7.1089  find_busiest_group
1468      5.7689  source_load
1466      5.7610  read_tsc
1216      4.7786  __schedule
1107      4.3502  cpupri_set
998       3.9219  apic_timer_interrupt

Also, read_hpet is nowhere to be seen in the full vmlinux list. Assuming the oprofile results seen on Nehalem so far can be trusted, this surely points to some bottleneck in the kernel.

=Comment: #25=================================================
Sripathi Kodi <sripathik.com> -
Note to Red Hat:
---------------------
We suspect this bug shows problems with the RT kernel scaling to 16 cpus. It is unlikely to be a problem with HT.
I am working on Vernon's observation that the test performs much worse at 9 cpus than at 8 cpus. With 9 cpus, the concurrent average is usually around 5.3, whereas with 8 cpus it is around 8.5. This is a HUGE drop in performance when the 9th cpu is added.

The following analysis was done with the system booted with maxcpus=9. This enables all cores of the system and 1 extra hardware thread. In the concurrent part, we measure the time at the beginning and end of the concurrent operations from the main_thread() function. We don't use the average time spent by individual threads, as this comment clearly indicates:

cavg = delta/iterations; /* don't use the stats record, use the total time recorded */

Out of curiosity, I printed the average concurrent time calculated as an average of the time taken by the individual threads:

printf("Avg from stats = %.4f us\n", stats_avg(&cdat)/numcpus);

This turned out to be far lower than cavg as calculated by the test. For example, a typical 9-cpu run shows:

Avg: 18321.0000 us
Avg from stats = 11957.8184 us

In fact, using the second value makes the test PASS :-) This indicates one of the following:
1) There is some error in calculating the concurrent time spent in main_thread().
2) One of the threads takes significantly more time than the rest. This pulls up the concurrent average in both cases, but it affects the main_thread() calculation far more, because main_thread() waits until all threads are finished before recording the time.

More later.

Okay, here is more interesting stuff. I am still running with maxcpus=9, so the following are enabled:
all 4 cores on the first socket
all 4 cores on the second socket
1 extra thread on the first core of the first socket. This is logical cpu 8.

I printed out the time taken by each concurrent multiplier by making some changes to the test case. I also print the cpuid and index. This index is the index in cdat.records[] that corresponds to this particular thread, and will help me figure out which thread is running on the second hardware thread.
Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 15
cpuid = 2, Index = 60
cpuid = 3, Index = 75
cpuid = 4, Index = 105
cpuid = 5, Index = 120
cpuid = 6, Index = 30
cpuid = 7, Index = 45
cpuid = 8, Index = 90   <=== So data at index 90 is from the second thread

Thread 0: tsum = 2475710, Avg = 165047 us   <=== High value
Thread 15: tsum = 1368897, Avg = 91259 us
Thread 30: tsum = 1368771, Avg = 91251 us
Thread 45: tsum = 1371082, Avg = 91405 us
Thread 60: tsum = 1368681, Avg = 91245 us
Thread 75: tsum = 1368900, Avg = 91260 us
Thread 90: tsum = 2475534, Avg = 165035 us   <=== High value
Thread 105: tsum = 1369101, Avg = 91273 us
Thread 120: tsum = 1368814, Avg = 91254 us

Min: 91079 us
Max: 167756 us
Avg: 18339.0000 us
Avg from stats = 11963.3662 us
StdDev: 30672.6387 us

The above sequence repeats very consistently. Two threads always report very high numbers: thread 0 and whichever thread is on the second hardware thread. This possibly means HT is involved; I need to confirm by running on 16 cpus and similar configurations. In case there is any interest in this debug version of the test case, I have put it here: http://kernel.beaverton.ibm.com/~sripathi/51189/matrix_mult_debug_v1.c
I booted the system with maxcpus=12. From /proc/cpuinfo, I determined that this gives us the following pattern:

cpu0, core0: 0, 8
cpu0, core1: 2, 10
cpu0, core2: 4
cpu0, core3: 6
cpu1, core0: 1, 9
cpu1, core1: 3, 11
cpu1, core2: 5
cpu1, core3: 7

With this, I observed that cpus 4, 5, 6 and 7 always report noticeably better (lower) numbers than the other cpus. This is now pointing at HT as the culprit! Notice the numbers below; this pattern repeats every time.

Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 11
cpuid = 2, Index = 44
cpuid = 3, Index = 66
cpuid = 4, Index = 88
cpuid = 5, Index = 110
cpuid = 6, Index = 22
cpuid = 7, Index = 55
cpuid = 8, Index = 33
cpuid = 9, Index = 77
cpuid = 10, Index = 99
cpuid = 11, Index = 121

Thread 0: tsum = 1788118, Avg = 162556 us
Thread 11: tsum = 1818666, Avg = 165333 us
Thread 22: tsum = 1004678, Avg = 91334 us   <== cpu 6
Thread 33: tsum = 1784692, Avg = 162244 us
Thread 44: tsum = 1816586, Avg = 165144 us
Thread 55: tsum = 1004661, Avg = 91332 us   <== cpu 7
Thread 66: tsum = 1814128, Avg = 164920 us
Thread 77: tsum = 1818686, Avg = 165335 us
Thread 88: tsum = 1004710, Avg = 91337 us   <== cpu 4
Thread 99: tsum = 1816600, Avg = 165145 us
Thread 110: tsum = 1004564, Avg = 91324 us   <== cpu 5
Thread 121: tsum = 1814040, Avg = 164912 us
I tested this with maxcpus=15. With this, the only cpu which did not have its HT sibling enabled, cpu7, turned in much better numbers than all other cpus.

I then enabled all 16 cpus on the system (all threads enabled). With this, I see very consistent numbers across all cpus. As a result, the two ways of calculating averages return near-identical numbers:

Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 8
cpuid = 2, Index = 32
cpuid = 3, Index = 48
cpuid = 4, Index = 64
cpuid = 5, Index = 88
cpuid = 7, Index = 24
cpuid = 8, Index = 72
cpuid = 6, Index = 112
cpuid = 9, Index = 96
cpuid = 10, Index = 120
cpuid = 11, Index = 16
cpuid = 12, Index = 40
cpuid = 13, Index = 56
cpuid = 14, Index = 80
cpuid = 15, Index = 104

Thread 0: tsum = 1327000, Avg = 165875 us
Thread 8: tsum = 1320238, Avg = 165029 us
Thread 16: tsum = 1320609, Avg = 165076 us
Thread 24: tsum = 1327789, Avg = 165973 us
Thread 32: tsum = 1321766, Avg = 165220 us
Thread 40: tsum = 1323060, Avg = 165382 us
Thread 48: tsum = 1320673, Avg = 165084 us
Thread 56: tsum = 1322942, Avg = 165367 us
Thread 64: tsum = 1323120, Avg = 165390 us
Thread 72: tsum = 1326753, Avg = 165844 us
Thread 80: tsum = 1317434, Avg = 164679 us
Thread 88: tsum = 1322985, Avg = 165373 us
Thread 96: tsum = 1320192, Avg = 165024 us
Thread 104: tsum = 1327635, Avg = 165954 us
Thread 112: tsum = 1317665, Avg = 164708 us
Thread 120: tsum = 1321815, Avg = 165226 us

Min: 164170 us
Max: 168977 us
Avg: 10375.0000 us              <======
Avg from stats = 10332.8496 us  <======
StdDev: 1133.8242 us

Notice that the two averages are almost equal. I feel this suggests that the way the test case calculates times and averages is alright.

Now, some conclusions:
=====================
1) The problem is surely related to Hyper-Threading.
2) Interesting observations from the numbers:
a) From comment #32, with maxcpus=9, among individual cpus, the high values were in the 247xxxx range, whereas the low values were in the 136xxxx range.
b) From comment #33, with maxcpus=12, among individual cpus, the high values were in the 181xxxx range, whereas the low values were in the 100xxxx range.
c) From comment #34, with maxcpus=16, among individual cpus, all values were near 132xxxx.

This seems to indicate that enabling more and more threads improves the worst-case performance on each core. I am not sure whether this can be attributed to something in the hardware or something in the scheduler design.

What next:
==========
1) We need to test with the latest available kernels. I expect the behavior hasn't changed.
2) I will chat with some scheduler folks tomorrow to see whether they think this needs to be looked at from the scheduler point of view.
3) If the above two go as predicted, we will conclude that disabling HT is the way to resolve this problem.
Sripathi, excellent analysis. This is what we expected to begin with; however, I am curious why the earlier tests using taskset to select the first 4 cpus and the first 4 threads, leaving the last 4 cores completely unused, didn't result in higher numbers as well. Is the system perhaps ignoring our request and distributing the threads to real cpus first, and then to hardware threads, despite the taskset setting?

Also, according to Dinakar, the Intel specs suggest a 20-30% improvement with threads for *certain* workloads. You observed a single cpu take ~90ms, while with both hardware threads active each took ~160ms. So what a single cpu can do in 180ms (90ms*2), a threaded CPU can do in 160ms. (180-160)/180 = 11% performance improvement. Given the variance in benchmarks and the cpu-intensive nature of this one, I think an 11% performance improvement is pretty well in line with what we should expect.
(In reply to comment #36)
> Sripathi, excellent analysis. This is what we expected to begin with, however
> I am curious why the earlier tests of using taskset to use the first 4 cpus and
> the first 4 threads, leaving the last 4 cores completely unused, didn't result
> in higher numbers as well. Is the system perhaps ignoring our request and
> distributing the threads to real cpus first, and then to threads despite the
> taskset setting?

1) taskset -c 0,1,2,3,4,5,6,7 chooses the first threads of all cores on both sockets in the system.
2) taskset -c 0,2,4,6,8,10,12,14 chooses all cores and threads on the 1st socket and nothing from the second socket.
3) taskset -c 1,3,5,7,9,11,13,15 chooses all cores and threads on the 2nd socket and nothing from the first socket.

Thinking about this now, the test should have FAILed with options 2 and 3 above. I suspect this has to do with how we set cpu affinity in the test case. I'll keep this in mind when I look at the test case again today.
We had a discussion with Vaidy and Bharata here today about the parts played by the hardware and the scheduler in this problem. We are curious why the worst-case performance on any core improves as more and more threads are enabled on other cores. Kiran is going to do the following to further understand what is going on here:

1) Test with maxcpus=8 and with HT disabled in the BIOS, and compare the two results. We are not sure whether the unused threads in the cores can in some way disrupt the functioning of the active thread. This test will confirm it.
2) Confirm that the setaffinity calls in the test are really working and there are no ping-pong effects.
3) Verify whether the method used for data collection in my debug code is correct.
4) Try to modify the test case to avoid any shared memory references between different threads. Currently the result array is shared across threads, which can lead to false cache-line sharing.
Just another data point: the behaviour of the test is exactly the same on 2.6.29-rc4-rt2-tip.
Node: elm3c175   Machine: HS22 Piranha   OS: SLERT10GMC

--- Running testcase matrix_mult -c 0.75 ---
Wed Feb 18 18:42:30 CST 2009
Logging to /test/gowri/slert/linux-rt-tests/ltp/testcases/realtime/logs/elm3c175-x86_64-2.6.22.19-0.14-rt-2009-18-02-matrix_mult-c0.75.log
---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------
Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16

Running sequential operations
Min: 58035 us
Max: 58520 us
Avg: 58095.0000 us
StdDev: 86.2324 us

Running concurrent operations
Min: 98162 us
Max: 102970 us
Avg: 6206.0000 us
StdDev: 1374.6295 us

Concurrent Multipliers:
Min: 0.5912
Max: 0.5683
Avg: 9.3611

Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL

Wed Feb 18 18:42:39 CST 2009
The matrix_mult test appears to have completed.

------------------------------------------------------------------------------------------
This test also fails on the same platform with the MRG 101rt kernel
------------------------------------------------------------------------------------------
OS: MRG101rt

--- Running testcase matrix_mult -c 0.75 ---
Fri Feb 13 19:27:59 EST 2009
Logging to /test/amrith/101/linux-rt-tests/ltp/testcases/realtime/logs/elm3c175-x86_64-2.6.24.7-101.el5rt-2009-13-02-matrix_mult-c0.75.log
---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------
Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16

Running sequential operations
Min: 58302 us
Max: 58751 us
Avg: 58351.0000 us
StdDev: 78.8359 us

Running concurrent operations
Min: 97634 us
Max: 101083 us
Avg: 6202.0000 us
StdDev: 751.6913 us

Concurrent Multipliers:
Min: 0.5971
Max: 0.5812
Avg: 9.4084

Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL

Fri Feb 13 19:28:07 EST 2009
The matrix_mult test appears to have completed.
Created attachment 332525 [details] Piranha (HS22) system info
I was playing around with the 2.6.29-rc6-tip kernel for bug 51340 and collected some statistics using oprofile and performance counters (Ingo's new stuff) on this machine. The reported matrix_mult scalability problem can be recreated on the mainline kernel as well, so the numbers I obtained are relevant for this bug too. I ran the test with 8 and 9 cpus and compared the results. The only thing that stood out was the number of "cache references" reported by Ingo's perfstat: this number was significantly higher with 9 cpus than with 8. Example:

8 cpus
----------
5915822.044589  task clock ticks  (msecs)
33382919396  instructions         (events)
864042  cache references          (events)  <=========
120540  cache misses              (events)
1206357996  branches              (events)
12170749  branch misses           (events)
22310884840  bus cycles           (events)
1504  pagefaults                  (events)
7  context switches               (events)
5  CPU migrations                 (events)
Wall-clock time elapsed: 8397.334568 msecs

9 cpus
-------
244423.718692  task clock ticks   (msecs)
32923054623  instructions         (events)
5398471  cache references         (events)  <=========
116694  cache misses              (events)
1190252003  branches              (events)
11485947  branch misses           (events)
23563660280  bus cycles           (events)
1503  pagefaults                  (events)
5  context switches               (events)
5  CPU migrations                 (events)
Cache references go up by an order of magnitude, but cache misses actually go down.... hrmm.. are we confident that the data collected is correct?
(In reply to comment #46) > Cache references go up by an order of magnitude, but cache misses actually go > down.... hrmm.. are we confident that the data collected is correct? The procedure to use perfstat.c was pretty simple, so I am unlikely to have gone wrong with that. I don't know how accurate perfstat is, though.
I did a couple of tests as suggested by Sripathi in comment #40.

1) I tested with HT enabled and maxcpus=8. The problem does not occur. I repeated it with HT disabled and the results are similar to the HT-enabled case; the problem does not occur. This confirms that unused threads in the cores do not disrupt the functioning of the active thread.

2) I booted the system with maxcpus=9. I printed out the cpu ids on which the threads are running, and these ids remain the same throughout. I also checked the output of top -H: there is a uniform load on the threads throughout and no sudden changes in load. Thus it can be confirmed that the setaffinity calls are working fine and there is no ping-ponging of threads.

3) I modified the test case to avoid any shared memory references between different threads: I created a separate results array for each thread, so the threads do not access a shared array. The test still fails in this case, but the results are much better than before. The results for maxcpus=9:

Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 30
cpuid = 2, Index = 45
cpuid = 3, Index = 60
cpuid = 4, Index = 75
cpuid = 5, Index = 90
cpuid = 6, Index = 105
cpuid = 7, Index = 120
cpuid = 8, Index = 15

Thread 7: tsum = 95933, Avg = 6395 us
Thread 3: tsum = 95919, Avg = 6394 us
Thread 4: tsum = 95944, Avg = 6396 us
Thread 5: tsum = 95940, Avg = 6396 us
Thread 6: tsum = 95950, Avg = 6396 us
Thread 2: tsum = 95933, Avg = 6395 us
Thread 1: tsum = 95939, Avg = 6395 us
Thread 8: tsum = 169954, Avg = 11330 us
Thread 0: tsum = 170074, Avg = 11338 us

Concurrent Multipliers:
Avg: 5.2216
Is there anything else we want to try here? The only thing I can think of is adjusting our test cases to account for HT-type CPUs, but I don't think that's really worth it, as we can just turn HT off in the BIOS. Can we close this out as NOT_A_BUG?
Today I verified that turning off HT solves this problem on Piranha (elm3c175) as well. We are done as far as this bug is concerned. The results that Kiran has put together with a modified test case are interesting for understanding how this hardware works, but they have nothing to do with this bug directly. Hence we are going to close this bug. I am rejecting it as ALT_SOLUTION_AVAIL, the alternative solution being to turn off HT on Nehalem-based machines.