Bug 484209 - [DESTINY] Matrix_mult fails on (x3550 M2)
[DESTINY] Matrix_mult fails on (x3550 M2)
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel (Show other bugs)
1.1
x86_64 All
low Severity medium
: ---
: ---
Assigned To: Red Hat Real Time Maintenance
David Sommerseth
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2009-02-05 08:20 EST by IBM Bug Proxy
Modified: 2016-05-22 19:27 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-09-12 15:20:00 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Piranha (HS22) system info (94.14 KB, application/octet-stream)
2009-02-19 05:40 EST, IBM Bug Proxy
no flags Details

  None (edit)
Description IBM Bug Proxy 2009-02-05 08:20:35 EST
=Comment: #0=================================================
Amrith Anandan <amrith.anandan@in.ibm.com> - 

OS: MRG
H/W : (x3650 M2) Victory

--- Running testcase matrix_mult -c 0.75  ---
Fri Jan  9 17:24:33 PST 2009
Logging to
/home/cheezo/linux-rt-tests/ltp/testcases/realtime/logs/elm3b120-x86_64-2.6.24.7-94.el5rt-2009-09-01-matrix_mult-c0.75.log

---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------

Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16

Running sequential operations
Min: 58274 us
Max: 58503 us
Avg: 58304.0000 us
StdDev: 49.1937 us
Min: 96340 us
Max: 101415 us
Avg: 6220.0000 us
StdDev: 974.8007 us

Concurrent Multipliers:
Min: 0.6049
Max: 0.5769
Avg: 9.3736

Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL

Fri Jan  9 17:24:41 PST 2009
The matrix_mult test appears to have completed.

=Comment: #1=================================================
Amrith Anandan <amrith.anandan@in.ibm.com> - 
16x2.67 GHz (2 Socket, Intel Xeon Quad Core 2 threads/core Nehalem) 

=Comment: #3=================================================
Dinakar Guniguntala <dino@in.ibm.com> - 
So I was going through the Intel Nehalem manuals today and it looks like with
HT turned on, we can expect a perf improvement in the range of 20% - 30% given
the right application. The results indicate ~5% improvement in performance,
which is expected as matrix mult is pretty cpu intensive.

So I think we can do one of 2 things
1. Turn off HT in the bios and rerun the tests. Matrix mult should pass in this
case
2. Update matrix mult testcase to check for threading and have a lower
threshold.

I think we should go with 1 for now to confirm if HT is the issue.

=Comment: #5=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
Initial observations:

This machine has 16 virtual cpus, of which:
2 threads per cpu
4 cores per socket
2 sockets.

Test run as-is: FAIL.

Then I changed the test case to make numcpus=8 (it is sysconf(_SC_NPROCESSORS_ONLN) by default).
That is, it will only run on 8 cpus. I then ran it on different combination of 8 cpus.

taskset -c 1,3,5,7,9,11,13,15 ./matrix_mult : PASS  (on 1st thread of every cpu core)
taskset -c 0,2,4,6,8,10,12,14 ./matrix_mult : PASS  (on 2nd thread of every cpu core)
taskset -c 0,1,2,3,4,5,6,7 ./matrix_mult : PASS  (on first 8 cpus, which are all on the same socket)

So it looks like this has got to do with the number of cpus being increased to 16. I will do the
following:

1) Compare the results of above runs
2) Look closely at the test case to see whether it needs changes to run on 16 cpus.

=Comment: #6=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
OK, here is some more analysis.

With modified test case that spawns only 8 threads:

taskset -c 1,3,5,7,9,11,13,15 ./matrix_mult
Concurrent multipliers: Avg: 8.4955 (It was always very close to 8.5)

taskset -c 0,2,4,6,8,10,12,14 ./matrix_mult
Concurrent multipliers: Avg: 8.4969 (Again, very close to 8.5)

taskset -c 0,1,2,3,4,5,6,7 ./matrix_mult
Concurrent Multipliers: Avg: 8.4936 (Same as the above two)

So the above suggests that HT does not have much effect in this configuration/testcase. Also, the
last test suggests that the problem may not be due to cache line bouncing between the two cpus, but
due to contention for some locks as the number of cpus  increases.

Also, with the default 16 threads, increasing the load (-l) up to 128 did not seem to make any
difference on the numbers. I wonder if that option works at all.

=Comment: #7=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
Yet another interesting data point. I deployed another 16 cpu x86_64 machine (elm3a176). This
machine has  4 sockets, dual core, 2 threads per core. Even on this machine, matrix_mult failed.
Concurrent multiplier average was in the range of ~ 9.9 on this machine. Interestingly, running with
8 threads on 8 cpus also fails on this machine!

Ideally I should have tried it on a 16 cpu machine with no HT. However, all data so far is pointing
towards some lock contention on high number of cpus.

Next step is to oprofile and see what it shows up. Before that, I wish to try a few things on bug
51071 (Calibrate problem) to see if that too exhibits similar problems.

=Comment: #8=================================================
Dinakar Guniguntala <dino@in.ibm.com> - 
> So the above suggests that HT does not have much effect in this
> configuration/testcase. Also, the last test suggests that the problem may not
> be due to cache line bouncing between the two cpus, but due to contention for
> some locks as the number of cpus  increases.

I would suggest turning off HT in bios and then running your tests before you make
any conclusions here. That is definitely a data point we need to figure out what exactly
is going on 

=Comment: #9=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
I booted elm3a112 (Victory) with maxcpus=8. This essentially shuts off the 8 virtual cpus in the
second socket. With this, I have seen the test always pass.

=Comment: #10=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
> I booted elm3a112 (Victory) with maxcpus=8. This essentially shuts off the 8
> virtual cpus in the second socket. With this, I have seen the test always pass.

That is weird, you also said:

> taskset -c 0,1,2,3,4,5,6,7 ./matrix_mult : PASS  (on first 8 cpus, which are
> all on the same socket)

So both of those statements actually argue against HT being the problem, since even at 8way we'd
expect problems if we're counting hardware threads as CPUs.  And the concurrent multipliers didn't
seems to change between runs on completely separate cores vs. those using the hardware threads.

=Comment: #11=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
Yes, all my testing so far has suggested that the problem could more likely be due to increased
number of cpus, rather than HT.

=Comment: #12=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
I tried to use oprofile on this machine to understand where all the time is being spent. However,
oprofile doesn't seem to work on this machine with the error "opreport error: No sample file found:
try running opcontrol --dump or specify a session containing sample files". We have seen such bugs
in the past (bug 41292, bug 43249, bug 45219). The error probably means the new Nehalem cpu isn't
supported by the MRG kernel.

I then passed nmi_watchdog=1 to the kernel and this got oprofile working. However, IIRC this means
oprofile is using a low sampling rate, which makes it not very useful.

I searched in Linus' kernel git tree for the string "Nehalem" and found some patches. We might need
to backport some of these patches to get oprofile working on this machine. I will raise a new bug
for this.

=Comment: #14=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
Going through the matrix_mult test case, I feel this should be a reasonable test even on a non-rt
system. I ran the test on vanilla kernels (RHEL5.2 kernel, 2.6.28 based kernel) and sure enough, it
passes, though with a slightly lower performance compared to real-time kernels. I then ran 2.6.28.2
kernel on Victory (elm3a112) and saw that the test fails. It produces numbers very close to the MRG
kernels.

This is another indication that the problem here may be due to inability of the test to perform well
on a machine with high number of cpus. The problem could be with the test case too!

I have gotten oprofile to work on 2.6.28.2 kernel, not yet on MRG kernel. Please see bug 51340 for
details about that. I will run the test through oprofile and see if I can gather more information.

=Comment: #15=================================================
Dinakar Guniguntala <dino@in.ibm.com> - 
These are Destiny systems (x3550 M2) and not Victory (x3650 M2)

=Comment: #16=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
Using Kiran's work in bug 51340, we got a reasonably working oprofile on MRG kernels. I say
reasonably because oprofile only seems to work once. The second time I try to start it, the system
crashes/hangs. Highlights from the profile look like this:

opreport --long-filenames
------------
CPU: Intel Architectural Perfmon, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit
mask) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
   636645 88.3410 /usr/src/redhat/BUILD/kernel-rt-2.6.24.7-100/linux-2.6.24.7.x86_64_kiran/vmlinux
    72667 10.0833 /home/rtuser/ltp-full-20081231/testcases/realtime/func/matrix_mult/matrix_mult
     2412  0.3347 /lib64/libc-2.5.so
     2220  0.3080 /usr/local/bin/oprofiled
     1566  0.2173 /ehci_hcd
    <snip>

opreport of vmlinux
--------------------------
CPU: Intel Architectural Perfmon, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit
mask) count 100000
samples  %        symbol name
160248   24.5952  read_hpet
45880     7.0418  apic_timer_interrupt
42684     6.5512  find_busiest_group
35257     5.4113  __schedule
34376     5.2761  acpi_safe_halt
30016     4.6069  cpupri_set
12804     1.9652  source_load
<snip>

opreport of matrix_mult
-------------------------------
CPU: Intel Architectural Perfmon, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit
mask) count 100000
samples  %        symbol name
71561    98.4780  matrix_mult
1098      1.5110  matrix_init
7         0.0096  calibrate_busyloop
1         0.0014  concurrent_thread

I will attach the output of "opannotate --assembly --source <vmlinux>" to this bug.

=Comment: #17=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
If I'm reading the oprofile output correctly, we're spending we have an order of magnitude more
samples in the kernel than we do in userspace, and 24% of the time in the kernel is spent reading
the hpet?  The userspace side looks on track, with 98+% spent running the test.  How does this
compare to when we are running with only 8 CPUs?

=Comment: #18=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
opannotate output was too big to be attached. I have put it here:
http://kernel.beaverton.ibm.com/~sripathi/51189/opannotate_vmlinux.txt

=Comment: #19=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
As another data point, I disabled HT in the BIOS. This means we will have only 8 cpus. With this,
matrix_mult passes as expected, with a gain of around 8.5 against expected 6:

Running sequential operations
Min: 97205 us
Max: 97358 us
Avg: 97222.0000 us
StdDev: 15.4834 us
Min: 91079 us
Max: 93399 us
Avg: 11447.0000 us
StdDev: 518.4039 us

Concurrent Multipliers:
Min: 1.0673
Max: 1.0424
Avg: 8.4932

The numbers are almost identical to the test being run on just 8 cpus using taskset as mentioned in
comment #5. With this, we can safely conclude that the problem doesn't have anything to do with
Hyper Threading, but to do with the increased number of cpus.

We will now have to treat this as a generic scalability problem.

=Comment: #20=================================================
Darren V. Hart <dvhltc@us.ibm.com> - 
> --- Comment #19 from Sripathi Kodi <sripathik@in.ibm.com>  2009-02-02 11:27:16 EST ---
> As another data point, I disabled HT in the BIOS. This means we will have only
> 8 cpus. With this, matrix_mult passes as expected, with a gain of around 8.5
> against expected 6:
> 
> Running sequential operations
> Min: 97205 us
> Max: 97358 us
> Avg: 97222.0000 us
> StdDev: 15.4834 us
> Min: 91079 us
> Max: 93399 us
> Avg: 11447.0000 us
> StdDev: 518.4039 us
> 
> Concurrent Multipliers:
> Min: 1.0673
> Max: 1.0424
> Avg: 8.4932

One wonders how 8 CPUs can do 8.5 the work of 1 CPU... I suppose the 1 CPU may take more overhead
that doesn't scale up with CPUs, so if CPU intensive tasks are well scheduled, they should be able
to attain a slightly higher concurrency factor than a simple NR_CPUS multiply would provide under
the right conditions.

> The numbers are almost identical to the test being run on just 8 cpus using
> taskset as mentioned in comment #5. With this, we can safely conclude that the
> problem doesn't have anything to do with Hyper Threading, but to do with the
> increased number of cpus.
> 
> We will now have to treat this as a generic scalability problem.

Agreed

=Comment: #21=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 
(In reply to comment #17)
> If I'm reading the oprofile output correctly, we're spending we have an order
> of magnitude more samples in the kernel than we do in userspace, and 24% of the
> time in the kernel is spent reading the hpet?  The userspace side looks on
> track, with 98+% spent running the test.  How does this compare to when we are
> running with only 8 CPUs?

I tried this on an HS21XM today (elm3c190) and the results are pretty interesting. vmlinux doesn't
show up as much as it does on Nehalem machines:

[root@elm3c190 profile]# head opreport_long 
CPU: Core 2, speed 2833.42 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted
core cycles) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
    53250 62.0572 /home/rtuser/ltp-full-20081231/testcases/realtime/func/matrix_mult/matrix_mult
    25447 29.6557 /usr/lib/debug/lib/modules/2.6.24.7-100.el5rt/vmlinux
     4111  4.7909 /vmlinux-unknown
     1905  2.2201 /ipmi_si
      404  0.4708 /lib64/libc-2.5.so

[root@elm3c190 profile]# head opreport_matrix_mult 
CPU: Core 2, speed 2833.42 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted
core cycles) count 100000
samples  %        symbol name
52465    98.5258  matrix_mult
777       1.4592  matrix_init
7         0.0131  calibrate_busyloop
1         0.0019  buffer_init

[root@elm3c190 profile]# head opreport_vmlinux 
CPU: Core 2, speed 2833.42 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted
core cycles) count 100000
samples  %        symbol name
5156     20.2617  mwait_idle
1809      7.1089  find_busiest_group
1468      5.7689  source_load
1466      5.7610  read_tsc
1216      4.7786  __schedule
1107      4.3502  cpupri_set
998       3.9219  apic_timer_interrupt

Also, read_hpet is nowhere to be seen in the full list of vmlinux.

Assuming the oprofile results seen on Nehalem so far can be trusted, this surely points to some
bottleneck in the kernel.

=Comment: #25=================================================
Sripathi Kodi <sripathik@in.ibm.com> - 

Note to Red Hat:
---------------------
We are suspecting this bug to be showing problems with RT kernel scaling to 16 cpus. It is unlikely
to be a problem with HT.
Comment 1 IBM Bug Proxy 2009-02-16 08:17:31 EST
I am working on Vernon's observations that the test performs much worse at 9 cpus than at 8 cpus. With 9 cpus, concurrent average is usually around 5.3, whereas with 8 cpus it is around 8.5. This is a HUGE drop in performance when the 9th cpu is added.

The following analysis was done with the system booted with maxcpus=9. This enables all cores of the system and 1 HT.

In the concurrent part, we measure the time at the beginning and end of concurrent operations from main_thread() function. We don't use the average time spent by individual threads as this comment clearly indicates:

cavg = delta/iterations; /* don't use the stats record, use the total time recorded */

Out of curiosity, I printed the average concurrent time taken calculated as an average of time taken by individual threads:

printf("Avg from stats = %.4f us\n", stats_avg(&cdat)/numcpus);

This turned out to be far lower than cavg as calculated by the test. For example, a typical 9-cpu run shows:

Avg: 18321.0000 us
Avg from stats = 11957.8184 us

In fact using the second value makes the test PASS :-)

This indicates either of the following:
1) There is some error in calculating concurrent time spent by main_thread()
2) One of the threads take a significantly more time than the rest of the threads. This pulls up the concurrent average in both cases, but it affects time calculation as it is now much more, because main_thread() waits till all threads are finished before recording time.

More later.


Okay, here is more interesting stuff:

I am still running with maxcpus=9, so it has enabled:
all 4 cores on first socket
all 4 cores on second socket
1 thread on first core of first socket. This is logical cpu 8.

I printed out the time taken by each concurrent multiplier by making some changes to the test case. I also print the cpuid and index. This index is the index in cdat.records[] that corresponds to this particular thread. This will help me figure out which thread is running on the second hardware thread.

Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 15
cpuid = 2, Index = 60
cpuid = 3, Index = 75
cpuid = 4, Index = 105
cpuid = 5, Index = 120
cpuid = 6, Index = 30
cpuid = 7, Index = 45
cpuid = 8, Index = 90  <=== So data at index 90 is from the second thread
Thread 0: tsum = 2475710, Avg = 165047 us  <=== High value
Thread 15: tsum = 1368897, Avg = 91259 us
Thread 30: tsum = 1368771, Avg = 91251 us
Thread 45: tsum = 1371082, Avg = 91405 us
Thread 60: tsum = 1368681, Avg = 91245 us
Thread 75: tsum = 1368900, Avg = 91260 us
Thread 90: tsum = 2475534, Avg = 165035 us  <=== High value
Thread 105: tsum = 1369101, Avg = 91273 us
Thread 120: tsum = 1368814, Avg = 91254 us

Min: 91079 us
Max: 167756 us
Avg: 18339.0000 us
Avg from stats = 11963.3662 us
StdDev: 30672.6387 us

The above sequence repeats very consistently. Two threads always report very high numbers: thread 0 and whichever thread is on the second hardware thread. This possibly means HT is involved. I need to confirm by running on 16 cpus and such.

In case there is any interest in this debug version of test case, I have put it here: http://kernel.beaverton.ibm.com/~sripathi/51189/matrix_mult_debug_v1.c
Comment 2 IBM Bug Proxy 2009-02-16 09:41:03 EST
I booted the system with maxcpus=12 cpus. From /proc/cpuinfo, I determined that this give us the following pattern:

cpu0, core0: 0, 8
cpu0, core1: 2, 10
cpu0, core2: 4
cpu0, core3: 6

cpu1, core0: 1, 9
cpu1, core1: 3, 11
cpu1, core2: 5
cpu1, core3: 7

With this, I observed that cpus 4, 5, 6 and 7 always report noticeably better (lower) numbers compared to the other cpus. This is now pointing at HT being the culprit!

Notice the numbers below. This pattern repeats every time.

Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 11
cpuid = 2, Index = 44
cpuid = 3, Index = 66
cpuid = 4, Index = 88
cpuid = 5, Index = 110
cpuid = 6, Index = 22
cpuid = 7, Index = 55
cpuid = 8, Index = 33
cpuid = 9, Index = 77
cpuid = 10, Index = 99
cpuid = 11, Index = 121
Thread 0: tsum = 1788118, Avg = 162556 us
Thread 11: tsum = 1818666, Avg = 165333 us
Thread 22: tsum = 1004678, Avg = 91334 us <== cpu 6
Thread 33: tsum = 1784692, Avg = 162244 us
Thread 44: tsum = 1816586, Avg = 165144 us
Thread 55: tsum = 1004661, Avg = 91332 us <== cpu 7
Thread 66: tsum = 1814128, Avg = 164920 us
Thread 77: tsum = 1818686, Avg = 165335 us
Thread 88: tsum = 1004710, Avg = 91337 us <== cpu 4
Thread 99: tsum = 1816600, Avg = 165145 us
Thread 110: tsum = 1004564, Avg = 91324 us <== cpu 5
Thread 121: tsum = 1814040, Avg = 164912 us
Comment 3 IBM Bug Proxy 2009-02-16 10:11:06 EST
I tested this with maxcpus=15. With this, the only cpu which did not have it's HT enabled, cpu7, turned up much better numbers than all other cpus.

I then enabled all 16 cpus on the system (All threads enabled). With this, I see very consistent numbers across all cpus. Also, as a result, I see that the two ways of calculating averages return near identical numbers:

Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 8
cpuid = 2, Index = 32
cpuid = 3, Index = 48
cpuid = 4, Index = 64
cpuid = 5, Index = 88
cpuid = 7, Index = 24
cpuid = 8, Index = 72
cpuid = 6, Index = 112
cpuid = 9, Index = 96
cpuid = 10, Index = 120
cpuid = 11, Index = 16
cpuid = 12, Index = 40
cpuid = 13, Index = 56
cpuid = 14, Index = 80
cpuid = 15, Index = 104
Thread 0: tsum = 1327000, Avg = 165875 us
Thread 8: tsum = 1320238, Avg = 165029 us
Thread 16: tsum = 1320609, Avg = 165076 us
Thread 24: tsum = 1327789, Avg = 165973 us
Thread 32: tsum = 1321766, Avg = 165220 us
Thread 40: tsum = 1323060, Avg = 165382 us
Thread 48: tsum = 1320673, Avg = 165084 us
Thread 56: tsum = 1322942, Avg = 165367 us
Thread 64: tsum = 1323120, Avg = 165390 us
Thread 72: tsum = 1326753, Avg = 165844 us
Thread 80: tsum = 1317434, Avg = 164679 us
Thread 88: tsum = 1322985, Avg = 165373 us
Thread 96: tsum = 1320192, Avg = 165024 us
Thread 104: tsum = 1327635, Avg = 165954 us
Thread 112: tsum = 1317665, Avg = 164708 us
Thread 120: tsum = 1321815, Avg = 165226 us
Min: 164170 us
Max: 168977 us
Avg: 10375.0000 us <======
Avg from stats = 10332.8496 us <======
StdDev: 1133.8242 us

Notice that the two averages are almost equal. I feel this suggests that the way the test case calculates times and averages is alright.


Now, some conclusions:
=====================

1) The problem surely needs to be related to Hyper Threading.

2) Interesting observation looking at numbers:

a) From comment #32, with maxcpus=9, among individual cpus, high values were in 247xxxx range, whereas low values were in 136xxxx range.
b) From comment #33, with maxcpus=15, among individual cpus, high values were in 181xxxx range, whereas low values were in 100xxxx range.
c) From comment #34, with maxcpus=16, among individual cpus, all values were near 132xxxx.

This seems to indicate that enabling more and more threads is improving the performance on each core. That is, the worst case performance seems to be improving with more and more threads being enabled. I am not sure whether this can be attributed to something in the hardware or something in the scheduler design.

What next:
==========
1) We need to test with the latest available kernels. I guess the behavior hasn't changed.
2) I will chat with some scheduler folks tomorrow to see if they think this needs to be looked at from scheduler point of view.
3) If the above two go as predicted, we will conclude that disabling HT is the way to resolve this problem.
Comment 4 IBM Bug Proxy 2009-02-16 12:01:31 EST
Sripathi, excellent analysis.  This is what we expected to begin with, however I am curious why the earlier tests of using taskset to use the first 4 cpus and the first 4 threads, leaving the last 4 cores completely unused, didn't result in higher numbers as well.  I the system perhaps ignoring our request and distributing the threads to real cpus first, and then to threads despite the taskset setting?

Also, according to Dinakar, the Intel specs suggest a 20-30% improvement with threads for *certain* workloads.  You observed a single cpu take ~90ms while with both hardware threads it would take ~160ms.  So what a single cpu can do in 180ms (90ms*2), a threaded CPU can do in 160ms.  (180-160)/180 = 11% performance improvement.  Given the variance in benchmarks and the cpu intensive nature of this one, I think an 11% performance improvement is pretty well inline with what we should expect.
Comment 5 IBM Bug Proxy 2009-02-17 00:40:42 EST
(In reply to comment #36)
> Sripathi, excellent analysis.  This is what we expected to begin with, however
> I am curious why the earlier tests of using taskset to use the first 4 cpus and
> the first 4 threads, leaving the last 4 cores completely unused, didn't result
> in higher numbers as well.  I the system perhaps ignoring our request and
> distributing the threads to real cpus first, and then to threads despite the
> taskset setting?

1) taskset -c 0,1,2,3,4,5,6,7 chooses first threads of all cores of both sockets in the system.
2) taskset -c 0,2,4,6,8,10,12,14 chooses all cores and threads on 1st socket and nothing from second socket.
3) taskset -c 1,3,5,7,9,11,13,15 chooses all cores and threads on 2nd socket and nothing from first socket.

Thinking about this now, the test should have FAILed with options 2 and 3 above. I suspect it has got to do with how we do cpu affinity in the test case. I'll keep this in mind when I look at the test case again today.
Comment 6 IBM Bug Proxy 2009-02-17 08:00:43 EST
We had a discussion with Vaidy and Bharata here today about the part played by hardware and scheduler in this problem. We are curious about why the worst case performance on any core improves as more and more threads are enabled in other cores.

Kiran is going to do the following to further understand what is going on here:

1) Test with maxcpus=8 and disabling HT in BIOS and compare the two results. We are not sure whether the unused threads in cores can in some way disrupt functioning of active thread. This test will confirm it.
2) Confirm that the setaffinity calls in the test are really working and there are no ping-pong effects.
3) Verify whether the method used for data collection in my debug code is correct.
4) Try to modify the test case to avoid any shared memory references between different threads. Currently result array is shared across threads, which can lead to false cache line sharing.
Comment 7 IBM Bug Proxy 2009-02-18 01:10:32 EST
Just another data point: The behaviour of the test is exactly same on 2.6.29-rc4-rt2-tip.
Comment 8 IBM Bug Proxy 2009-02-19 05:40:49 EST
Node : elm3c175
Machine : HS22 Piranha
OS : SLERT10GMC

--- Running testcase matrix_mult -c 0.75  ---
Wed Feb 18 18:42:30 CST 2009
Logging to /test/gowri/slert/linux-rt-tests/ltp/testcases/realtime/logs/elm3c175-x86_64-2.6.22.19-0.14-rt-2009-18-02-matrix_mult-c0.75.log

---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------

Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16

Running sequential operations
Min: 58035 us
Max: 58520 us
Avg: 58095.0000 us
StdDev: 86.2324 us

Running concurrent operations
Min: 98162 us
Max: 102970 us
Avg: 6206.0000 us
StdDev: 1374.6295 us

Concurrent Multipliers:
Min: 0.5912
Max: 0.5683
Avg: 9.3611

Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL

Wed Feb 18 18:42:39 CST 2009
The matrix_mult test appears to have completed.

------------------------------------------------------------------------------------------
This test also fails on the same platform with MRG 101rt kernel
------------------------------------------------------------------------------------------

OS : MRG101rt

--- Running testcase matrix_mult -c 0.75  ---
Fri Feb 13 19:27:59 EST 2009
Logging to /test/amrith/101/linux-rt-tests/ltp/testcases/realtime/logs/elm3c175-x86_64-2.6.24.7-101.el5rt-2009-13-02-matrix_mult-c0.75.log

---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------

Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16

Running sequential operations
Min: 58302 us
Max: 58751 us
Avg: 58351.0000 us
StdDev: 78.8359 us
Min: 97634 us
Max: 101083 us
Avg: 6202.0000 us
StdDev: 751.6913 us

Concurrent Multipliers:
Min: 0.5971
Max: 0.5812
Avg: 9.4084

Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL

Fri Feb 13 19:28:07 EST 2009
The matrix_mult test appears to have completed.
Comment 9 IBM Bug Proxy 2009-02-19 05:40:56 EST
Created attachment 332525 [details]
Piranha (HS22) system info
Comment 10 IBM Bug Proxy 2009-02-25 12:12:03 EST
I was playing around with 2.6.29-rc6-tip kernel for bug 51340 and I collected some statistics using oprofile and performance counters (Ingo's new stuff) on this machine. The reported matrix_mult scalability problem can be recreated on mainline kernel as well, so the numbers I have obtained are relevant for this bug too.

I ran the test with 8 and 9 cpus and compared the results. The only thing that interested me was the number of "cache references" reported by Ingo's perfstat. This number was significantly higher in the case of 9 cpus as compared to 8 cpus. Example:

8 cpus
----------
5915822.044589  task clock ticks     (msecs)

33382919396  instructions         (events)
864042  cache references     (events)  <=========
120540  cache misses         (events)
1206357996  branches             (events)
12170749  branch misses        (events)
22310884840  bus cycles           (events)
1504  pagefaults           (events)
7  context switches     (events)
5  CPU migrations       (events)

Wall-clock time elapsed:  8397.334568 msecs

9 cpus
-------

244423.718692  task clock ticks     (msecs)

32923054623  instructions         (events)
5398471  cache references     (events)  <=========
116694  cache misses         (events)
1190252003  branches             (events)
11485947  branch misses        (events)
23563660280  bus cycles           (events)
1503  pagefaults           (events)
5  context switches     (events)
5  CPU migrations       (events)
Comment 11 IBM Bug Proxy 2009-02-25 14:42:07 EST
Cache references go up by an order of magnitude, but cache misses actually go down.... hrmm.. are we confident that the data collected is correct?
Comment 12 IBM Bug Proxy 2009-02-26 00:50:41 EST
(In reply to comment #46)
> Cache references go up by an order of magnitude, but cache misses actually go
> down.... hrmm.. are we confident that the data collected is correct?

The procedure to use perfstat.c was pretty simple, so I am unlikely to have gone wrong with that. I don't know how accurate perfstat is, though.
Comment 13 IBM Bug Proxy 2009-02-26 07:00:54 EST
I did a couple of tests as mentioned by Sripathi in comment #40.

1) I tested with ht enabled and maxcpus=8. The problem does not occur.
I repeated it with ht disabled and the results are similar to the case when ht is enabled. The problem does not occur. This confirms that unused threads in the cores do not disrupt the functioning of active thread.

2) I booted the system with maxcpus=9. I printed out the cpu ids on which the threads are running and these ids remain the same throughout. Also checked the output of top -H. There's a uniform load on the threads throughout and no sudden changes in the load. Thus it can be confirmed that setaffinity calls are working fine and there is no ping ponging of the threads.

3) I modified the test case to avoid any shared memory references between different threads. I created a separate results array for each thread to hold the results so the threads do not access the shared array. The test case fails in this case as well but the results are much better than the other results.

The results for maxcpus=9 is

Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 30
cpuid = 2, Index = 45
cpuid = 3, Index = 60
cpuid = 4, Index = 75
cpuid = 5, Index = 90
cpuid = 6, Index = 105
cpuid = 7, Index = 120
cpuid = 8, Index = 15
Thread 7: tsum = 95933, Avg = 6395 us
Thread 3: tsum = 95919, Avg = 6394 us
Thread 4: tsum = 95944, Avg = 6396 us
Thread 5: tsum = 95940, Avg = 6396 us
Thread 6: tsum = 95950, Avg = 6396 us
Thread 2: tsum = 95933, Avg = 6395 us
Thread 1: tsum = 95939, Avg = 6395 us
Thread 8: tsum = 169954, Avg = 11330 us
Thread 0: tsum = 170074, Avg = 11338 us

Concurrent Multipliers
Avg: 5.2216
Comment 14 IBM Bug Proxy 2009-02-26 12:30:48 EST
Is there anything else we want to try and do here?  The only thing I can think of is to adjust our test cases to account for HT type CPUs, but I don't think it's really worth it as we can just turn it off in the bios.  Can we close this out as NOT_A_BUG?
Comment 15 IBM Bug Proxy 2009-02-27 08:03:40 EST
Today I verified that turning off HT solves this problem on Piranha (elm3c175) as well. We are done as far as this bug is concerned. The results that Kiran has put together with a modified test case are interesting in understand how this hardware works, but they have nothing to do with this bug directly. Hence we are going to close this bug.

I am rejecting this bug as ALT_SOLUTION_AVAIL, the alternative solution being turning off HT on Nehalem based machines.

Note You need to log in before you can comment on or make changes to this bug.