Bug 484209
Summary: [DESTINY] Matrix_mult fails on (x3550 M2)

| Field | Value |
|---|---|
| Product | Red Hat Enterprise MRG |
| Component | realtime-kernel |
| Version | 1.1 |
| Hardware | x86_64 |
| OS | All |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | low |
| Reporter | IBM Bug Proxy <bugproxy> |
| Assignee | Red Hat Real Time Maintenance <rt-maint> |
| QA Contact | David Sommerseth <davids> |
| CC | bhu, ovasik, pradeepkumars, williams |
| Doc Type | Bug Fix |
| Last Closed | 2011-09-12 19:20:00 UTC |
Attachments: Piranha (HS22) system info (attachment 332525)
Description (IBM Bug Proxy, 2009-02-05 13:20:35 UTC)
I am working on Vernon's observation that the test performs much worse at 9 cpus than at 8 cpus. With 9 cpus the average concurrent multiplier is usually around 5.3, whereas with 8 cpus it is around 8.5. This is a huge drop in performance when the 9th cpu is added.

The following analysis was done with the system booted with maxcpus=9, which enables all cores of the system plus one HT sibling. In the concurrent part, we measure the time at the beginning and the end of the concurrent operations from the main_thread() function. We do not use the average time spent by the individual threads, as this comment in the code indicates:

```
cavg = delta/iterations; /* don't use the stats record, use the total time recorded */
```

Out of curiosity, I printed the average concurrent time calculated as an average of the time taken by the individual threads:

```
printf("Avg from stats = %.4f us\n", stats_avg(&cdat)/numcpus);
```

This turned out to be far lower than the cavg calculated by the test. For example, a typical 9-cpu run shows:

```
Avg: 18321.0000 us
Avg from stats = 11957.8184 us
```

In fact, using the second value makes the test PASS :-) This indicates one of the following:

1) There is some error in calculating the concurrent time spent in main_thread().
2) One of the threads takes significantly more time than the rest. A slow thread pulls up both averages, but it inflates the main_thread() measurement far more, because main_thread() waits until all threads have finished before recording the time.

More later.

Okay, here is more interesting stuff. I am still running with maxcpus=9, so the following are enabled: all 4 cores on the first socket, all 4 cores on the second socket, and one extra hardware thread on the first core of the first socket. This extra thread is logical cpu 8. I modified the test case to print the time taken by each concurrent multiplier, along with the cpuid and the index. The index is the position in cdat.records[] that corresponds to that particular thread, which lets me figure out which thread is running on the second hardware thread.

```
Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 15
cpuid = 2, Index = 60
cpuid = 3, Index = 75
cpuid = 4, Index = 105
cpuid = 5, Index = 120
cpuid = 6, Index = 30
cpuid = 7, Index = 45
cpuid = 8, Index = 90    <=== So data at index 90 is from the second thread
Thread 0:   tsum = 2475710, Avg = 165047 us    <=== High value
Thread 15:  tsum = 1368897, Avg = 91259 us
Thread 30:  tsum = 1368771, Avg = 91251 us
Thread 45:  tsum = 1371082, Avg = 91405 us
Thread 60:  tsum = 1368681, Avg = 91245 us
Thread 75:  tsum = 1368900, Avg = 91260 us
Thread 90:  tsum = 2475534, Avg = 165035 us    <=== High value
Thread 105: tsum = 1369101, Avg = 91273 us
Thread 120: tsum = 1368814, Avg = 91254 us
Min: 91079 us
Max: 167756 us
Avg: 18339.0000 us
Avg from stats = 11963.3662 us
StdDev: 30672.6387 us
```

The above sequence repeats very consistently. Two threads always report very high numbers: thread 0 and whichever thread is on the second hardware thread. This strongly suggests HT is involved; I need to confirm by running with 16 cpus and so on. In case there is any interest in this debug version of the test case, I have put it here: http://kernel.beaverton.ibm.com/~sripathi/51189/matrix_mult_debug_v1.c
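To illustrate why the two numbers diverge, here is a minimal sketch, not the LTP test source: the per-thread totals are copied from the log above, and 15 records per thread is an assumption. It approximately reproduces both averages.

```c
#include <stdio.h>

int main(void)
{
    /* Per-thread totals (us) copied from the maxcpus=9 log above. */
    long long tsum[] = {2475710, 1368897, 1368771, 1371082, 1368681,
                        1368900, 2475534, 1369101, 1368814};
    int numcpus = sizeof(tsum) / sizeof(tsum[0]);
    int per_thread = 15;   /* records per thread: an assumption      */
    int iterations = 128;  /* total iterations, from the test banner */
    long long wall = 0, total = 0;

    for (int i = 0; i < numcpus; i++) {
        if (tsum[i] > wall)
            wall = tsum[i];   /* main_thread() waits for the slowest */
        total += tsum[i];
    }

    /* Method 1: the test's "Avg" = wall clock of the whole phase
     * divided by iterations; one slow HT sibling stretches the wall. */
    printf("Avg (wall clock) ~ %.4f us\n", (double)wall / iterations);

    /* Method 2: "Avg from stats" = mean of all per-iteration records
     * divided by numcpus; slow threads are averaged out here. */
    double record_mean = (double)total / (numcpus * per_thread);
    printf("Avg from stats   ~ %.4f us\n", record_mean / numcpus);
    return 0;
}
```

The second value lands on 11963.37 us, matching the "Avg from stats" in the log; the first is near the logged 18339 us, the gap being the overlap between thread start times that the max() approximation ignores.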
I booted the system with maxcpus=12. From /proc/cpuinfo, I determined that this gives us the following pattern (socket, core: logical cpus):

```
socket 0, core 0: 0, 8
socket 0, core 1: 2, 10
socket 0, core 2: 4
socket 0, core 3: 6
socket 1, core 0: 1, 9
socket 1, core 1: 3, 11
socket 1, core 2: 5
socket 1, core 3: 7
```

With this, I observed that cpus 4, 5, 6 and 7 (the cores without an active HT sibling) always report noticeably better (lower) numbers than the other cpus. This now points at HT being the culprit! Notice the numbers below; this pattern repeats every time.

```
Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 11
cpuid = 2, Index = 44
cpuid = 3, Index = 66
cpuid = 4, Index = 88
cpuid = 5, Index = 110
cpuid = 6, Index = 22
cpuid = 7, Index = 55
cpuid = 8, Index = 33
cpuid = 9, Index = 77
cpuid = 10, Index = 99
cpuid = 11, Index = 121
Thread 0:   tsum = 1788118, Avg = 162556 us
Thread 11:  tsum = 1818666, Avg = 165333 us
Thread 22:  tsum = 1004678, Avg = 91334 us    <== cpu 6
Thread 33:  tsum = 1784692, Avg = 162244 us
Thread 44:  tsum = 1816586, Avg = 165144 us
Thread 55:  tsum = 1004661, Avg = 91332 us    <== cpu 7
Thread 66:  tsum = 1814128, Avg = 164920 us
Thread 77:  tsum = 1818686, Avg = 165335 us
Thread 88:  tsum = 1004710, Avg = 91337 us    <== cpu 4
Thread 99:  tsum = 1816600, Avg = 165145 us
Thread 110: tsum = 1004564, Avg = 91324 us    <== cpu 5
Thread 121: tsum = 1814040, Avg = 164912 us
```

I also tested with maxcpus=15. There, the only cpu without its HT sibling enabled, cpu 7, turned in much better numbers than all the other cpus.

I then enabled all 16 cpus on the system (all threads enabled). With this, I see very consistent numbers across all cpus, and as a result the two ways of calculating the average return near-identical numbers:

```
Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 8
cpuid = 2, Index = 32
cpuid = 3, Index = 48
cpuid = 4, Index = 64
cpuid = 5, Index = 88
cpuid = 7, Index = 24
cpuid = 8, Index = 72
cpuid = 6, Index = 112
cpuid = 9, Index = 96
cpuid = 10, Index = 120
cpuid = 11, Index = 16
cpuid = 12, Index = 40
cpuid = 13, Index = 56
cpuid = 14, Index = 80
cpuid = 15, Index = 104
Thread 0:   tsum = 1327000, Avg = 165875 us
Thread 8:   tsum = 1320238, Avg = 165029 us
Thread 16:  tsum = 1320609, Avg = 165076 us
Thread 24:  tsum = 1327789, Avg = 165973 us
Thread 32:  tsum = 1321766, Avg = 165220 us
Thread 40:  tsum = 1323060, Avg = 165382 us
Thread 48:  tsum = 1320673, Avg = 165084 us
Thread 56:  tsum = 1322942, Avg = 165367 us
Thread 64:  tsum = 1323120, Avg = 165390 us
Thread 72:  tsum = 1326753, Avg = 165844 us
Thread 80:  tsum = 1317434, Avg = 164679 us
Thread 88:  tsum = 1322985, Avg = 165373 us
Thread 96:  tsum = 1320192, Avg = 165024 us
Thread 104: tsum = 1327635, Avg = 165954 us
Thread 112: tsum = 1317665, Avg = 164708 us
Thread 120: tsum = 1321815, Avg = 165226 us
Min: 164170 us
Max: 168977 us
Avg: 10375.0000 us             <======
Avg from stats = 10332.8496 us <======
StdDev: 1133.8242 us
```

Notice that the two averages are almost equal. I feel this suggests that the way the test case calculates times and averages is alright.
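As an aside, the socket/core pattern above can also be read straight from sysfs instead of being decoded from /proc/cpuinfo by hand. A minimal sketch, assuming the standard topology files are present; it stops at the first cpu with no sysfs entry:

```c
#include <stdio.h>

/* Read one integer from a per-cpu sysfs topology file. */
static int read_int(const char *fmt, int cpu, int *val)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path), fmt, cpu);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", val) != 1)
        *val = -1;
    fclose(f);
    return 0;
}

int main(void)
{
    for (int cpu = 0; ; cpu++) {
        int pkg, core;

        if (read_int("/sys/devices/system/cpu/cpu%d/topology/"
                     "physical_package_id", cpu, &pkg) < 0)
            break;
        if (read_int("/sys/devices/system/cpu/cpu%d/topology/core_id",
                     cpu, &core) < 0)
            break;
        printf("cpu%-2d -> package %d, core %d\n", cpu, pkg, core);
    }
    return 0;
}
```

Logical cpus that print the same (package, core) pair are HT siblings, which is exactly the pairing listed above.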
Now, some conclusions:
1) The problem is surely related to Hyper-Threading.
2) An interesting observation from the numbers:
   a) From comment #32, with maxcpus=9, the high per-cpu tsum values were in the 247xxxx range, whereas the low values were in the 136xxxx range.
   b) From comment #33, with maxcpus=15, the high values were in the 181xxxx range, whereas the low values were in the 100xxxx range.
   c) From comment #34, with maxcpus=16, all values were near 132xxxx.

This seems to indicate that enabling more and more threads improves the performance of each core; that is, the worst-case performance improves as more threads are enabled. I am not sure whether this can be attributed to something in the hardware or something in the scheduler design.

What next:
1) We need to test with the latest available kernels. I guess the behavior hasn't changed.
2) I will chat with some scheduler folks tomorrow to see if they think this needs to be looked at from the scheduler's point of view.
3) If the above two go as predicted, we will conclude that disabling HT is the way to resolve this problem.

Sripathi, excellent analysis. This is what we expected to begin with. However, I am curious why the earlier tests that used taskset to run on the first 4 cpus and the first 4 HT threads, leaving the last 4 cores completely unused, didn't result in higher numbers as well. Is the system perhaps ignoring our request and distributing the threads to real cpus first, and then to HT threads, despite the taskset setting? Also, according to Dinakar, the Intel specs suggest a 20-30% improvement with threads for *certain* workloads. You observed a single cpu take ~90ms, while with both hardware threads active each takes ~160ms. So what a single cpu can do in 180ms (90ms * 2), a threaded cpu can do in 160ms: (180 - 160) / 180 = 11% performance improvement. Given the variance in benchmarks and the cpu-intensive nature of this one, I think an 11% improvement is pretty well in line with what we should expect.

(In reply to comment #36)
> I am curious why the earlier tests that used taskset to run on the first 4
> cpus and the first 4 HT threads, leaving the last 4 cores completely unused,
> didn't result in higher numbers as well.

1) taskset -c 0,1,2,3,4,5,6,7 chooses the first threads of all cores on both sockets in the system.
2) taskset -c 0,2,4,6,8,10,12,14 chooses all cores and threads on the first socket and nothing from the second socket.
3) taskset -c 1,3,5,7,9,11,13,15 chooses all cores and threads on the second socket and nothing from the first socket.

Thinking about this now, the test should have FAILed with options 2 and 3 above. I suspect it has to do with how we set cpu affinity in the test case. I'll keep this in mind when I look at the test case again today.

We had a discussion with Vaidy and Bharata here today about the parts played by the hardware and the scheduler in this problem. We are curious why the worst-case performance on any core improves as more and more threads are enabled on other cores. Kiran is going to do the following to further understand what is going on:
1) Test with maxcpus=8 and with HT disabled in the BIOS, and compare the two results. We are not sure whether the unused threads in the cores can in some way disrupt the functioning of the active thread; this test will confirm it.
2) Confirm that the setaffinity calls in the test are really working and there are no ping-pong effects.
3) Verify that the method used for data collection in my debug code is correct.
4) Try to modify the test case to avoid any shared memory references between different threads. Currently the result array is shared across threads, which can lead to false cache line sharing (a sketch of this change follows below).

Just another data point: the behaviour of the test is exactly the same on 2.6.29-rc4-rt2-tip.
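For point 4, here is a sketch of what separate per-thread results can look like; hypothetical structure names, not the actual test modification. Each thread gets its own cacheline-aligned slot, so no two threads' counters share a cache line:

```c
#include <pthread.h>
#include <stdio.h>

#define CACHELINE 64
#define NTHREADS  9

/* One slot per thread, padded out to a full cache line. */
struct per_thread_result {
    long long tsum;                          /* this thread's total time */
    char pad[CACHELINE - sizeof(long long)]; /* keep slots apart         */
};

static struct per_thread_result results[NTHREADS]
        __attribute__((aligned(CACHELINE)));

static void *worker(void *arg)
{
    int id = *(int *)arg;
    /* ... do the matrix work, accumulating into results[id].tsum ... */
    results[id].tsum += 42;                  /* placeholder for timing   */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d: tsum = %lld\n", i, results[i].tsum);
    return 0;
}
```

With the slots padded, a write by one thread can no longer invalidate the cache line holding another thread's counter, which is the effect point 4 is trying to eliminate.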
Node    : elm3c175
Machine : HS22 Piranha
OS      : SLERT10GMC

```
--- Running testcase matrix_mult -c 0.75 ---
Wed Feb 18 18:42:30 CST 2009
Logging to /test/gowri/slert/linux-rt-tests/ltp/testcases/realtime/logs/elm3c175-x86_64-2.6.22.19-0.14-rt-2009-18-02-matrix_mult-c0.75.log
---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------
Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16
Running sequential operations
Min: 58035 us
Max: 58520 us
Avg: 58095.0000 us
StdDev: 86.2324 us
Running concurrent operations
Min: 98162 us
Max: 102970 us
Avg: 6206.0000 us
StdDev: 1374.6295 us
Concurrent Multipliers:
Min: 0.5912
Max: 0.5683
Avg: 9.3611
Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL
Wed Feb 18 18:42:39 CST 2009
The matrix_mult test appears to have completed.
```

This test also fails on the same platform with the MRG 101rt kernel:

OS : MRG101rt

```
--- Running testcase matrix_mult -c 0.75 ---
Fri Feb 13 19:27:59 EST 2009
Logging to /test/amrith/101/linux-rt-tests/ltp/testcases/realtime/logs/elm3c175-x86_64-2.6.24.7-101.el5rt-2009-13-02-matrix_mult-c0.75.log
---------------------------------------
Matrix Multiplication (SMP Performance)
---------------------------------------
Running 128 iterations
Matrix Dimensions: 100x100
Calculations per iteration: 8
Number of CPUs: 16
Running sequential operations
Min: 58302 us
Max: 58751 us
Avg: 58351.0000 us
StdDev: 78.8359 us
Running concurrent operations
Min: 97634 us
Max: 101083 us
Avg: 6202.0000 us
StdDev: 751.6913 us
Concurrent Multipliers:
Min: 0.5971
Max: 0.5812
Avg: 9.4084
Criteria: 12.00 * average concurrent time < average sequential time
Result: FAIL
Fri Feb 13 19:28:07 EST 2009
The matrix_mult test appears to have completed.
```

Created attachment 332525 [details]
Piranha (HS22) system info
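For reference, the pass criterion shown in these logs is easy to check numerically. A minimal sketch using the numbers from the SLERT log above; it reproduces the 9.3611 multiplier and the FAIL verdict:

```c
#include <stdio.h>

/* PASS iff criteria * avg_concurrent < avg_sequential. */
int main(void)
{
    double savg = 58095.0;  /* avg sequential time, us (from the log) */
    double cavg = 6206.0;   /* avg concurrent time, us (from the log) */
    double criteria = 12.0; /* required concurrent multiplier         */

    printf("multiplier = %.4f (need > %.2f)\n", savg / cavg, criteria);
    printf("%s\n", criteria * cavg < savg ? "PASS" : "FAIL");
    return 0;
}
```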
I was playing around with the 2.6.29-rc6-tip kernel for bug 51340, and I collected some statistics using oprofile and performance counters (Ingo's new stuff) on this machine. The reported matrix_mult scalability problem can be recreated on the mainline kernel as well, so the numbers I obtained are relevant to this bug too. I ran the test with 8 and 9 cpus and compared the results. The only thing that interested me was the number of "cache references" reported by Ingo's perfstat: it was significantly higher with 9 cpus than with 8. Example:

8 cpus:
```
5915822.044589  task clock ticks (msecs)
33382919396     instructions (events)
864042          cache references (events)   <=========
120540          cache misses (events)
1206357996      branches (events)
12170749        branch misses (events)
22310884840     bus cycles (events)
1504            pagefaults (events)
7               context switches (events)
5               CPU migrations (events)
Wall-clock time elapsed: 8397.334568 msecs
```

9 cpus:
```
244423.718692   task clock ticks (msecs)
32923054623     instructions (events)
5398471         cache references (events)   <=========
116694          cache misses (events)
1190252003      branches (events)
11485947        branch misses (events)
23563660280     bus cycles (events)
1503            pagefaults (events)
5               context switches (events)
5               CPU migrations (events)
```

Cache references go up by an order of magnitude, but cache misses actually go down... hrmm. Are we confident that the data collected is correct?

(In reply to comment #46)
> are we confident that the data collected is correct?

The procedure to use perfstat.c was pretty simple, so I am unlikely to have gone wrong with it. I don't know how accurate perfstat itself is, though.

I did a couple of tests as mentioned by Sripathi in comment #40:

1) I tested with HT enabled and maxcpus=8; the problem does not occur. I repeated it with HT disabled, and the results are similar to the HT-enabled case; the problem does not occur there either. This confirms that the unused threads in the cores do not disrupt the functioning of the active thread.

2) I booted the system with maxcpus=9. I printed the cpu ids on which the threads run, and these ids remain the same throughout. I also checked the output of top -H: there is a uniform load on the threads throughout, with no sudden changes. So it can be confirmed that the setaffinity calls are working fine and there is no ping-ponging of threads.

3) I modified the test case to avoid any shared memory references between different threads: I created a separate results array for each thread, so the threads do not access a shared array. The test still fails, but the numbers are much better than before. The results for maxcpus=9:

```
Running concurrent operations
cpuid = 0, Index = 0
cpuid = 1, Index = 30
cpuid = 2, Index = 45
cpuid = 3, Index = 60
cpuid = 4, Index = 75
cpuid = 5, Index = 90
cpuid = 6, Index = 105
cpuid = 7, Index = 120
cpuid = 8, Index = 15
Thread 7: tsum = 95933,  Avg = 6395 us
Thread 3: tsum = 95919,  Avg = 6394 us
Thread 4: tsum = 95944,  Avg = 6396 us
Thread 5: tsum = 95940,  Avg = 6396 us
Thread 6: tsum = 95950,  Avg = 6396 us
Thread 2: tsum = 95933,  Avg = 6395 us
Thread 1: tsum = 95939,  Avg = 6395 us
Thread 8: tsum = 169954, Avg = 11330 us
Thread 0: tsum = 170074, Avg = 11338 us
Concurrent Multipliers Avg: 5.2216
```

Is there anything else we want to try here?
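Perfstat.c was an early prototype of what later became the perf_event_open(2) syscall, so the same "cache references" counter can be read from a program directly. A minimal sketch, assuming a kernel with perf events; the workload is a placeholder and error handling is trimmed:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* Thin wrapper: glibc provides no prototype for this syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_REFERENCES;
    attr.disabled = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any cpu */
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the workload here (the matrix multiplication etc.) ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));
    printf("cache references: %lld\n", count);
    close(fd);
    return 0;
}
```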
The only thing I can think of is to adjust our test cases to account for HT-type CPUs, but I don't think it's really worth it, since we can just turn HT off in the BIOS. Can we close this out as NOT_A_BUG?

Today I verified that turning off HT solves this problem on Piranha (elm3c175) as well. We are done as far as this bug is concerned. The results that Kiran has put together with a modified test case are interesting for understanding how this hardware works, but they have nothing to do with this bug directly. Hence we are going to close this bug. I am rejecting it as ALT_SOLUTION_AVAIL, the alternative solution being to turn off HT on Nehalem-based machines.
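As an alternative to the BIOS switch, HT siblings can also be taken offline at runtime through CPU hotplug. A minimal sketch, assuming a CONFIG_HOTPLUG_CPU kernel and that cpus 8-15 are the second hardware threads, per the taskset discussion above; run as root:

```c
#include <stdio.h>

int main(void)
{
    /* Offline the assumed HT siblings (logical cpus 8-15). */
    for (int cpu = 8; cpu <= 15; cpu++) {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            continue;
        }
        fputc('0', f);   /* write '0' to offline, '1' to bring back */
        fclose(f);
    }
    return 0;
}
```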