Bug 610297 - task scheduler moves jobs between CPUs; linpack performance drops 7% on average, 17% in the worst case
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assigned To: Johannes Weiner
QA Contact: Red Hat Kernel QE team
Duplicates: 613476
Depends On:
Blocks: 573755
Reported: 2010-07-01 20:01 EDT by Jiri Hladky
Modified: 2015-08-31 23:50 EDT (History)
CC: 7 users

Doc Type: Bug Fix
Last Closed: 2010-09-10 08:27:00 EDT


Attachments
Complete run data (545.50 KB, application/x-gzip)
2010-07-01 20:08 EDT, Jiri Hladky

Description Jiri Hladky 2010-07-01 20:01:03 EDT
Description of problem:
-----------------------

We are testing the performance of the task scheduler on multi-CPU boxes. The testing methodology is:

Do the following on a box with M CPU cores:
1) Start N<M parallel runs of linpack (a floating-point benchmark) and collect the Flops reported by linpack.
2) Start N<M parallel runs of linpack, but this time bind each linpack to exactly one CPU. We set the CPU affinity using the taskset command, carefully selecting the best possible subset of N of the M available cores to get the best performance (see the sketch below).
3) Compare the Flops reported by linpack in cases 1) and 2). When the task scheduler is working properly, 1) = 2). When the task scheduler is doing a poor job, expect the Flops from run 1) to be lower than the Flops from run 2).
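
For illustration, a minimal sketch of the two kinds of runs (the core list 0-7 is just an example; the real harness picks the best subset for the given topology):

# Case 1: N=8 unbound runs, placement left to the task scheduler
for i in $(seq 8); do linpackd & done; wait

# Case 2: N=8 runs, each pinned to one core with taskset
for cpu in 0 1 2 3 4 5 6 7; do taskset -c $cpu linpackd & done; wait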

linpack runs very fast as long as all of its data fits into the CPU cache. The performance of linpack is therefore very sensitive to CPU switching: when the kernel moves a linpack run from one core to another, we see a big drop in the Flops reported by linpack.
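
One quick way to watch the switching directly, besides the histograms below, is the psr column of ps, which shows the processor each process last ran on:

$ watch -n2 'ps -C linpackd -o pid,psr,comm'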

We would like to report here strange behavior of the kernel task scheduler on the ibm-hs22-04.rhts.eng.brq.redhat.com box, equipped with 2x Intel Xeon E5530 CPUs.

Please take a look at the following CPU-usage histogram for 8 CPU threads:
======================================================================================
                        CPU USER% UTILIZATION KEY
        |
        | space =  0- 2%    2 = 20-29%   5 = 50-59%   8 = 80- 89
        |     . =  3- 9%    3 = 30-39%   6 = 60-69%   9 = 90- 97
LOGICAL |     1 = 10-19%    4 = 40-49%   7 = 70-77%   * = 98-100
    CPU |
========+============================== Interval: 2 seconds ===>
      0 | ***************2       3 2 
      1 | ***********************    
      2 | 7************1             
      3 | 1                          
      4 |                       16*7 
      5 | 7**********************5   
      6 |                      99*** 
      7 |                            
      8 |                6*******    
      9 | .                          
     10 | .            9***********8 
     11 | *************************9 
     12 | 8**********************    
     13 |                            
     14 | *********************      
     15 | 8*********************9    
=======================================================================================

As you can see, the following is happening:
one linpackd process is moved from CPU #0 to CPU #8
another linpackd process is moved from CPU #2 to CPU #10
another linpackd process is moved from CPU #14 to CPU #6
another linpackd process is moved from CPU #15 to CPU #4


The histogram for the affinity run looks like this:
======================================================================================
                        CPU USER% UTILIZATION KEY
        |
        | space =  0- 2%    2 = 20-29%   5 = 50-59%   8 = 80- 89
        |     . =  3- 9%    3 = 30-39%   6 = 60-69%   9 = 90- 97
LOGICAL |     1 = 10-19%    4 = 40-49%   7 = 70-77%   * = 98-100
    CPU |
========+============================== Interval: 2 seconds ===>
      0 | *************************4 
      1 | *************************4 
      2 | 9************************1 
      3 | 8************************4 
      4 | 8************************3 
      5 | 9************************1 
      6 | 8************************3 
      7 | 8************************3 
      8 |                            
      9 |                            
     10 |                            
     11 |                            
     12 |                            
     13 |                            
     14 |                  .         
     15 |                            
=====================================================================================

We would expect a similar picture for the run without CPU affinity set.

The impact on the linpackd results is a drop in Flops of 7% on average. However, CPU switching does not affect all parallel linpackd runs equally; for some runs the performance drop is as big as 17%!

Statistical analysis (we ran the experiment described above 3 times):
==========================================================================================
$ cd results_2010-Jul-01_06h32m15s/CSV                                                                       
$ getstats --twosamplet 8streams_linpackd.affinity_merged.csv 8streams_linpackd.default.csv
8streams_linpackd.affinity_merged.csv
NAME      COUNT MEAN    MEDIAN  LOW     HIGH    MIN     MAX     SDEV% HW%   
MEGAFLOPS 24    335.583 335.351 333.936 337.230 328.469 342.077 1.162 0.491 

8streams_linpackd.default.csv
NAME      COUNT MEAN    MEDIAN  LOW     HIGH    MIN     MAX     SDEV% HW%   O/H    
MEGAFLOPS 24    313.842 312.158 300.668 327.017 280.768 360.985 9.941 4.198 -6.479 
===========================================================================================

As you can see, SDEV% for the default runs (no CPU affinity set) is much bigger than for the affinity runs (check the MIN and MAX values as well). This corresponds very well with the histograms above.
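
For reference, the MEAN and SDEV% (relative standard deviation) columns can be recomputed from the raw data with a one-liner like this (a sketch only; it assumes a header line and the MEGAFLOPS values in the second comma-separated column, which may differ from the actual CSV layout that getstats consumes):

$ awk -F, 'NR>1 { n++; s+=$2; ss+=$2*$2 }
           END  { m=s/n; sd=sqrt((ss-n*m*m)/(n-1));
                  printf "MEAN=%.3f SDEV%%=%.3f\n", m, 100*sd/m }' 8streams_linpackd.default.csv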

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Hostname ibm-hs22-04.rhts.eng.brq.redhat.com
Arch     x86_64
Distro   RHEL6.0-20100622.1
Kernel   2.6.32-37.el6.x86_64
SElinux  Permissive

CPU count 16 : 16 @ 2527.000
CPU      Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
CPU cache 8192 KB

Total Memory 15941 (MB)
#NUMA nodes 2

===============CPU topology==================
Machine (16GB)
  NUMANode p#0 (8181MB) + Socket p#0 + L3 (8192KB)
    L2 (256KB) + L1 (32KB) + Core p#0
      PU p#0
      PU p#8
    L2 (256KB) + L1 (32KB) + Core p#1
      PU p#1
      PU p#9
    L2 (256KB) + L1 (32KB) + Core p#2
      PU p#2
      PU p#10
    L2 (256KB) + L1 (32KB) + Core p#3
      PU p#3
      PU p#11
  NUMANode p#1 (8192MB) + Socket p#1 + L3 (8192KB)
    L2 (256KB) + L1 (32KB) + Core p#0
      PU p#4
      PU p#12
    L2 (256KB) + L1 (32KB) + Core p#1
      PU p#5
      PU p#13
    L2 (256KB) + L1 (32KB) + Core p#2
      PU p#6
      PU p#14
    L2 (256KB) + L1 (32KB) + Core p#3
      PU p#7
      PU p#15
=============================================
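
A topology dump like the one above can be obtained with hwloc's lstopo (assuming the hwloc package is installed):

$ lstopo-no-graphics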

How reproducible:
-----------------
Use the ibm-hs22-04.rhts.eng.brq.redhat.com box. You can also try another 2-socket box with 2x Intel(R) Xeon(R) CPU E5530 @ 2.40GHz CPUs.

Steps to Reproduce:
1. Disable all cron daemons
2. Set all CPU frequency governors to performance (see the sketch after these steps)
3. Download linpack source code: http://cvs.devel.redhat.com/cgi-bin/cvsweb.cgi/tests/performance/linpack/linpack.tar?rev=1.1
4. Compile linpack. It will produce the double-precision version "linpackd" and the single-precision version "linpacks". Make sure to use "linpackd"
5. Start 8 parallel runs of linpackd
linpackd & linpackd & linpackd & linpackd & linpackd & linpackd & linpackd & linpackd
6. Use your favorite CPU monitor (mpstat, dstat, ...) and watch whether the runs move from one CPU to another (see the sketch below)
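
A sketch for steps 2 and 6 (it assumes the cpufreq sysfs interface is available; the exact commands used in our runs are in commands.sh in the attached tarball):

# Step 2: set every CPU's frequency governor to performance
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $g
done

# Step 6: watch per-CPU utilization at 2-second intervals
mpstat -P ALL 2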

  
Actual results:
The task scheduler is moving linpackd runs between different CPUs.

Expected results:
Similar to the affinity histogram above; no CPU switching.

Additional info:
See RHTS run http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=164900

I will attach RHTSlinpack-results_2010-Jul-01_06h32m15s.tar.gz from the RHTS job above.

Some interesting files from the tar file:
commands.sh: take the commands to set the CPU governors from this file
CSV directory: compare the files 8streams_linpackd.affinity_merged.csv and 8streams_linpackd.default.csv
rawdata/8streams directory: check the linpackd.default.loop*histogram files
Comment 1 Jiri Hladky 2010-07-01 20:08:58 EDT
Created attachment 428587 [details]
Complete run data

See
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=15030311

Some interesting files from the tar file:
commands.sh: list of all important commands executed
CSV directory: compare the files 8streams_linpackd.affinity_merged.csv and 8streams_linpackd.default.csv
rawdata/8streams directory: check the linpackd.default.loop*histogram files
Comment 3 RHEL Product and Program Management 2010-07-01 20:23:05 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 4 RHEL Product and Program Management 2010-07-15 10:35:35 EDT
This issue was proposed at a time when only blocker issues are being
considered for the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 5 Johannes Weiner 2010-09-02 07:19:26 EDT
Is this a regression from RHEL5?

Otherwise, I don't think there is much we can do.  HPC workloads are heavily tuned to the system they run on anyway, so it makes much more sense to also set the CPU affinities up front than to hope the load balancer gets it right when every cycle is important to you.
Comment 6 Jiri Hladky 2010-09-02 10:39:19 EDT
Hi Johannes,

this is not a comparison against RHEL 5.5.

I have compared following scenarios on RHEL 6.0:

1) Running multiple linpack jobs
2) Running multiple linpack jobs with CPU affinity set

Please see the Description of the problem for a detailed explanation of the testing methodology.

I have run this test on ~20 different boxes and I see wrong results only on a few CPU models.

Please let me know if you have any questions about how to reproduce this bug.

Thanks
Jirka
Comment 7 Johannes Weiner 2010-09-02 12:14:53 EDT
I asked for RHEL5 because there might be a point in following up on the problem if it were a regression against older versions.

As it stands, I still suggest setting the affinities manually if every cycle counts.  The load balancer is not perfect and never will be.  There are other valid cases where kernel heuristics are bypassed when there is better knowledge of the situation than the kernel can infer (e.g., databases using direct IO to bypass the page cache).

Task migration entails cache misses; could it be that the CPU models behaving worst simply have the highest cache-miss penalties?
Comment 8 Jiri Hladky 2010-09-02 12:50:27 EDT
Hi Johannes,

I will schedule the jobs for RHEL 5.5. I agree that it can help to find out whether it's an issue in RHEL 6.0 only or a general problem of the given architecture.

However, on most test systems I see very small differences between default and affinity runs. This one is just very strange. A worst-case regression of 17% is a lot. I don't believe that we can explain it with high cache-miss penalties.

>Task migration entails cache misses, may it be possible that the CPU models
>behaving worst just have the highest cache miss penalties?

How could you find this information? 

The Xeon E5530 belongs to the "Gainestown" family:

http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Gainestown.22_.2845_nm.29

This CPU was released last year. I don't see why it should experience higher penalties than similar Xeon CPUs released last year.

Intel's spec is here:
http://ark.intel.com/Product.aspx?id=37103

Thanks
Jirka
Comment 9 Jiri Hladky 2010-09-02 13:21:47 EDT
Hi Johannes,

I have found RHEL 5.5 data for
ibm-hs22-01.rhts.eng.brq.redhat.com
which is exactly the same HW as
ibm-hs22-04.rhts.eng.brq.redhat.com

5.5 is clean; 6.0 is not. I claim that it's really a bug in the RHEL 6.0 kernel:

========================5.5===================================================
RESULTS: (in MEGAflops)

                 |          S C H E D U L I N G    M O D E         |DEFAULT &  |
                 |                                                 |AFFINITY   |
                 |                                                 |COMPARISON |
NUMBER |FLOATING |         DEFAULT        |       CPU AFFINITY     |           |
  OF   |  POINT  |                        |                        | %   TEST  |
STREAMS|PRECISION| TOTAL   AVG STDEV SCALE| TOTAL   AVG STDEV SCALE|DIFF STATUS|
-------+---------+------------------------+------------------------+-----------+
   1      Single |  1941  1941   7.4    - |  1940  1940   8.7    - |   0   PASS|
   1      Double |   509   509   0.7    - |   504   504   5.2    - |   0   PASS|

   2      Single |  3884  1942   2.4  2.00|  3872  1936  10.6  2.00|   0   PASS|
   2      Double |  1008   504   6.7  1.98|  1008   504   4.0  2.00|   0   PASS|

   4      Single |  7304  1826  34.8  3.76|  7148  1787  50.1  3.68|   0   PASS|
   4      Double |  1604   401  27.0  3.15|  1604   401  26.7  3.18|   0   PASS|

   8      Single |  4552   569 130.3  2.35|  4528   566 133.6  2.33|   0   PASS|
   8      Double |  1344   168  39.2  2.64|  1352   169  41.3  2.68|   0   PASS|

  16      Single |  2656   166  44.0  1.37|  2672   167  45.1  1.38|   0   PASS|
  16      Double |  1216    76  21.0  2.39|  1216    76  20.9  2.41|   0   PASS|
===============================================================================

And RHEL 6.0, kernel 2.6.32-44.el6.x86_64

================================6.0============================================
RESULTS: (in MEGAflops)

                 |          S C H E D U L I N G    M O D E         |DEFAULT &  |
                 |                                                 |AFFINITY   |
                 |                                                 |COMPARISON |
NUMBER |FLOATING |         DEFAULT        |       CPU AFFINITY     |           |
  OF   |  POINT  |                        |                        | %   TEST  |
STREAMS|PRECISION| TOTAL   AVG STDEV SCALE| TOTAL   AVG STDEV SCALE|DIFF STATUS|
-------+---------+------------------------+------------------------+-----------+
   1      Single |  2041  2041   6.0    - |  2043  2043   4.6    - |   0   PASS|
   1      Double |  1659  1659   8.9    - |  1660  1660   7.5    - |   0   PASS|

   2      Single |  4090  2045   1.8  2.00|  4080  2040   8.1  2.00|   0   PASS|
   2      Double |  3276  1638  26.8  1.97|  3296  1648  11.7  1.99|   0   PASS|

   4      Single |  7896  1974  33.5  3.87|  7780  1945  24.0  3.81|   0   PASS|
   4      Double |  3304   826  93.2  1.99|  3512   878  39.4  2.12|   6   FAIL|

   8      Single |  7496   937  90.1  3.67|  7992   999  33.5  3.91|   7   FAIL|
   8      Double |  2472   309  30.5  1.49|  2616   327   5.2  1.58|   6   FAIL|

  16      Single |  5152   322   5.9  2.52|  5040   315   4.5  2.47|   0   PASS|
  16      Double |  2496   156   1.0  1.50|  2480   155   0.9  1.49|   0   PASS|
===============================================================================

Please notice that it's failing for 8 threads in both single and double precision, and for 4 threads only in double precision.

I see exactly the same on ibm-hs22-04.rhts.eng.brq.redhat.com

========================2.6.32-37.el6.x86_64===================================
RESULTS: (in MEGAflops)

                 |                      S C H E D U L I N G    M O D E                      | DEFAULT COMPARED TO   |
                 |                                                                          |                       |
                 |                                                                          | AFFINITY  |   NUMA    |
NUMBER |FLOATING |         DEFAULT        |       CPU AFFINITY     |       NUMA AFFINITY    |           |           |
  OF   |  POINT  |                        |                        |                        | %   TEST  | %   TEST  |
STREAMS|PRECISION| TOTAL   AVG STDEV SCALE| TOTAL   AVG STDEV SCALE| TOTAL   AVG STDEV SCALE|DIFF STATUS|DIFF STATUS|
-------+---------+------------------------+------------------------+------------------------+-----------+-----------+
   1      Single |  2147  2147   4.7    - |  2148  2148   4.3    - |  2146  2146   5.6    - |   0   PASS|   0   PASS|
   1      Double |  1724  1724  24.2    - |  1742  1742  10.6    - |  1742  1742  10.2    - |   0   PASS|   0   PASS|

   2      Single |  4292  2146   6.6  2.00|  4300  2150   1.0  2.00|  4300  2150   1.3  2.00|   0   PASS|   0   PASS|
   2      Double |  3442  1721  40.6  2.00|  3476  1738   7.5  2.00|  3474  1737   5.9  1.99|   0   PASS|   0   PASS|

   4      Single |  8292  2073  31.3  3.86|  8220  2055  27.3  3.83|  8268  2067  26.2  3.85|   0   PASS|   0   PASS|
   4      Double |  3280   820 123.1  1.90|  3468   867  62.2  1.99|  3600   900  38.4  2.07|   5   FAIL|   9   FAIL|

   8      Single |  7736   967 136.4  3.60|  8336  1042  40.5  3.88|  8328  1041  34.2  3.88|   7   FAIL|   7   FAIL|
   8      Double |  2496   312  31.8  1.45|  2672   334   4.7  1.53|  2632   329   5.2  1.51|   7   FAIL|   5   FAIL|

  16      Single |  5136   321  10.3  2.39|  5120   320   3.3  2.38|  5216   326   8.1  2.43|   0   PASS|   0   PASS|
  16      Double |  2512   157   1.1  1.46|  2496   156   0.9  1.43|  2528   158   1.2  1.45|   0   PASS|   0   PASS|
===========================================================================

Again, it's failing for 8 threads in both single and double precision, and for 4 threads only in double precision.

Based on this, I'm confident that there is a bug in the RHEL 6.0 kernel.

Thanks
Jirka
Comment 10 Johannes Weiner 2010-09-09 05:02:41 EDT
(In reply to comment #8)
> Hi Johannes,
> 
> I will schedule the jobs for RHEL5.5. I agree that it can help to find out if
> it's an issue in RHEL6.0 only or general problem of that given architecture.
> 
> However, on most of test systems I see very small differences between default
> and affinity runs. This one is just very strange. Worst case regression 17% is
> a lot. I don't believe that we can explain it with high cache miss penalties.

The counter-theory would be that the task scheduler behaves differently depending on the CPU model, but I cannot really imagine that.

Can you redo the runs prefixed with `perf stat'?  It will report the number of task migrations the process underwent.
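
For example (a sketch; plain `perf stat' already includes a cpu-migrations counter in its default output, and cache events can be added explicitly):

$ perf stat linpackd
$ perf stat -e cache-misses,cache-references linpackd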

> >Task migration entails cache misses, may it be possible that the CPU models
> >behaving worst just have the highest cache miss penalties?
> 
> How could you find this information? 

There are CPU cache benchmarks, but I have never used any of them, so I cannot recommend one in particular.
Comment 11 Jiri Hladky 2010-09-10 08:27:00 EDT
Hello Johannes,

I have rerun the tests using RHEL 6.0 Snapshot 13 and the issue is fixed! I don't see any regression anymore.

We can close this BZ.

Thanks for the "perf stat" hint; I will use it in the future to track down similar problems.


Thanks
Jirka
Comment 12 Johannes Weiner 2012-03-30 12:33:42 EDT
*** Bug 613476 has been marked as a duplicate of this bug. ***
