Bug 1065304
| Summary: | kernel/sched: incorrect setup of sched_group->cpu_power for NUMA systems | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Pär Lindfors <paran+rhbugzilla> |
| Component: | kernel | Assignee: | Radim Krčmář <rkrcmar> |
| Status: | CLOSED ERRATA | QA Contact: | Jiri Hladky <jhladky> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.5 | CC: | cap, ccui, csieh, dhoward, djdumas, foraker1, gcturner, jbrouer, jhladky, johnny, kcleveng, kkolakow, lmiccini, lwoodman, mej, michele, mschuppe, msvoboda, orion, pablo.iranzo, pasteur, perfbz, prarit, qzhang, rkrcmar, sauchter, stalexan, tgummels, tommi.tervo, toracat, woodard |
| Target Milestone: | rc | Keywords: | Performance, Regression, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-461.el6 | Doc Type: | Bug Fix |
| Doc Text: | A previous patch to the kernel scheduler fixed a kernel panic caused by a divide-by-zero bug in the init_numa_sched_groups_power() function. However, that patch introduced a regression on systems with standard Non-Uniform Memory Access (NUMA) topology: cpu_power in all but one NUMA domain was set to twice the expected value. This resulted in incorrect task scheduling, with some processors left idle even though there were enough queued tasks for them to run, which had a negative impact on system performance. This update ensures that cpu_power on systems with standard NUMA topology is set to the expected values by adding an estimate to cpu_power for every uncounted CPU. Task scheduling now works as expected on these systems, without the performance issues caused by this bug. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-14 05:57:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 994246, 1091826 | | |
| Attachments: | | | |
Description
Pär Lindfors
2014-02-14 10:12:38 UTC
We now believe that the underlying problem is that the CPU scheduler puts several threads on the same core(s) while idling others.

Steps to Reproduce:
1. Start "md5sum /dev/zero" x times for x processors on the system:
   $ for i in $(seq 1 $(egrep "^processor" /proc/cpuinfo | wc -l)) ; \
     do md5sum /dev/zero & done
2. Run top and hit "1" to get the per-CPU view.

Actual results:
One or more processors are fully or partially idle.

Expected results (observed with the 6.4 kernel):
All processors have zero or near-zero idle numbers.

My colleague Peter Kjellström has found the root cause of this problem. I have updated the summary to reflect the actual bug. Here is Peter's comment, copied from the CentOS bug tracker:

Wrote a systemtap script that dumps all relevant information (sched_domain, sched_groups, ...). It seems the problem is that one NUMA zone gets an incorrect cpu_power.

On 6.4 (output from my stap script on a 20-core IVB server):

    sdlevel:5 sdflags:1071 sdspan:11111111111111111111 sdname:"NODE"
      grpcpupow: 10238 cpupoworig: 0 mask:11111111110000000000
      grpcpupow: 10240 cpupoworig: 0 mask:00000000001111111111

On 6.5:

    sdlevel:5 sdflags:1071 sdspan:11111111111111111111 sdname:"NODE"
      grpcpupow: 10238 cpupoworig: 0 mask:11111111110000000000
      grpcpupow: 20470 cpupoworig: 0 mask:00000000001111111111

Note how the 2nd sched_group in the "NODE" sched_domain has about 2x the expected value (it is supposed to be both ~equal to the first sched_group and ~1024 * numcores in the group).

I have successfully updated the value on a running kernel with systemtap, and this fixes the problem. I have also reverted a part of sched.c and rebuilt; this also fixes the problem.

I suspect that this is what caused it (fix a boot problem on an exotic machine and break all normal machines...):

* Tue Jul 02 2013 Jarod Wilson <jarod> [2.6.32-395.el6]
- [kernel] sched: make weird topologies bootable (Radim Krcmar) [892677]

Created attachment 868573 [details]
revert sg->cpu_power setting code
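For readers less familiar with scheduler groups, the following self-contained sketch reproduces the arithmetic behind the numbers in Peter's analysis above. It is illustrative only and is not the RHEL 6 sched.c accumulation code; the constant SCHED_LOAD_SCALE (1024) is a real kernel value, everything else is invented for the example. Each core is expected to contribute roughly 1024 to its group's cpu_power, so a 10-core NODE group should land near 10240; if every core's contribution is effectively counted twice, the group reports roughly 20480, which matches the 20470 dumped on the 6.5 kernel.

/*
 * Illustrative arithmetic only -- not the actual sched.c accumulation code.
 * Each core is expected to contribute about SCHED_LOAD_SCALE (1024) to its
 * sched_group's cpu_power.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024
#define CORES_PER_NODE   10	/* one socket of the 2x10-core IVB server */

int main(void)
{
	unsigned long expected = 0, doubled = 0;
	int cpu;

	for (cpu = 0; cpu < CORES_PER_NODE; cpu++) {
		expected += SCHED_LOAD_SCALE;		/* counted once  */
		doubled  += 2 * SCHED_LOAD_SCALE;	/* counted twice */
	}

	/* ~10240, close to the 10238/10240 values dumped on the 6.4 kernel */
	printf("expected NODE group cpu_power: %lu\n", expected);
	/* ~20480, close to the 20470 value dumped on the 6.5 kernel */
	printf("doubled  NODE group cpu_power: %lu\n", doubled);
	return 0;
}

The dumped values are not exact multiples of 1024, presumably because per-CPU power gets adjusted slightly at runtime, so the sketch only approximates them.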
Created attachment 870669 [details]
sched/numa: fix cpu_power initialization
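The attached patch is the authoritative fix. As a rough, hedged illustration of the idea summarized in the Doc Text field above ("adding an estimate to cpu_power for every uncounted CPU"), the self-contained sketch below uses simplified stand-in types; struct group, child_power, NR_CPUS and init_group_power are invented for this example and are not kernel structures or the patch itself. Each CPU in a group contributes either the power already computed for it at a lower domain level or, if it was not counted there, a flat SCHED_LOAD_SCALE estimate, so the group ends up near 1024 * number of CPUs rather than twice that.

/*
 * Simplified stand-in model of "add an estimate for every uncounted CPU".
 * Not the kernel code; types and names are illustrative only.
 */
#include <stdio.h>
#include <stdbool.h>

#define SCHED_LOAD_SCALE 1024
#define NR_CPUS 20

struct group {
	bool cpu_in_group[NR_CPUS];	/* stand-in for the group's cpumask */
	unsigned long cpu_power;
};

/* per-CPU power already computed at lower domain levels; 0 means uncounted */
static unsigned long child_power[NR_CPUS];

static void init_group_power(struct group *sg)
{
	int cpu;

	sg->cpu_power = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!sg->cpu_in_group[cpu])
			continue;
		if (child_power[cpu])
			sg->cpu_power += child_power[cpu];	/* counted   */
		else
			sg->cpu_power += SCHED_LOAD_SCALE;	/* estimated */
	}
}

int main(void)
{
	struct group node = { .cpu_power = 0 };
	int cpu;

	/* CPUs 10-19 form the second NODE group from the report above. */
	for (cpu = 10; cpu < NR_CPUS; cpu++)
		node.cpu_in_group[cpu] = true;

	/* Pretend only CPUs 10-14 had their power computed at a lower level. */
	for (cpu = 10; cpu < 15; cpu++)
		child_power[cpu] = SCHED_LOAD_SCALE;

	init_group_power(&node);
	/* Prints 10240: counted CPUs plus a 1024 estimate per uncounted CPU. */
	printf("node group cpu_power = %lu\n", node.cpu_power);
	return 0;
}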
Thanks for the great report. Would you mind verifying this patch?
You're welcome. We will test the patch as soon as possible. Please consider making this bug public.

Short summary: your patch seems to have the same effect as any of "revert to 6.4", "adjust cpu_power live with systemtap" or "my previous patch".

We've built a 2.6.32-431.5.1.el6 with your patch and done the following tests on a dual-socket Xeon-E5v2 (2x10 core):

[HT:on/off] systemtap script shows OK cpu_power for all sg in all sd
[HT:on/off] placement of cpu-hungry processes (***) on a partially full machine now balances correctly over sockets
[HT:on] placement of cpu-hungry processes on a partially full machine now balances correctly over siblings
[HT:on] placement of cpu-hungry processes on a full machine near perfect (*)
[HT:off] placement of cpu-hungry processes on a full machine much better (**)

(*) The patch improved the situation greatly, but sometimes (~1/20) starting 40 cpu-hungry processes quickly ends up with 21 on one socket and 19 on the other. This situation does not fix itself (at least not typically nor quickly). Starting the cpu-hungry processes with a small delay has never turned up a 21/19 split. NOTE: This behavior is also seen on the 6.4 kernel or with any of the other fixes to the 6.5 kernel. Very likely a different bug (by race/chance misplaced initially and never migrated?).

(**) Same as (*) but much more common (~1/2). Interestingly enough, a similar system with Xeon-E5 (not v2) (2x 8-core) cannot be provoked to show this bug (~100+ cycles done).

(***) "md5sum /dev/zero" is used as the cpu-hungry process. Running multiple instances concurrently is done with: "for i in $(seq 1 $count) ; do md5sum /dev/zero & done"

(In reply to Pär Lindfors from comment #8)
> Please consider making this bug public.

Okay :-)

A possible solution for any rhel6.5 NUMA regression ...
http://post-office.corp.redhat.com/archives/rhkernel-list/2014-March/msg01190.html

*** Bug 1071402 has been marked as a duplicate of this bug. ***

(In reply to Radim Krčmář from comment #11)
> A possible solution for any rhel6.5 NUMA regression ...
> http://post-office.corp.redhat.com/archives/rhkernel-list/2014-March/
> msg01190.html

Hi Radim,

Is it possible for you to make that info publicly available?

Created attachment 874403 [details]
sched/numa: fix cpu_power initialization

(In reply to Akemi Yagi from comment #13)
> Is it possible for you to make that info publicly available?

Of course, the attached patch is in mailbox format. (It talks about why it happened and how it is fixed now; the consequences of the bug are described in the comments above.)

Thanks, Radim. Greatly appreciated.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Verified the bug fix with 2.6.32-431.18.1.el6.bz1065304 on a 4-node NUMA system.

Sorry for a stupid question: do the above updates indicate that this bug will not be fixed in a normal 6.5 kernel update but as EUS for 6.5 (and presumably fixed from day 1 in 6.6)?

The bug is NOT fixed by the brew kernel build https://brewweb.devel.redhat.com/taskinfo?taskID=7386533. The first run of stream.1e8_RHEL65 on the 2.6.32-431.18.1.el6.bz1065304.x86_64 kernel performs as expected; the second and all additional runs perform badly.
I have rebooted the box and got the same behaviour again:
- the first run of stream.1e8 performs as expected
- all subsequent runs perform poorly

I will upload the results along with the run.sh used to run the test.

Thanks
Jirka

$ ./run.sh
17,18c17,18
< Each test below will take on the order of 30035 microseconds.
< (= 30035 clock ticks)
---
> Each test below will take on the order of 29029 microseconds.
> (= 29029 clock ticks)
27,30c27,30
< Copy:    35372.4   0.045233   0.045233   0.045233
< Scale:   36881.5   0.043382   0.043382   0.043382
< Add:     41961.7   0.057195   0.057195   0.057195
< Triad:   41971.2   0.057182   0.057182   0.057182
---
> Copy:    36049.8   0.044383   0.044383   0.044383
> Scale:   36452.2   0.043893   0.043893   0.043893
> Add:     40353.8   0.059474   0.059474   0.059474
> Triad:   40671.7   0.059009   0.059009   0.059009

15:43:19 root.eng.brq.redhat.com: /home/NFS
$ ./run.sh
17,18c17,18
< Each test below will take on the order of 48653 microseconds.
< (= 48653 clock ticks)
---
> Each test below will take on the order of 29855 microseconds.
> (= 29855 clock ticks)
27,30c27,30
< Copy:    25256.9   0.063349   0.063349   0.063349
< Scale:   28167.5   0.056803   0.056803   0.056803
< Add:     31327.1   0.076611   0.076611   0.076611
< Triad:   33849.9   0.070901   0.070901   0.070901
---
> Copy:    36158.2   0.044250   0.044250   0.044250
> Scale:   37312.6   0.042881   0.042881   0.042881
> Add:     41475.2   0.057866   0.057866   0.057866
> Triad:   41585.1   0.057713   0.057713   0.057713

Created attachment 890505 [details]
Results with the 2.6.32-431.18.1.el6.bz1065304.x86_64 kernel on a 4-node NUMA system

(In reply to Peter K from comment #24)
> Sorry for a stupid question, do the above updates indicate
> that this bug will not be fixed in a normal 6.5 kernel update
> but as EUS for 6.5 (and presumably fixed from day 1 in 6.6)?

The fix is planned for RHEL 6.5 as well - see BZ1091826.

Could you paste the cpu powers on your test machine? (Ideally before and after the test regresses.)

crash> p node_domains
PER-CPU DATA TYPE:
  struct static_sched_domain per_cpu__node_domains;
PER-CPU ADDRESSES:
  [0]: $address
  [...]
crash> p ((struct static_sched_domain *)0x$address)->sd.groups
$3 = (struct sched_group *) $groups_address
crash> list -s sched_group $groups_address
$wanted_output

Thanks.

---

If you wish to verify that the groups are being used (through percpu runqueues):

crash> p runqueues
crash> p ((struct rq *)0x$address)->sd->parent->parent->groups

This command should print the same address as the above one. The number of parents could be different; the right one has 'SD_LV_NODE' in its 'level', which can be queried instead of 'groups'. (I don't have NUMA at hand.)

Results with the 2.6.32-431.18.1.el6.bz1065304_revert.x86_64 kernel

Quick summary:
- cpu powers do not change
- second and subsequent runs are slower than the first run
- user time grows from 0m40.312s to 0m47.628s

Details are below. I will test it with the RHEL 6.4 kernel to see whether the 6.4 kernel behaves the same way or not.
Jirka

After FRESH boot
=========================================================================

crash> p ((struct static_sched_domain *)0xffff880028210ba0)->sd.groups
$3 = (struct sched_group *) 0xffff88023c676000
crash> list -s sched_group 0xffff88023c676000
ffff88023c676000
struct sched_group {
  next = 0xffff88023c68ec00,
  cpu_power = 7062,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c676010
}
ffff88023c68ec00
struct sched_group {
  next = 0xffff88023c68e800,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68ec10
}
ffff88023c68e800
struct sched_group {
  next = 0xffff88023c68e400,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e810
}
ffff88023c68e400
struct sched_group {
  next = 0xffff88023c676000,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e410
}

$ more run.sh
#!/bin/bash
LOG=$(uname -r).log
time ./stream.1e8_RHEL65 > ./stream.1e8_RHEL65.${LOG}
CPUS=$( lscpu | grep line | awk -F':' '{print $2}' | xargs)
time GOMP_CPU_AFFINITY=${CPUS} ./stream.1e8_RHEL65 > ./stream.1e8_RHEL65.GOMP_CPU_AFFINITY_${CPUS}.${LOG}
diff ./stream.1e8_RHEL65.${LOG} ./stream.1e8_RHEL65.GOMP_CPU_AFFINITY_${CPUS}.${LOG}

$ ./run.sh

real    0m1.417s
user    0m40.312s
sys     0m6.991s

real    0m1.305s
user    0m35.200s
sys     0m6.270s
17,18c17,18
< Each test below will take on the order of 30066 microseconds.
< (= 30066 clock ticks)
---
> Each test below will take on the order of 28978 microseconds.
> (= 28978 clock ticks)
27,30c27,30
< Copy:    35564.2   0.044989   0.044989   0.044989
< Scale:   36184.3   0.044218   0.044218   0.044218
< Add:     39718.0   0.060426   0.060426   0.060426
< Triad:   39964.8   0.060053   0.060053   0.060053
---
> Copy:    35082.4   0.045607   0.045607   0.045607
> Scale:   37254.3   0.042948   0.042948   0.042948
> Add:     41432.2   0.057926   0.057926   0.057926
> Triad:   41663.0   0.057605   0.057605   0.057605

==========================================================================
SECOND RUN
==========================================================================

crash> list -s sched_group 0xffff88023c676000
ffff88023c676000
struct sched_group {
  next = 0xffff88023c68ec00,
  cpu_power = 7068,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c676010
}
ffff88023c68ec00
struct sched_group {
  next = 0xffff88023c68e800,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68ec10
}
ffff88023c68e800
struct sched_group {
  next = 0xffff88023c68e400,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e810
}
ffff88023c68e400
struct sched_group {
  next = 0xffff88023c676000,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e410
}

$ ./run.sh

real    0m1.627s
user    0m47.628s
sys     0m6.326s

real    0m1.417s
user    0m38.632s
sys     0m6.469s
17,18c17,18
< Each test below will take on the order of 47596 microseconds.
< (= 47596 clock ticks)
---
> Each test below will take on the order of 29953 microseconds.
> (= 29953 clock ticks)
27,30c27,30
< Copy:    25982.5   0.061580   0.061580   0.061580
< Scale:   23853.5   0.067076   0.067076   0.067076
< Add:     29196.4   0.082202   0.082202   0.082202
< Triad:   30075.2   0.079800   0.079800   0.079800
---
> Copy:    35174.0   0.045488   0.045488   0.045488
> Scale:   35811.8   0.044678   0.044678   0.044678
> Add:     39256.7   0.061136   0.061136   0.061136
> Triad:   39515.2   0.060736   0.060736   0.060736

We've noticed that stream does not always run at the same speed, presumably due to other problems. For this reason we switched to the testing described in comment 2, which is simpler and clearer. We also noted that the fixed kernel still sometimes fails to place things correctly (see "NOTE:" in comment 9).
This happens on other 6.x we've tested too and is believed to be a different bug.

Results with the RHEL 6.4 kernel 2.6.32-358.el6.x86_64 on RHEL 6.5

Quick summary:
- cpu powers as reported by the crash utility do not change
- second and subsequent runs are slower than the first run
- user time grows from 35s for the run after the fresh boot to 47s in the second and all subsequent runs

Results are the same as with the 2.6.32-431.18.1.el6.bz1065304.x86_64 kernel, and the issue reported in this BZ is solved. Now looking into why the performance goes down between the first and second run of the stream benchmark.

Jirka

Details are below.

After FRESH reboot:

./run.sh

real    0m1.323s
user    0m35.419s
sys     0m6.858s

real    0m1.332s
user    0m36.008s
sys     0m6.592s
17,18c17,18
< Each test below will take on the order of 30553 microseconds.
< (= 30553 clock ticks)
---
> Each test below will take on the order of 30068 microseconds.
> (= 30068 clock ticks)
27,30c27,30
< Copy:    35004.1   0.045709   0.045709   0.045709
< Scale:   36458.9   0.043885   0.043885   0.043885
< Add:     40872.5   0.058719   0.058719   0.058719
< Triad:   41117.1   0.058370   0.058370   0.058370
---
> Copy:    35226.8   0.045420   0.045420   0.045420
> Scale:   35832.7   0.044652   0.044652   0.044652
> Add:     39426.3   0.060873   0.060873   0.060873
> Triad:   39856.4   0.060216   0.060216   0.060216

==========================================================
SECOND RUN
==========================================================

$ ./run.sh

real    0m1.584s
user    0m47.869s
sys     0m6.302s

real    0m1.310s
user    0m35.817s
sys     0m6.290s
17,18c17,18
< Each test below will take on the order of 39979 microseconds.
< (= 39979 clock ticks)
---
> Each test below will take on the order of 29968 microseconds.
> (= 29968 clock ticks)
27,30c27,30
< Copy:    24987.2   0.064033   0.064033   0.064033
< Scale:   25594.3   0.062514   0.062514   0.062514
< Add:     31146.1   0.077056   0.077056   0.077056
< Triad:   31230.1   0.076849   0.076849   0.076849
---
> Copy:    35179.6   0.045481   0.045481   0.045481
> Scale:   35883.3   0.044589   0.044589   0.044589
> Add:     39096.8   0.061386   0.061386   0.061386
> Triad:   40985.0   0.058558   0.058558   0.058558

*** Bug 1069256 has been marked as a duplicate of this bug. ***

Patch(es) available on kernel-2.6.32-461.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html