Bug 1065304
| Summary: | kernel/sched: incorrect setup of sched_group->cpu_power for NUMA systems | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Pär Lindfors <paran+rhbugzilla> |
| Component: | kernel | Assignee: | Radim Krčmář <rkrcmar> |
| Status: | CLOSED ERRATA | QA Contact: | Jiri Hladky <jhladky> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.5 | CC: | cap, ccui, csieh, dhoward, djdumas, foraker1, gcturner, jbrouer, jhladky, johnny, kcleveng, kkolakow, lmiccini, lwoodman, mej, michele, mschuppe, msvoboda, orion, pablo.iranzo, pasteur, perfbz, prarit, qzhang, rkrcmar, sauchter, stalexan, tgummels, tommi.tervo, toracat, woodard |
| Target Milestone: | rc | Keywords: | Performance, Regression, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-461.el6 | Doc Type: | Bug Fix |
| Doc Text: | A previous patch to the kernel scheduler fixed a kernel panic caused by a divide-by-zero bug in the init_numa_sched_groups_power() function. However, that patch introduced a regression on systems with standard Non-Uniform Memory Access (NUMA) topology: cpu_power in all but one NUMA domain was set to twice the expected value. This resulted in incorrect task scheduling, with some processors left idle even though there were enough queued tasks for them to run, which had a negative impact on system performance. This update ensures that cpu_power on systems with standard NUMA topology is set to the expected values by adding an estimate to cpu_power for every uncounted CPU. Task scheduling now works as expected on these systems, without the performance issues caused by this bug. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-14 05:57:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 994246, 1091826 | | |
| Attachments: | | | |
Description
Pär Lindfors
2014-02-14 10:12:38 UTC
We now believe that the underlying problem is that the CPU scheduler puts several threads on the same core(s) while idling others.

Steps to Reproduce:
1. Start "md5sum /dev/zero" x times for x processors on the system:
   $ for i in $(seq 1 $(egrep "^processor" /proc/cpuinfo | wc -l)) ; \
     do md5sum /dev/zero & done
2. Run top and hit "1" to get the per-CPU view.

Actual results:
One or more processors are fully or partially idle.

Expected results (observed with the 6.4 kernel):
All processors have zero or near-zero idle numbers.

My colleague Peter Kjellström has found the root cause of this problem. I have updated the summary to reflect the actual bug. Here is Peter's comment, copied from the CentOS bug tracker:

Wrote a systemtap script that dumps all relevant information (sched_domain, sched_groups, ...). It seems the problem is that one NUMA zone gets an incorrect cpu_power.

On 6.4 (output from my stap script on a 20-core IVB server):

    sdlevel:5 sdflags:1071 sdspan:11111111111111111111 sdname:"NODE"
      grpcpupow: 10238 cpupoworig: 0 mask:11111111110000000000
      grpcpupow: 10240 cpupoworig: 0 mask:00000000001111111111

On 6.5:

    sdlevel:5 sdflags:1071 sdspan:11111111111111111111 sdname:"NODE"
      grpcpupow: 10238 cpupoworig: 0 mask:11111111110000000000
      grpcpupow: 20470 cpupoworig: 0 mask:00000000001111111111

Note how the 2nd sched_group in the "NODE" sched_domain has about 2x the expected value (it is supposed to be both ~equal to the first sched_group and ~1024 * numcores in the group).

I have successfully updated the value on a running kernel with systemtap, and this fixes the problem. I have also reverted a part of sched.c and rebuilt; this also fixes the problem.

I suspect that this is what caused it (fix a boot problem on an exotic machine and break all normal machines...):

* Tue Jul 02 2013 Jarod Wilson <jarod> [2.6.32-395.el6]
- [kernel] sched: make weird topologies bootable (Radim Krcmar) [892677]

Created attachment 868573 [details]
revert sg->cpu_power setting code
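For readers less familiar with scheduler groups, the following self-contained sketch reproduces the arithmetic behind the numbers in Peter's analysis above. It is illustrative only and is not the RHEL 6 sched.c accumulation code; the constant SCHED_LOAD_SCALE (1024) is a real kernel value, everything else is invented for the example. Each core is expected to contribute roughly 1024 to its group's cpu_power, so a 10-core NODE group should land near 10240; if every core's contribution is effectively counted twice, the group reports roughly 20480, which matches the 20470 dumped on the 6.5 kernel.

/*
 * Illustrative arithmetic only -- not the actual sched.c accumulation code.
 * Each core is expected to contribute about SCHED_LOAD_SCALE (1024) to its
 * sched_group's cpu_power.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024
#define CORES_PER_NODE   10	/* one socket of the 2x10-core IVB server */

int main(void)
{
	unsigned long expected = 0, doubled = 0;
	int cpu;

	for (cpu = 0; cpu < CORES_PER_NODE; cpu++) {
		expected += SCHED_LOAD_SCALE;		/* counted once  */
		doubled  += 2 * SCHED_LOAD_SCALE;	/* counted twice */
	}

	/* ~10240, close to the 10238/10240 values dumped on the 6.4 kernel */
	printf("expected NODE group cpu_power: %lu\n", expected);
	/* ~20480, close to the 20470 value dumped on the 6.5 kernel */
	printf("doubled  NODE group cpu_power: %lu\n", doubled);
	return 0;
}

The dumped values are not exact multiples of 1024, presumably because per-CPU power gets adjusted slightly at runtime, so the sketch only approximates them.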
Created attachment 870669 [details]
sched/numa: fix cpu_power initialization
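The attached patch is the authoritative fix. As a rough, hedged illustration of the idea summarized in the Doc Text field above ("adding an estimate to cpu_power for every uncounted CPU"), the self-contained sketch below uses simplified stand-in types; struct group, child_power, NR_CPUS and init_group_power are invented for this example and are not kernel structures or the patch itself. Each CPU in a group contributes either the power already computed for it at a lower domain level or, if it was not counted there, a flat SCHED_LOAD_SCALE estimate, so the group ends up near 1024 * number of CPUs rather than twice that.

/*
 * Simplified stand-in model of "add an estimate for every uncounted CPU".
 * Not the kernel code; types and names are illustrative only.
 */
#include <stdio.h>
#include <stdbool.h>

#define SCHED_LOAD_SCALE 1024
#define NR_CPUS 20

struct group {
	bool cpu_in_group[NR_CPUS];	/* stand-in for the group's cpumask */
	unsigned long cpu_power;
};

/* per-CPU power already computed at lower domain levels; 0 means uncounted */
static unsigned long child_power[NR_CPUS];

static void init_group_power(struct group *sg)
{
	int cpu;

	sg->cpu_power = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!sg->cpu_in_group[cpu])
			continue;
		if (child_power[cpu])
			sg->cpu_power += child_power[cpu];	/* counted   */
		else
			sg->cpu_power += SCHED_LOAD_SCALE;	/* estimated */
	}
}

int main(void)
{
	struct group node = { .cpu_power = 0 };
	int cpu;

	/* CPUs 10-19 form the second NODE group from the report above. */
	for (cpu = 10; cpu < NR_CPUS; cpu++)
		node.cpu_in_group[cpu] = true;

	/* Pretend only CPUs 10-14 had their power computed at a lower level. */
	for (cpu = 10; cpu < 15; cpu++)
		child_power[cpu] = SCHED_LOAD_SCALE;

	init_group_power(&node);
	/* Prints 10240: counted CPUs plus a 1024 estimate per uncounted CPU. */
	printf("node group cpu_power = %lu\n", node.cpu_power);
	return 0;
}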
Thanks for the great report. Would you mind verifying this patch?
You're welcome. We will test the patch as soon as possible. Please consider making this bug public.

Short summary: your patch seems to have the same effect as any of "revert to 6.4", "adjust cpu_power live with systemtap" or "my previous patch".

We've built a 2.6.32-431.5.1.el6 with your patch and done the following tests on a dual-socket Xeon-E5v2 (2x10 core):

[HT:on/off] systemtap script shows OK cpu_power for all sg in all sd
[HT:on/off] placement of cpu-hungry processes (***) on a partially full machine now balances correctly over sockets
[HT:on] placement of cpu-hungry processes on a partially full machine now balances correctly over siblings
[HT:on] placement of cpu-hungry processes on a full machine near perfect (*)
[HT:off] placement of cpu-hungry processes on a full machine much better (**)

(*) The patch improved the situation greatly, but sometimes (~1/20) starting 40 cpu-hungry processes quickly ends up with 21 on one socket and 19 on the other. This situation does not fix itself (at least not typically nor quickly). Starting the cpu-hungry processes with a small delay has never turned up a 21/19 split. NOTE: This behavior is also seen on the 6.4 kernel or with any of the other fixes to the 6.5 kernel. Very likely a different bug (by race/chance misplaced initially and never migrated?).

(**) Same as (*) but much more common (~1/2). Interestingly enough, a similar system with Xeon-E5 (not v2) (2x 8-core) cannot be provoked to show this bug (~100+ cycles done).

(***) "md5sum /dev/zero" is used as the cpu-hungry process. Running multiple instances concurrently is done with: "for i in $(seq 1 $count) ; do md5sum /dev/zero & done"

(In reply to Pär Lindfors from comment #8)
> Please consider making this bug public.

Okay :-)

A possible solution for any rhel6.5 NUMA regression ...
http://post-office.corp.redhat.com/archives/rhkernel-list/2014-March/msg01190.html

*** Bug 1071402 has been marked as a duplicate of this bug. ***

(In reply to Radim Krčmář from comment #11)
> A possible solution for any rhel6.5 NUMA regression ...
> http://post-office.corp.redhat.com/archives/rhkernel-list/2014-March/
> msg01190.html

Hi Radim,

Is it possible for you to make that info publicly available?

Created attachment 874403 [details]
sched/numa: fix cpu_power initialization

(In reply to Akemi Yagi from comment #13)
> Is it possible for you to make that info publicly available?

Of course, the attached patch is in mailbox format. (It talks about why it happened and how it is fixed now; the consequences of the bug are described in the comments above.)

Thanks, Radim. Greatly appreciated.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Verified the bug fix with 2.6.32-431.18.1.el6.bz1065304 on a 4-node NUMA system.

Sorry for a stupid question: do the above updates indicate that this bug will not be fixed in a normal 6.5 kernel update but as EUS for 6.5 (and presumably fixed from day 1 in 6.6)?

The bug is NOT fixed by the brew kernel build https://brewweb.devel.redhat.com/taskinfo?taskID=7386533. The first run of stream.1e8_RHEL65 on the 2.6.32-431.18.1.el6.bz1065304.x86_64 kernel performs as expected; the second and all additional runs perform badly.
I have rebooted the box and got the same behaviour again:
- the first run of stream.1e8 performs as expected
- all subsequent runs perform poorly

I will upload the results along with the run.sh used to run the test.

Thanks
Jirka

$ ./run.sh
17,18c17,18
< Each test below will take on the order of 30035 microseconds.
< (= 30035 clock ticks)
---
> Each test below will take on the order of 29029 microseconds.
> (= 29029 clock ticks)
27,30c27,30
< Copy:    35372.4   0.045233   0.045233   0.045233
< Scale:   36881.5   0.043382   0.043382   0.043382
< Add:     41961.7   0.057195   0.057195   0.057195
< Triad:   41971.2   0.057182   0.057182   0.057182
---
> Copy:    36049.8   0.044383   0.044383   0.044383
> Scale:   36452.2   0.043893   0.043893   0.043893
> Add:     40353.8   0.059474   0.059474   0.059474
> Triad:   40671.7   0.059009   0.059009   0.059009

15:43:19 root.eng.brq.redhat.com: /home/NFS
$ ./run.sh
17,18c17,18
< Each test below will take on the order of 48653 microseconds.
< (= 48653 clock ticks)
---
> Each test below will take on the order of 29855 microseconds.
> (= 29855 clock ticks)
27,30c27,30
< Copy:    25256.9   0.063349   0.063349   0.063349
< Scale:   28167.5   0.056803   0.056803   0.056803
< Add:     31327.1   0.076611   0.076611   0.076611
< Triad:   33849.9   0.070901   0.070901   0.070901
---
> Copy:    36158.2   0.044250   0.044250   0.044250
> Scale:   37312.6   0.042881   0.042881   0.042881
> Add:     41475.2   0.057866   0.057866   0.057866
> Triad:   41585.1   0.057713   0.057713   0.057713

Created attachment 890505 [details]
Results with the 2.6.32-431.18.1.el6.bz1065304.x86_64 kernel on a 4-node NUMA system

(In reply to Peter K from comment #24)
> Sorry for a stupid question, do the above updates indicate
> that this bug will not be fixed in a normal 6.5 kernel update
> but as EUS for 6.5 (and presumably fixed from day 1 in 6.6)?

The fix is planned for RHEL 6.5 as well - see BZ1091826.

Could you paste the cpu powers on your test machine? (Ideally before and after the test regresses.)

crash> p node_domains
PER-CPU DATA TYPE:
  struct static_sched_domain per_cpu__node_domains;
PER-CPU ADDRESSES:
  [0]: $address
  [...]
crash> p ((struct static_sched_domain *)0x$address)->sd.groups
$3 = (struct sched_group *) $groups_address
crash> list -s sched_group $groups_address
$wanted_output

Thanks.

---

If you wish to verify that the groups are being used (through percpu runqueues):

crash> p runqueues
crash> p ((struct rq *)0x$address)->sd->parent->parent->groups

This command should print the same address as the above one. The number of parents could be different; the right one has 'SD_LV_NODE' in its 'level', which can be queried instead of 'groups'. (I don't have NUMA at hand.)

Results with the 2.6.32-431.18.1.el6.bz1065304_revert.x86_64 kernel

Quick summary:
- cpu powers do not change
- second and subsequent runs are slower than the first run
- user time grows from 0m40.312s to 0m47.628s

Details are below. I will test it with the RHEL 6.4 kernel to see whether the 6.4 kernel behaves the same way or not.
Jirka

After FRESH boot
=========================================================================

crash> p ((struct static_sched_domain *)0xffff880028210ba0)->sd.groups
$3 = (struct sched_group *) 0xffff88023c676000
crash> list -s sched_group 0xffff88023c676000
ffff88023c676000
struct sched_group {
  next = 0xffff88023c68ec00,
  cpu_power = 7062,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c676010
}
ffff88023c68ec00
struct sched_group {
  next = 0xffff88023c68e800,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68ec10
}
ffff88023c68e800
struct sched_group {
  next = 0xffff88023c68e400,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e810
}
ffff88023c68e400
struct sched_group {
  next = 0xffff88023c676000,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e410
}

$ more run.sh
#!/bin/bash
LOG=$(uname -r).log
time ./stream.1e8_RHEL65 > ./stream.1e8_RHEL65.${LOG}
CPUS=$( lscpu | grep line | awk -F':' '{print $2}' | xargs)
time GOMP_CPU_AFFINITY=${CPUS} ./stream.1e8_RHEL65 > ./stream.1e8_RHEL65.GOMP_CPU_AFFINITY_${CPUS}.${LOG}
diff ./stream.1e8_RHEL65.${LOG} ./stream.1e8_RHEL65.GOMP_CPU_AFFINITY_${CPUS}.${LOG}

$ ./run.sh

real    0m1.417s
user    0m40.312s
sys     0m6.991s

real    0m1.305s
user    0m35.200s
sys     0m6.270s
17,18c17,18
< Each test below will take on the order of 30066 microseconds.
< (= 30066 clock ticks)
---
> Each test below will take on the order of 28978 microseconds.
> (= 28978 clock ticks)
27,30c27,30
< Copy:    35564.2   0.044989   0.044989   0.044989
< Scale:   36184.3   0.044218   0.044218   0.044218
< Add:     39718.0   0.060426   0.060426   0.060426
< Triad:   39964.8   0.060053   0.060053   0.060053
---
> Copy:    35082.4   0.045607   0.045607   0.045607
> Scale:   37254.3   0.042948   0.042948   0.042948
> Add:     41432.2   0.057926   0.057926   0.057926
> Triad:   41663.0   0.057605   0.057605   0.057605

==========================================================================
SECOND RUN
==========================================================================

crash> list -s sched_group 0xffff88023c676000
ffff88023c676000
struct sched_group {
  next = 0xffff88023c68ec00,
  cpu_power = 7068,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c676010
}
ffff88023c68ec00
struct sched_group {
  next = 0xffff88023c68e800,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68ec10
}
ffff88023c68e800
struct sched_group {
  next = 0xffff88023c68e400,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e810
}
ffff88023c68e400
struct sched_group {
  next = 0xffff88023c676000,
  cpu_power = 7056,
  cpu_power_orig = 0,
  cpumask = 0xffff88023c68e410
}

$ ./run.sh

real    0m1.627s
user    0m47.628s
sys     0m6.326s

real    0m1.417s
user    0m38.632s
sys     0m6.469s
17,18c17,18
< Each test below will take on the order of 47596 microseconds.
< (= 47596 clock ticks)
---
> Each test below will take on the order of 29953 microseconds.
> (= 29953 clock ticks)
27,30c27,30
< Copy:    25982.5   0.061580   0.061580   0.061580
< Scale:   23853.5   0.067076   0.067076   0.067076
< Add:     29196.4   0.082202   0.082202   0.082202
< Triad:   30075.2   0.079800   0.079800   0.079800
---
> Copy:    35174.0   0.045488   0.045488   0.045488
> Scale:   35811.8   0.044678   0.044678   0.044678
> Add:     39256.7   0.061136   0.061136   0.061136
> Triad:   39515.2   0.060736   0.060736   0.060736

We've noticed that stream does not always run at the same speed, presumably due to other problems. For this reason we switched to the testing described in comment 2, which is simpler and clearer. We also noted that the fixed kernel still sometimes fails to place things correctly (see "NOTE:" in comment 9).
This happens on other 6.x we've tested too and is believed to be a different bug.

Results with the RHEL 6.4 kernel 2.6.32-358.el6.x86_64 on RHEL 6.5

Quick summary:
- cpu powers as reported by the crash utility do not change
- second and subsequent runs are slower than the first run
- user time grows from 35s for the run after the fresh boot to 47s in the second and all subsequent runs

Results are the same as with the 2.6.32-431.18.1.el6.bz1065304.x86_64 kernel, and the issue reported in this BZ is solved. Now looking into why the performance goes down between the first and second run of the stream benchmark.

Jirka

Details are below.

After FRESH reboot:

./run.sh

real    0m1.323s
user    0m35.419s
sys     0m6.858s

real    0m1.332s
user    0m36.008s
sys     0m6.592s
17,18c17,18
< Each test below will take on the order of 30553 microseconds.
< (= 30553 clock ticks)
---
> Each test below will take on the order of 30068 microseconds.
> (= 30068 clock ticks)
27,30c27,30
< Copy:    35004.1   0.045709   0.045709   0.045709
< Scale:   36458.9   0.043885   0.043885   0.043885
< Add:     40872.5   0.058719   0.058719   0.058719
< Triad:   41117.1   0.058370   0.058370   0.058370
---
> Copy:    35226.8   0.045420   0.045420   0.045420
> Scale:   35832.7   0.044652   0.044652   0.044652
> Add:     39426.3   0.060873   0.060873   0.060873
> Triad:   39856.4   0.060216   0.060216   0.060216

==========================================================
SECOND RUN
==========================================================

$ ./run.sh

real    0m1.584s
user    0m47.869s
sys     0m6.302s

real    0m1.310s
user    0m35.817s
sys     0m6.290s
17,18c17,18
< Each test below will take on the order of 39979 microseconds.
< (= 39979 clock ticks)
---
> Each test below will take on the order of 29968 microseconds.
> (= 29968 clock ticks)
27,30c27,30
< Copy:    24987.2   0.064033   0.064033   0.064033
< Scale:   25594.3   0.062514   0.062514   0.062514
< Add:     31146.1   0.077056   0.077056   0.077056
< Triad:   31230.1   0.076849   0.076849   0.076849
---
> Copy:    35179.6   0.045481   0.045481   0.045481
> Scale:   35883.3   0.044589   0.044589   0.044589
> Add:     39096.8   0.061386   0.061386   0.061386
> Triad:   40985.0   0.058558   0.058558   0.058558

*** Bug 1069256 has been marked as a duplicate of this bug. ***

Patch(es) available on kernel-2.6.32-461.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html