Bug 674389 - uneven cpuset scheduling
Summary: uneven cpuset scheduling
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Peter Zijlstra
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-02-01 17:03 UTC by Travis Gummels
Modified: 2018-11-14 14:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-23 16:09:34 UTC
Target Upstream Version:


Attachments
package needed for reproducer (6.09 KB, application/x-rpm)
2011-02-10 23:09 UTC, Ben Woodard
no flags
second package needed for the reproducer (116.81 KB, application/x-rpm)
2011-02-10 23:11 UTC, Ben Woodard
no flags
reproducer script (786 bytes, application/x-shellscript)
2011-02-10 23:12 UTC, Ben Woodard
no flags

Description Travis Gummels 2011-02-01 17:03:26 UTC
Description of problem:

LLNL is using cpusets to confine processes to a specific set of CPUs.  For some numbers of CPUs per set the behaviour is as expected, with one process per CPU.  For other numbers of CPUs per set they are seeing 1 or 2 of the CPUs idling while a few others are oversubscribed.

This issue is negatively impacting LLNL production.

Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux Server release 5.6 (Tikanga)
Kernel 2.6.18-238.el5 on an x86_64

How reproducible:

1) Install pdsh, required for reproducer script below.
2) Start the reproducer script.
3) Monitor with top.

Reproducer Script:

# hype149 /tmp > cat cpuset-test.sh
#!/bin/bash

# Tweak CPUS and NCPUS to change outcome.
CPUS=3-11
NCPUS=9

CPUSETDIR=/dev/cpuset
TESTID=cpuset-test-$$
CPUSET=${CPUSETDIR}/${TESTID}

cleanup ()
{
[ -d $CPUSET ] || return
#
# Jump back out of cpuset
#
echo $$ > /dev/cpuset/tasks
rmdir $CPUSET
}

die () { echo "cpuset-test: $@" >&2; cleanup; exit 1; }

mkdir $CPUSET || die "Failed to create cpuset at $CPUSET"

echo $CPUS > $CPUSET/cpus || die "Failed to populate cpuset"
cat /dev/cpuset/mems > $CPUSET/mems

echo $$ > $CPUSET/tasks || die "Failed to add myself to $TESTID"


#
# Compile a busy loop:
#
echo "int main (int ac, char **av) { while (1) {}; }" >/tmp/busy.c
gcc -o/tmp/busy /tmp/busy.c

[ -f /tmp/busy ] || die "failed to create busy loop program"

#
# Execute NCPUS busy loops in parallel:
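# (pdsh's "exec" rcmd module runs the command locally once per target; the
#  hostlist "[3-11]" expands to 9 targets, and -f$NCPUS lets all of them
#  run at once)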
#
pdsh -f$NCPUS -w "[$CPUS]" -Rexec /tmp/busy

cleanup 

# End Reproducer Script

Actual results:

For some cpuset sizes the processes are not evenly distributed across the CPUs in the set.

top - 17:06:40 up 7 min,  2 users,  load average: 7.30, 2.62, 0.95
Tasks: 401 total,  10 running, 391 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.7%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  20536192k total,   610488k used, 19925704k free,    31744k buffers
Swap: 22577144k total,        0k used, 22577144k free,   378156k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 6312 root      25   0  3664  324  256 R 100.1  0.0   1:40.78 busy              
 6313 root      25   0  3664  324  256 R 100.1  0.0   1:40.78 busy              
 6316 root      25   0  3664  324  256 R 100.1  0.0   1:40.78 busy              
 6318 root      25   0  3664  320  256 R 100.1  0.0   1:40.78 busy              
 6321 root      25   0  3664  324  256 R 100.1  0.0   1:40.77 busy              
 6324 root      25   0  3664  320  256 R 100.1  0.0   1:40.77 busy              
 6323 root      25   0  3664  320  256 R 99.8  0.0   1:40.76 busy               
 6322 root      25   0  3664  324  256 R 50.2  0.0   0:50.38 busy               
 6319 root      25   0  3664  324  256 R 49.9  0.0   0:50.39 busy               
 6341 root      15   0 13016 1540  944 R  0.0  0.0   0:00.23 top    


Expected results:

Processes are evenly distributed across the CPUs in the cpuset.

Additional info:

Comment 1 Travis Gummels 2011-02-01 17:14:38 UTC
=== In Red Hat Customer Portal Case 00412338 ===
--- Comment by Woodard, Ben on 1/31/2011 1:12 PM ---

The problem doesn't appear with RHEL6.

top - 13:12:00 up  1:04,  3 users,  load average: 9.76, 5.52, 2.29
Tasks: 458 total,  11 running, 447 sleeping,   0 stopped,   0 zombie
Cpu0  : 88.7%us, 10.9%sy,  0.0%ni,  0.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.4%us,  0.4%sy,  0.0%ni, 99.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  20459104k total,  1469176k used, 18989928k free,    24920k buffers
Swap: 22691832k total,        0k used, 22691832k free,   715064k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 6843 root      20   0  3828  352  276 R 100.0  0.0   4:21.00 busy              
 6846 root      20   0  3828  352  276 R 100.0  0.0   4:21.00 busy              
 6847 root      20   0  3828  352  276 R 100.0  0.0   3:27.40 busy              
 6850 root      20   0  3828  352  276 R 100.0  0.0   4:21.00 busy              
 6851 root      20   0  3828  348  276 R 100.0  0.0   4:21.00 busy              
 6852 root      20   0  3828  348  276 R 100.0  0.0   4:20.96 busy              
 6842 root      20   0  3828  352  276 R 99.8  0.0   4:20.99 busy               
 6844 root      20   0  3828  352  276 R 99.8  0.0   3:27.39 busy               
 6853 root      20   0  3828  352  276 R 99.8  0.0   4:20.96 busy               
 6860 root      20   0  430m 129m  12m R 99.8  0.6   0:28.77 yum

Comment 2 Travis Gummels 2011-02-01 17:18:38 UTC
=== In Red Hat Customer Portal Case 00412338 ===
--- Comment by Woodard, Ben on 1/28/2011 3:37 PM ---

This is with 2.6.18-238.el5.

One way to view this is to split out the per-core statistics in top (press '1').

Then you see:

Tasks: 401 total,  10 running, 391 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Note how core 6 is unused.

Comment 3 Travis Gummels 2011-02-01 17:19:01 UTC
=== In Red Hat Customer Portal Case 00412338 ===
--- Comment by Woodard, Ben on 1/28/2011 5:09 PM ---

This is with the backport of the patch from http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=908a7c1b9b80d06708177432020c80d147754691;hp=cd79007634854f9e936e2369890f2512f94b8759 applied to the kernel:

top - 17:06:40 up 7 min,  2 users,  load average: 7.30, 2.62, 0.95
Tasks: 401 total,  10 running, 391 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.7%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  20536192k total,   610488k used, 19925704k free,    31744k buffers
Swap: 22577144k total,        0k used, 22577144k free,   378156k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 6312 root      25   0  3664  324  256 R 100.1  0.0   1:40.78 busy              
 6313 root      25   0  3664  324  256 R 100.1  0.0   1:40.78 busy              
 6316 root      25   0  3664  324  256 R 100.1  0.0   1:40.78 busy              
 6318 root      25   0  3664  320  256 R 100.1  0.0   1:40.78 busy              
 6321 root      25   0  3664  324  256 R 100.1  0.0   1:40.77 busy              
 6324 root      25   0  3664  320  256 R 100.1  0.0   1:40.77 busy              
 6323 root      25   0  3664  320  256 R 99.8  0.0   1:40.76 busy               
 6322 root      25   0  3664  324  256 R 50.2  0.0   0:50.38 busy               
 6319 root      25   0  3664  324  256 R 49.9  0.0   0:50.39 busy               
 6341 root      15   0 13016 1540  944 R  0.0  0.0   0:00.23 top    

Note how one of the processors is unused, and at the end of the list there are two busy processes getting 50% CPU time instead of 100%.

Comment 13 Mark A. Grondona 2011-02-09 15:52:16 UTC
Here is a detailed description of the case we are hitting on our
quad-core, quad-socket Opterons when we run with a cpuset or CPU
affinity mask of CPUs 3-11. Note that these CPUs extend across 3
NUMA nodes (0-3, 4-7, 8-11), but of the first node only CPU3 is in
the cpus_allowed of the running tasks.
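
For the "CPU affinity" variant, a minimal stand-alone sketch (illustrative
only, not the exact program we run) just pins itself to CPUs 3-11 with
sched_setaffinity(2) and spins:

/* Illustrative only: pin this process to CPUs 3-11, then busy-loop so it
 * shows up under the restricted mask in top. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    for (cpu = 3; cpu <= 11; cpu++)
        CPU_SET(cpu, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    while (1)
        ;   /* same busy loop as the cpuset reproducer */
}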

This is my current understanding; it may be anywhere from slightly flawed
to way off base, so apologies for any mistakes.

We end up in a state where the scheduler is running 3 tasks on each
of the groups 0-3,4-7,8-11. In this specific case, CPU4 and CPU8
are idle, while CPU3 has 3 tasks running on it, each getting 33% cpu.

Looking at find_busiest_group(), when called for sched domain 0-15
by CPU4, we have the following calculations in the do/while loop for
each group:

 group 4-7  (this_group) nr_running=3 avg_load=384 group->cpu_power=128
 group 8-11              nr_running=3 avg_load=384 group->cpu_power=128
 group 12-15             nr_running=3 avg_load=0   group->cpu_power=128
 group 0-3               nr_running=3 avg_load=384 group->cpu_power=128

Note that in this case, since the avg_load of all three busy groups
is the same, find_busiest_group() will pick 8-11 as the busiest group,
since it is the first group for which

 (avg_load > max_load && sum_nr_running > group_capacity)

This is because sched_mc_power_savings is not set, so group_capacity == 1,
meaning that the scheduler will try to spread tasks around groups instead
of filling one group to its real capacity before migrating tasks to the
next group.

After picking group 8-11 as busiest, however, the scheduler rightly does
nothing since the load on 8-11 is the same as this group's load
(3 tasks total):

    if (!busiest || this_load >= max_load || busiest_nr_running == 0)
            goto out_balanced;
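
To make that concrete, here is a tiny user-space sketch (my own
illustration, not kernel code) that walks the three non-local groups with
the numbers from the table above and applies the two quoted conditions;
group_capacity is 1 because sched_mc_power_savings is off:

/* Illustration only (not kernel code): apply the two quoted conditions to
 * the group values computed above for the CPUs 3-11 case. */
#include <stdio.h>

int main(void)
{
    struct { const char *name; int nr_running; int avg_load; } groups[] = {
        { "8-11",  3, 384 },
        { "12-15", 3,   0 },
        { "0-3",   3, 384 },
    };
    int group_capacity = 1;     /* sched_mc_power_savings == 0 */
    int this_load = 384;        /* group 4-7, the caller's own group */
    int max_load = 0, busiest_nr_running = 0, i;
    const char *busiest = NULL;

    for (i = 0; i < 3; i++) {
        if (groups[i].avg_load > max_load &&
            groups[i].nr_running > group_capacity) {
            max_load = groups[i].avg_load;
            busiest = groups[i].name;
            busiest_nr_running = groups[i].nr_running;
        }
    }

    /* 8-11 wins simply because it is visited first ... */
    printf("busiest = %s\n", busiest);

    /* ... and then the balance attempt bails out, since 384 >= 384. */
    if (!busiest || this_load >= max_load || busiest_nr_running == 0)
        printf("out_balanced: nothing is migrated\n");
    return 0;
}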



There is a patch that went in upstream that ostensibly tries to rectify
the situation where one group is unbalanced due to the cpus_allowed mask:

 http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.37.y.git;a=commit;h=908a7c1b9b80d06708177432020c80d147754691

This patch detects a group that is likely imbalanced by checking the
difference between the most loaded and least loaded CPU in that group.
If the difference is greater than SCHED_LOAD_SCALE (one task's worth of
load), the group likely cannot balance itself, so the patch sets a
group_imb flag that is used in addition to the
sum_nr_running > group_capacity test to find the busiest group, as seen
in this hunk:

@@ -2519,11 +2530,12 @@ find_busiest_group(struct sched_domain *
            this_nr_running = sum_nr_running;
            this_load_per_task = sum_weighted_load;
        } else if (avg_load > max_load &&
-              sum_nr_running > group_capacity) {
+              (sum_nr_running > group_capacity || __group_imb)) {
            max_load = avg_load;
            busiest = group;
            busiest_nr_running = sum_nr_running;
            busiest_load_per_task = sum_weighted_load;
+           group_imb = __group_imb;
        }

 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
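
For completeness, the detection side of that patch (not shown in the hunk
above) boils down to roughly the following, going by the description
earlier in this comment (a paraphrased, self-contained sketch, not the
literal upstream diff):

/* Paraphrased sketch of the imbalance detection described above (not the
 * literal upstream code): given the per-CPU loads of one group, flag the
 * group as internally imbalanced when the most and least loaded CPUs
 * differ by more than SCHED_LOAD_SCALE ("one task"). */
#define SCHED_LOAD_SCALE 1024UL     /* 1 << 10 in this era */

static int group_is_imbalanced(const unsigned long *cpu_load, int ncpus)
{
    unsigned long max_cpu_load = 0, min_cpu_load = ~0UL;
    int i;

    for (i = 0; i < ncpus; i++) {
        if (cpu_load[i] > max_cpu_load)
            max_cpu_load = cpu_load[i];
        if (cpu_load[i] < min_cpu_load)
            min_cpu_load = cpu_load[i];
    }

    return (max_cpu_load - min_cpu_load) > SCHED_LOAD_SCALE;
}

In the CPUs 3-11 case, group 0-3 has one CPU at roughly 3 * SCHED_LOAD_SCALE
and three CPUs at 0, so it would be flagged.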


However, this patch only helps when sched_mc_power_savings is set,
because otherwise group_capacity is 1 for all groups,
sum_nr_running > group_capacity holds for all 3 loaded groups, and
group 8-11 is again detected as the "busiest" group.

In fact, even if group 0-3 _is_ detected as the busiest group,
find_busiest_group will still return out_balanced for this case
since in the conditional:

     if (!busiest || this_load >= max_load || busiest_nr_running == 0)
        goto out_balanced;

this_load >= max_load is always true since the load of all three
busy groups is the same (3 tasks).

What we seem to need is a way to artificially increase the "load" of
the 0-3 group, in which CPU3 is oversubscribed.
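
Purely as an illustration of that idea (not a proposed patch), one could
imagine biasing a group's reported load by the spread between its most and
least loaded CPU, so a group whose runnable tasks are all pinned onto one
CPU no longer ties with the evenly loaded groups (SCHED_LOAD_SCALE as in
the sketch above):

/* Hypothetical illustration only: bias a group's load upward when its
 * per-CPU loads are badly skewed, so that group 0-3 (CPU3 running 3 tasks,
 * CPUs 0-2 idle) stops tying with the evenly loaded groups. */
static unsigned long biased_group_load(unsigned long avg_load,
                                       unsigned long max_cpu_load,
                                       unsigned long min_cpu_load)
{
    unsigned long spread = max_cpu_load - min_cpu_load;

    if (spread > SCHED_LOAD_SCALE)
        avg_load += spread / 2;     /* arbitrary bias, illustration only */

    return avg_load;
}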

Comment 22 Ben Woodard 2011-02-10 23:09:39 UTC
Created attachment 478143 [details]
package needed for reproducer

Comment 23 Ben Woodard 2011-02-10 23:11:46 UTC
Created attachment 478144 [details]
second package needed for the reproducer

Comment 24 Ben Woodard 2011-02-10 23:12:22 UTC
Created attachment 478145 [details]
reproducer script

Comment 27 RHEL Program Management 2011-06-20 22:31:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 28 Linda Wang 2011-08-23 16:09:34 UTC
Okay, per comment #26, closing this issue as won't fix.

