Bug 71035

Summary: Scheduler puts computationally bound processes on same die with hyperthreading
Product: [Retired] Red Hat Linux
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Ben Woodard <woodard>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: djuran, joshua.bakerlepain
Last Closed: 2004-02-10 23:33:29 UTC
Attachments:
  computationally bound process which is useful in illustrating the problem
  data file needed with test program

Description Ben Woodard 2002-08-08 01:37:11 UTC

Description of problem:

Performance problems arise when hyperthreading is enabled and the number of
computational threads is less than the number of available logical CPUs on a
node.  Most of the time the CPU-intensive threads end up on logical CPUs that
share a physical CPU, while the second physical CPU remains relatively idle.
The table below illustrates this.

CPU relative performance, SPPM (higher is better)

2 nodes: 2 physical, 4 logical cpus per node; 2.2 GHz Xeon, 512K L2 cache,
hyperthreading enabled

Job config                      Relative performance of each CPU
-----------------------------   --------------------------------
2 mpi procs                     1
2 mpi procs, 2 threads/proc     0.710037175
2 mpi procs, 4 threads/proc     1.299319728
4 mpi procs                     0.636136553
4 mpi procs, 2 threads/proc     1.290540541
8 mpi procs                     1.17992278


You can see from the table above that in three cases hyperthreading is working
well, with per-CPU performance boosts of 30%, 29%, and 18% over the base case of
2 MPI processes (one process per node).  However, in the two cases where only 2
logical CPUs are used for computation on each 4-logical-CPU node, performance
per CPU is much worse than the base case, even though it should be essentially
equivalent.

This problem is not as visible if MPI, and therefore its communication threads,
are removed from the picture.  Consider the following runs on one of these nodes
with only threading enabled:

Job config      Relative performance of each CPU
------------    --------------------------------
1 thread        1
2 threads       0.980208851
4 threads       1.249968294

Here the 2-thread case behaves (nearly) as expected even though not all logical
CPUs are utilized, and the 4-thread case delivers a 25% hyperthreading
performance boost.

So the problem with MPI enabled is that essentially idle threads are consuming
whole physical CPUs, while 2 threads that are using 98-99% of CPU resources are
sharing a physical CPU.  It would be ideal if the kernel were able to recognize
logical CPU - physical CPU relationships, and preferentially task-switch
CPU-intensive processes/threads to logical CPUs that are on idle physical CPUs.
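
Until the scheduler does this on its own, the effect can be approximated by hand
on kernels that expose CPU affinity controls.  The sketch below is only
illustrative: it assumes sched_setaffinity(2) and the CPU_SET() glibc macros are
available (which is not the case for the stock RHL 7.3 kernels discussed here),
and it hard-codes the sibling layout seen on these nodes (cpu0/cpu1 on one
package, cpu2/cpu3 on the other).

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to one logical CPU, then spin doing FP work.
 * Run two copies, e.g. "./pin 0" and "./pin 2", to force the two
 * compute-bound processes onto different physical packages. */
int main(int argc, char **argv)
{
        int cpu = (argc > 1) ? atoi(argv[1]) : 0;
        cpu_set_t set;
        volatile double x = 0.0001;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }
        for (;;)                        /* purely CPU-bound busy work */
                x = x * 1.0000001 + 1e-9;
}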

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Obtain a dual-processor machine that supports hyperthreading.
2. Turn on hyperthreading in the BIOS.
3. Create two computationally bound processes (a minimal stand-in for the
   attached test program is sketched below).
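
The attached test program provides the compute load; as a rough stand-in (this
is not attachment 69824, just an illustrative busy loop), any purely CPU-bound
process will do:

/* Illustrative CPU-bound worker: spins forever on floating-point work,
 * never sleeps or does I/O, so it stays runnable the whole time. */
int main(void)
{
        volatile double x = 0.0001;

        for (;;)
                x = x * 1.0000001 + 1e-9;
}

Start two copies and watch which logical CPU each instance runs on (for example
via the PSR column where ps supports the psr output field).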

Actual Results:  Both computationally bound processes end up on cpu0 and cpu1,
which are on the same physical die.

Expected Results:  The computationally bound processes should run on cpu0 and
cpu2, which are separate physical devices.

Additional info:

This impacts performance substantially when you have fewer computationally
intensive processes than you have hyperthreaded CPUs.
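
To confirm which logical CPUs are siblings on a given box, the "physical id"
reported in /proc/cpuinfo can be compared across processors.  A rough sketch
follows; it assumes the running kernel actually exposes the "physical id"
field, which only HT-aware kernels do.

#include <stdio.h>

/* Print the logical-CPU -> physical-package mapping from /proc/cpuinfo. */
int main(void)
{
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        int cpu = -1, pkg = -1;

        if (!f) {
                perror("/proc/cpuinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "processor : %d", &cpu) == 1)
                        continue;       /* remember the current logical CPU */
                if (sscanf(line, "physical id : %d", &pkg) == 1)
                        printf("cpu%d is on physical package %d\n", cpu, pkg);
        }
        fclose(f);
        return 0;
}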

Comment 1 Ben Woodard 2002-08-08 02:14:33 UTC
Please sit on this issue for a little bit. We may have uncovered some additional
facts which may shed light on the issue.

Comment 2 Ben Woodard 2002-08-09 23:36:15 UTC
Created attachment 69824 [details]
computationally bound process which is useful in illustrating the problem

Comment 3 Ben Woodard 2002-08-09 23:37:23 UTC
Created attachment 69825 [details]
data file needed with test program

Comment 4 Ben Woodard 2002-08-12 23:14:53 UTC
It sort of seems like the kernel hyperthreading capability may have been damaged
by the linux-2.4.17-selected-ac-bits.patch.  I noticed that in the stock 2.4.18
there are two places where the variable cpu_sibling_map is used: one is in
smpboot.c and the other is in sched.c.  smpboot.c stores the fact that two CPUs
are actually related in the variable, and then sched.c makes use of that piece
of information when selecting an idle CPU.

In our kernel, however, smpboot.c still stores the fact that the two CPUs are
related, but the scheduler doesn't seem to make use of that fact.

Caveat: I currently do not have root access to a machine that has hyperthreading
capability, so I am unable to verify that the way the 2.4.18 kernel uses
cpu_sibling_map yields the correct behavior.  I expect to have root access on the
needed machine within a day or so.

The place in the stock 2.4.18 kernel that I'm referring to is:

   259                  /*
   260                   * We use the first available idle CPU. This creates
   261                   * a priority list between idle CPUs, but this is not
   262                   * a problem.
   263                   */
   264                  if (tsk == idle_task(cpu)) {
   265  #if defined(__i386__) && defined(CONFIG_SMP)
   266                          /*
   267                           * Check if two siblings are idle in the same
   268                           * physical package. Use them if found.
   269                           */
   270                          if (smp_num_siblings == 2) {
   271                                  if (cpu_curr(cpu_sibling_map[cpu]) == 
   272                                      idle_task(cpu_sibling_map[cpu])) {
   273                                          oldest_idle = last_schedule(cpu);
   274                                          target_tsk = tsk;
   275                                          break;
   276                                  }
   277                                  
   278                          }
   279  #endif          

Because sched.c is so different in our kernels, I was unable to quickly find the
analogous location.
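
To spell out the policy in the excerpt above, it amounts to something like the
following standalone model.  This is purely illustrative: the names and
structure here are hypothetical and are not taken from either kernel tree.

/* Hypothetical standalone model of sibling-aware idle-CPU selection.
 * Not kernel source; "sibling" plays the role of cpu_sibling_map[]. */
struct fake_cpu {
        int sibling;    /* index of this CPU's hyperthreaded sibling */
        int idle;       /* nonzero if only the idle task is running  */
};

/* Prefer an idle CPU whose sibling is also idle (a fully idle physical
 * package); otherwise fall back to any idle CPU; -1 if none is idle. */
int pick_idle_cpu(const struct fake_cpu cpus[], int ncpus)
{
        int cpu, fallback = -1;

        for (cpu = 0; cpu < ncpus; cpu++) {
                if (!cpus[cpu].idle)
                        continue;
                if (cpus[cpus[cpu].sibling].idle)
                        return cpu;     /* whole package idle: best choice */
                if (fallback < 0)
                        fallback = cpu; /* remember a half-busy package */
        }
        return fallback;
}

The 2.4.18 excerpt expresses the same preference by breaking out of its idle-CPU
search early when cpu_sibling_map[cpu] is also running its idle task, which is
the check that appears to be missing from our patched sched.c.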

Comment 5 Ben Woodard 2002-08-12 23:24:09 UTC
OK, we can unsuspend this one.  I think I've gathered as much information about
the problem as I can at this particular time.

Comment 6 Habig, Alec 2003-04-24 14:22:33 UTC
The problem is still present in RH9's 2.4.20 kernel, pretty much as Ben
describes in his most excellent bug report.  

This has apparently been fixed by Ingo's O(1) scheduler in the 2.5 series
kernel.  Chunks of this scheduler appear to have already been backported to 2.4.
How easy would it be for the hyperthreading fixes to also be backported?  As
more and more hyperthread-enabled CPUs hit the market, this will become a bigger
issue.  Particularly since the Big New Feature of RH9 is the glibc threading
stuff, it seems that RH would be quite interested in getting the most out of
this new hardware gizmo.


Comment 7 David Juran 2004-02-06 15:23:09 UTC
Seems to be fixed in at least kernel-smp-2.4.21-9.EL.