Bug 101465 - Kernel not performing under heavy load
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: high   Severity: high
Assigned To: Ingo Molnar
QA Contact: Brian Brock
Blocks: 101028 103278
Reported: 2003-08-01 11:26 EDT by Matt Pavlovich
Modified: 2007-11-30 17:06 EST

Doc Type: Bug Fix
Last Closed: 2004-10-11 11:26:50 EDT


Attachments
show effective CPU performance (445 bytes, text/plain)
2003-08-08 05:52 EDT, Ingo Molnar

Description Matt Pavlovich 2003-08-01 11:26:40 EDT
Description of problem:

On an HP/Compaq DL740 (8x 2 GHz Xeon, 64 GB RAM, Hyper-Threading turned OFF),
when running 8 'niced' seti processes, total CPU utilization is around 100% on
each processor, but no individual process shows more than 4% CPU.  Over 2 days
of testing, the number of work units completed by the 8 seti processes is
drastically less than expected; the seti results look as if each process is
only getting a very small amount of CPU.

This appears to be a VM or scheduler issue.

Version-Release number of selected component (if applicable):
RH AS Taroon-beta1

How reproducible:
100%

Steps to Reproduce:
1. Fire up multiple seti processes on a multi-CPU system with more than 4 GB of
RAM (so that the bigmem kernel is used); a stand-in CPU-burner sketch follows these steps.
2. Track results for a few days, and watch CPU stats via top and mpstat.
3. Scratch head 
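
For readers without the seti client, a minimal stand-in for the niced CPU load
might look like the sketch below. This is not part of the original report; the
file name spin_niced.c and the NCPUS value are assumptions. It forks one
nice-19 busy loop per CPU so the same top/mpstat pattern can be observed.

/* spin_niced.c -- hypothetical reproduction helper, not from the original
 * report: forks one nice-19 busy loop per CPU to mimic N niced seti
 * processes.  Watch the result with top and mpstat as in step 2. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define NCPUS 8                 /* assumed CPU count; the DL740 here has 8 */

int main(void)
{
    int i;

    for (i = 0; i < NCPUS; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {
            /* child: drop to nice 19, then burn CPU forever */
            if (setpriority(PRIO_PROCESS, 0, 19) != 0)
                perror("setpriority");
            for (;;)
                ;               /* pure CPU-bound spin, no I/O */
        }
    }
    /* parent: wait so the children stay in the foreground process group;
       interrupt with Ctrl-C to stop the whole test */
    while (wait(NULL) > 0)
        ;
    return 0;
}

Build with a plain 'gcc -O2 spin_niced.c -o spin_niced' and run it while
watching top and mpstat.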
    
Actual results:
Very small number of Seti work units completed

Expected results: 
Very high number of Seti work units completed

Additional info:
See the taroon-beta mailing list.  Another individual reproduced this problem
using the distributed.net client.  On a comparable 2-CPU system running
RH AS 2.1, the results are completely different: intr/s is ~260 per CPU, whereas
with RH AS 3.0 beta1 we only see ~110 on ONE CPU.  Additionally, the idle,
system, and nice values are vastly different.

Top output:
 10:17:29  up 2 days, 17:51,  1 user,  load average: 7.99, 7.98, 7.99
56 processes: 47 sleeping, 9 running, 0 zombie, 0 stopped
CPU0 states:  98.0% user   1.1% system   98.3% nice   0.0% iowait   0.0% idle
CPU1 states:  99.0% user   0.5% system   99.0% nice   0.0% iowait   0.0% idle
CPU2 states:  98.0% user   1.4% system   98.0% nice   0.0% iowait   0.0% idle
CPU3 states:  99.0% user   0.4% system   99.1% nice   0.0% iowait   0.0% idle
CPU4 states:  98.0% user   1.3% system   98.1% nice   0.0% iowait   0.0% idle
CPU5 states:  98.0% user   1.0% system   98.4% nice   0.0% iowait   0.0% idle
CPU6 states:  98.1% user   1.3% system   98.0% nice   0.0% iowait   0.0% idle
CPU7 states:  98.0% user   1.3% system   98.1% nice   0.0% iowait   0.0% idle
Mem:  65854864k av,  586772k used, 65268092k free,       0k shrd,  105496k buff
       279288k active,             153144k inactive
Swap: 4194192k av,       0k used, 4194192k free                  203868k cached
                                                                               
                          
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 1429 jhaynes   39  19 16360  15M   800 R N   1.9  0.0  88:10   2 setiathome2
 1431 jhaynes   39  19 15328  14M   800 R N   1.7  0.0 164:23   4 setiathome4
 1437 jhaynes   39  19 15328  14M   800 R N   1.5  0.0 126:45   6 setiathome6
 1439 jhaynes   39  19 15328  14M   800 R N   1.5  0.0 123:00   7 setiathome8
 1430 jhaynes   39  19 16356  15M   800 R N   1.3  0.0  92:29   0 setiathome3
 1428 jhaynes   39  19 15980  15M   800 R N   1.1  0.0  56:39   5 setiathome1
 1436 jhaynes   39  19 15328  14M   800 R N   0.9  0.0 162:39   1 setiathome5
 1438 jhaynes   39  19 16352  15M   800 R N   0.7  0.0 202:41   3 setiathome7

mpstat output:
[root@daldev35 addon]# mpstat -P ALL
Linux 2.4.21-1.1931.2.349.2.2.entbigmem (daldev35)      08/01/2003
 
10:18:08 AM  CPU   %user   %nice %system   %idle    intr/s
10:18:08 AM  all    0.01   96.18    3.24    0.58    112.50
10:18:08 AM    0    0.02   96.56    2.64    0.55    112.50
10:18:08 AM    1    0.00   97.23    2.18    0.62      0.00
10:18:08 AM    2    0.01   96.71    2.75    0.56      0.00
10:18:08 AM    3    0.00   97.31    2.10    0.62      0.00
10:18:08 AM    4    0.01   95.91    3.53    0.58      0.00
10:18:08 AM    5    0.00   96.94    2.56    0.54      0.00
10:18:08 AM    6    0.02   94.26    5.17    0.59      0.00
10:18:08 AM    7    0.03   94.49    4.96    0.56      0.00
Comment 1 Matt Pavlovich 2003-08-05 14:05:17 EDT
20030805 13:00:00 CST

I built a stock 2.4.21 kernel for this platform.  I rebooted the system and
started the processes.  The processes appear to be running as expected.
Comment 2 Ingo Molnar 2003-08-06 15:27:22 EDT
Do you get this regression even if you boot with only one CPU (maxcpus=1 boot
option)?
Comment 3 Ingo Molnar 2003-08-07 09:04:04 EDT
All the seti processes seem to be running fine according to your top output: it
shows nearly 100% reniced activity (i.e. seti) on each CPU, which is what one
would expect. But is it correct that despite the seti processes using up tons of
CPU time, they don't do any real work?
Comment 4 Matt Pavlovich 2003-08-07 16:02:43 EDT
Correct, the processes perform very little work.  The seti processes should
show CPU utilization even though they are reniced; it appears that the kernel is
just spinning on the niced processes and not getting any real work done.

I will try the system under a single CPU and report back.  
Comment 5 Ingo Molnar 2003-08-08 05:52:08 EDT
Created attachment 93514 [details]
show effective CPU performance

Your 'top' output suggests otherwise: it shows that 99% of CPU time is spent in
userspace (i.e. most likely in seti).

I've attached loop_print.c; please run nr_cpus copies of it ('make loop_print').
Does the looping perform just as fast as when running only one copy? Is the
looping performance the same when running with just a single CPU? Same
performance if reniced?
Comment 6 Matt Pavlovich 2003-08-19 16:44:31 EDT
HP ProLiant DL740, 8x 2.0 GHz CPUs, 64 GB RAM

I upgraded to the latest kernel available through RHN and tried the loop_print
tests.  I got similar loop counts for each of a single instance, 8 instances,
and 8 instances niced: ~4218000 loops.

Why does top show the CPU states at 100% but the per-process %CPU near 0.0 when
niced?
Comment 7 Ingo Molnar 2003-08-20 08:06:02 EDT
When reniced, you should see the following 'top' line:

 CPU0 states:  99.1% user   0.2% system   99.7% nice   0.0% iowait   0.0% idle

When not reniced, it should show:

 CPU0 states:  99.8% user   0.2% system    0.0% nice   0.0% iowait   0.0% idle

Does it work any other way for you?

Do you still see the SETI slowdown with the new kernel? If so, is the slowdown a
wall-clock slowdown (i.e. it does not finish the necessary work in 1 hour of
wall-clock time)?
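
One way to answer the wall-clock question directly is to time a fixed chunk of
CPU-bound work with both gettimeofday() and times(); the sketch below is an
illustration under that assumption, not something posted in the thread. If the
wall-clock completion time is normal while the reported utilization figures
look wrong, the problem is in accounting/reporting rather than in actual
scheduling.

/* walltime_vs_cputime.c -- hypothetical sketch: run a fixed workload and
 * report both elapsed wall-clock time and consumed user CPU time. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/times.h>

#define ITERATIONS 500000000UL      /* arbitrary fixed workload */

int main(void)
{
    struct timeval t0, t1;
    struct tms cpu0, cpu1;
    long hz = sysconf(_SC_CLK_TCK);
    volatile unsigned long i, sink = 0;
    double wall, cpu;

    gettimeofday(&t0, NULL);
    times(&cpu0);

    for (i = 0; i < ITERATIONS; i++)
        sink += i;                  /* fixed amount of CPU-bound work */

    times(&cpu1);
    gettimeofday(&t1, NULL);

    wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    cpu  = (double)(cpu1.tms_utime - cpu0.tms_utime) / hz;
    printf("wall-clock: %.2fs   user CPU: %.2fs\n", wall, cpu);
    return 0;
}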
Comment 8 Matt Wilson 2003-09-10 19:01:08 EDT
Any news on this? It has been over 20 days since the last update.
Comment 9 Matt Pavlovich 2003-09-11 11:28:25 EDT
I am trying to accurately gauge the CPU performance.  We've had some problems
with Seti, but have narrowed it down.

When the Seti processes run at 'nice 0' they complete between 40-44 units a day,
approximately 1 unit every 4 hours.  I am verifying that when run at 'nice 19'
they produce roughly half the results.  I will have solid data in 2 days.

Matt Pavlovich
Comment 10 Matt Pavlovich 2003-09-11 11:36:23 EDT
Odd note: all 8 Seti processes were running at 'nice 0'.  When I renice them to
19, one of them runs up to ~20% system time.  It's the same process, and it
stays that way even when it changes CPUs.  I've tried setting it back to '0',
then back to '19'.  No change.

CPU0 states:  75.0% user  24.0% system   75.4% nice   0.0% iowait   0.0% idle
CPU1 states:  96.0% user   3.1% system   96.3% nice   0.0% iowait   0.0% idle
CPU2 states:  98.0% user   1.2% system   98.2% nice   0.0% iowait   0.0% idle
CPU3 states:  97.1% user   1.1% system   97.2% nice   0.4% iowait   0.1% idle
CPU4 states:  97.0% user   2.0% system   97.4% nice   0.0% iowait   0.0% idle
CPU5 states:  98.0% user   1.4% system   98.0% nice   0.0% iowait   0.0% idle
CPU6 states:  98.0% user   1.3% system   98.1% nice   0.0% iowait   0.0% idle
CPU7 states:  99.0% user   0.3% system   99.2% nice   0.0% iowait   0.0% idle
Mem:  65854860k av, 65615008k used,  239852k free,       0k shrd,  346240k buff
                   1075976k actv,       0k in_d, 1340720k in_c
Swap: 4194192k av,       0k used, 4194192k free                 62998776k cached
                                                                               
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
12852 jhaynes   39  19 15572  15M   796 R N  24.1  0.0  2850m   0 setiathome2
12848 jhaynes   39  19 15336  14M   796 R N   3.3  0.0  2850m   1 setiathome8
12842 jhaynes   39  19 15856  15M   792 R N   2.1  0.0  2849m   4 setiathome4
12853 jhaynes   39  19 16292  15M   796 R N   1.9  0.0  2849m   5 setiathome3
12843 jhaynes   39  19 15964  15M   796 R N   1.7  0.0  2850m   6 setiathome5
12844 jhaynes   39  19 15936  15M   796 R N   1.7  0.0  2850m   2 setiathome6
12847 jhaynes   39  19 14892  14M   796 R N   1.1  0.0  2850m   3 setiathome7
12851 jhaynes   39  19 15872  15M   796 R N   0.5  0.0  2849m   7 setiathome1
Comment 11 Rik van Riel 2003-09-15 13:53:46 EDT
I think this was a problem with the accounting of nice time. Should be fixed in
a later kernel.
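For context on what "accounting of nice time" feeds: top and mpstat derive
their %nice figures from the tick counters in /proc/stat. The sketch below is
an illustration, not something from this bug; it samples the aggregate 'cpu'
line twice and prints the per-state shares over the interval, assuming the
2.4-era four-field layout (user nice system idle) and ignoring the extra
fields later kernels append.

/* nice_pct.c -- illustrative sketch of where %nice comes from. */
#include <stdio.h>
#include <unistd.h>

/* read the aggregate "cpu" line: user, nice, system, idle (2.4-era layout) */
static int read_cpu(unsigned long long v[4])
{
    FILE *f = fopen("/proc/stat", "r");
    int n;

    if (!f)
        return -1;
    n = fscanf(f, "cpu %llu %llu %llu %llu", &v[0], &v[1], &v[2], &v[3]);
    fclose(f);
    return (n == 4) ? 0 : -1;
}

int main(void)
{
    unsigned long long a[4], b[4], d[4], total = 0;
    int i;

    if (read_cpu(a) != 0)
        return 1;
    sleep(5);                       /* sampling interval */
    if (read_cpu(b) != 0)
        return 1;

    for (i = 0; i < 4; i++) {
        d[i] = b[i] - a[i];
        total += d[i];
    }
    if (total == 0)
        return 1;

    printf("user %.1f%%  nice %.1f%%  system %.1f%%  idle %.1f%%\n",
           100.0 * d[0] / total, 100.0 * d[1] / total,
           100.0 * d[2] / total, 100.0 * d[3] / total);
    return 0;
}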
Comment 12 Bill Nottingham 2004-10-11 11:26:50 EDT
Closing MODIFIED bugs as fixed. Please reopen if the problem persists.
