Bug 178414 - Bad: Load Average goes very high
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64 Linux
Priority: medium, Severity: medium
Assigned To: Larry Woodman
QA Contact: Brian Brock
 
Reported: 2006-01-20 05:58 EST by Rajdeep Sengupta
Modified: 2008-08-02 19:40 EDT

Doc Type: Bug Fix
Last Closed: 2007-10-19 14:48:29 EDT

Description Rajdeep Sengupta 2006-01-20 05:58:26 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

Description of problem:
We have an AMD Opteron based server with 16 GB of RAM and dual CPUs.
The load average of this machine increases over time regardless of the actual load. It reaches 1000 to 1200 within 2-3 days, even though no large process is running on the server and the CPU is 98% idle.
As a result, telnet and rlogin to this machine stop working and we have to physically reboot the server. This is causing us a lot of pain.

The details of the OS and machine are:
 
Linux galaxy 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:53:57 EST 2004 x86_64 x86_64 x86_64 GNU/Linux

 15:47:15  up 1 day,  1:25,  3 users,  load average: 1095.47, 1090.55, 1079.06
 15:47:21  up 1 day,  1:25,  3 users,  load average: 1095.59, 1090.66, 1079.16
 15:47:26  up 1 day,  1:25,  3 users,  load average: 1095.78, 1090.78, 1079.26
 15:47:31  up 1 day,  1:25,  3 users,  load average: 1095.96, 1090.90, 1079.36
 15:47:36  up 1 day,  1:25,  3 users,  load average: 1096.12, 1091.02, 1079.46
2249 processes: 2247 sleeping, 1 running, 1 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.1%    0.0%    1.1%   0.0%     0.0%    0.0%   98.6%
           cpu00    0.3%    0.0%    2.3%   0.0%     0.0%    0.0%   97.2%
           cpu01    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
Mem:  16117096k av,  815884k used, 15301212k free,       0k shrd,   74632k buff
                    398836k actv,   55424k in_d,     520k in_c
Swap: 10241428k av,    8120k used, 10233308k free                  253016k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 3437 root      16   0  3180 3180  1072 R     1.3  0.0   0:00   0 top
    1 root      15   0   520  480   448 S     0.0  0.0   0:04   0 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
    9 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    7 root      15   0     0    0     0 SW    0.0  0.0   1:34   0 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:18   1 kscand
   10 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kupdated
   11 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
   21 root      15   0     0    0     0 SW    0.0  0.0   0:01   0 kjournald
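
The summary line in the top output above has no separate count for tasks in uninterruptible sleep (state D), yet Linux includes such tasks in the load average along with runnable ones. A minimal sketch for counting processes by state from /proc follows. The function names are illustrative, it assumes a Python 3 interpreter is available on the machine, and it counts processes rather than individual threads, so it may undercount.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    return rest[0]


def count_states(proc_root="/proc"):
    """Count processes by state letter (R, S, D, Z, T, ...)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            # The process exited while we were reading it.
            continue
    return counts


if __name__ == "__main__":
    states = count_states()
    print(dict(states))
    print("runnable + uninterruptible:", states["R"] + states["D"])
```

If the last figure printed is close to the reported load average while the CPUs are idle, the load is probably coming from tasks stuck in state D rather than from CPU demand.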


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Run a memory-hungry process; the load average increases to 8-10.
2. After the process completes, leave the machine idle for some time; the load average keeps increasing.
3. After 2 days the load average is 1000 or more and the machine does not allow new logins.

  

Additional info:
Comment 1 Ernie Petrides 2006-01-20 15:10:02 EST
Could you please report whether this problem occurs on U6 or later?

The latest released kernel is 2.4.21-37.0.1.EL, which is the post-U6
security erratum released yesterday.

Thanks in advance.
Comment 2 Larry Woodman 2006-01-20 15:34:53 EST
I have never seen this internally on any of our x86_64 systems running RHEL3 (or on any
other hardware/kernel combination, for that matter).  What applications are
you running on this system, and does more than one system display this
problem?

Thanks, Larry Woodman
Comment 3 Rajdeep Sengupta 2006-01-23 01:02:39 EST
OK,
I will download the U6 version and check.

To answer Larry's question:
We have two such servers, one with 8 GB and the other with 16 GB.
Both have the same problem.

On these servers we run ASIC designs with our EDA software. Normally, I do not
see this issue on our Sun servers; only on Linux does the load
average keep increasing.
For example, if you run one test, the load average becomes, say, 3. Once the test
run is complete, the load average does not decrease, even though the
CPU, I/O wait, and memory all show the machine is free.
Now you run another test and the load average increases from 3 to 8; again, the load
average does not decrease once that test is complete and the machine is free.

This goes on and soon it reaches 1000, though the machine is free. At that point,
those who are already logged in can work slowly, but any new login hangs.
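
For context on the behaviour described above: the Linux load average is an exponentially damped average of the number of active tasks, sampled about every 5 seconds, where "active" covers both runnable tasks and tasks in uninterruptible sleep. The sketch below is a simplified floating-point model of that averaging (the kernel itself uses fixed-point arithmetic), and the stuck-task scenario is a hypothetical illustration, not a diagnosis of this bug.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between load samples


def update_load(load, active_tasks, window_seconds):
    """One exponential-moving-average step toward active_tasks."""
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    return load * decay + active_tasks * (1.0 - decay)


def simulate(active_per_sample, window_seconds=60.0):
    """Feed a sequence of active-task counts and return the final load."""
    load = 0.0
    for active in active_per_sample:
        load = update_load(load, active, window_seconds)
    return load


if __name__ == "__main__":
    one_hour = 720  # 720 samples * 5 s
    # Case 1: 3 tasks are active for an hour, then exit cleanly.
    print(round(simulate([3] * one_hour + [0] * one_hour), 2))
    # Case 2: the same 3 tasks never leave the active count
    # (hypothetically stuck in uninterruptible sleep).
    print(round(simulate([3] * one_hour + [3] * one_hour), 2))
```

In this model the average only decays when the active count itself drops. A load that ratchets upward after every test run is therefore consistent with tasks that remain counted as active after the test ends, which is what a task dump would confirm or rule out.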
Comment 4 Larry Woodman 2006-12-08 09:00:50 EST
If this is still a problem, please get me an AltSysrq-M, an AltSysrq-W and an
AltSysrq-T output so I can see what the system is doing.

Thanks, Larry Woodman
Comment 5 Rajdeep Sengupta 2006-12-13 05:52:54 EST
What do you mean by AltSysrq-M, AltSysrq-W, etc.? How do I run these commands?
Please send me the details.
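
For reference, AltSysrq-M, AltSysrq-W and AltSysrq-T are magic SysRq key combinations: hold Alt and SysRq (Print Screen) on the console keyboard and press the letter. The kernel prints the result to the console and kernel log. M reports memory usage and T lists all tasks with their states; the exact meaning of W depends on the kernel version. The keyboard combinations normally require the kernel.sysrq sysctl to be set to 1. Without console access, the same dumps can usually be requested by writing the letter to /proc/sysrq-trigger as root. The sketch below assumes that file exists on the running kernel and that Python 3 is available; the function name is illustrative.

```python
import sys

SYSRQ_TRIGGER = "/proc/sysrq-trigger"

# Letters matching the three requested key combinations.
REQUESTED = ("m", "w", "t")


def trigger_sysrq(letter, path=SYSRQ_TRIGGER):
    """Write a single SysRq command letter to the trigger file."""
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("expected a single letter, got %r" % (letter,))
    with open(path, "w") as f:
        f.write(letter.lower())


if __name__ == "__main__":
    # Must run as root; output goes to the kernel log, not to stdout.
    for letter in REQUESTED:
        trigger_sysrq(letter)
    sys.stdout.write(
        "Done. Collect the output with dmesg or from /var/log/messages.\n"
    )
```

The task list can be long on a machine with over 2000 processes, so the system log file is likely a more complete source than dmesg, whose buffer may wrap.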
Comment 6 RHEL Product and Program Management 2007-10-19 14:48:29 EDT
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
