Bug 146017 - high load average unresponsive server
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All Linux
Priority: medium Severity: medium
Assigned To: Larry Woodman
QA Contact: Brian Brock
Reported: 2005-01-24 14:03 EST by Jose Traver
Modified: 2007-11-30 17:07 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-04-07 10:06:31 EDT


Attachments
SysRq log for M, W, T (22.97 KB, application/x-gzip)
2005-01-24 14:07 EST, Jose Traver
Capture file with "vmstat 1" and top (8.13 KB, application/x-gzip)
2005-03-02 07:21 EST, Jose Traver
Description Jose Traver 2005-01-24 14:03:14 EST
Description of problem:
We have a *production* server running RHEL AS3 Update 4 with
all errata packages installed. The running kernel was
2.4.21-27.0.1.ELsmp. The same problem occurred with earlier kernels.

Periodically, every 10 to 15 days, the load average climbs to 200-300
for no apparent reason, blocking connections. From a session opened
before the load average rises, the top command shows no process
generating that load.

This server runs a web server with a remote ODBC connection.
Dynamically generated web pages are still served in this state, but no
other type of connection is available (ssh, telnet, ftp).

I attach the SysRq log (M, W, T) for this situation. After reboot, the
server is running kernel 2.4.21-27.0.2.ELsmp.

Version-Release number of selected component (if applicable):
kernel-2.4.21-27.0.2.ELsmp

How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Fujitsu-Siemens PRIMERGY RX300 
Intel(R) Xeon(TM) CPU 3.06GHz 
4 GB RAM
2 x 73GB internal, SW RAID (md)
7 x 146GB external, HW RAID on an aacraid
Comment 1 Jose Traver 2005-01-24 14:07:21 EST
Created attachment 110141
SysRq log for M, W, T
Comment 3 Larry Woodman 2005-01-25 10:50:41 EST
Jose, unfortunately the above attachment does not show a system with a high load
average.  In this case both CPUs were running the idle loop and all other
processes were blocked.  In addition there was no memory deficit.  Can you get
the system in this state and get a "vmstat 1" and "top" output so I can see if
they agree?

Thanks, Larry Woodman
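
For readers following the diagnosis: Linux counts tasks in uninterruptible sleep (state D) toward the load average as well as runnable ones (state R), so a high load average with idle CPUs is possible. The sketch below is one way to put the load average next to a per-state process count, as a complement to "vmstat 1" and "top". It is not something used in this report. It assumes a Python 3 interpreter (not available on a stock RHEL 3 system) and the standard /proc/loadavg and /proc/<pid>/stat files; the function names are hypothetical.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so take everything after the last ')'.
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def count_states(proc_root="/proc"):
    """Tally process states for every numeric entry under proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (IOError, OSError):
            # The process exited between listdir() and open().
            continue
    return counts


def read_loadavg(path="/proc/loadavg"):
    """Return the 1-, 5- and 15-minute load averages as floats."""
    with open(path) as f:
        return tuple(float(x) for x in f.read().split()[:3])


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    print("load average:", read_loadavg())
    for state, n in sorted(count_states().items()):
        print(state, n)
```

If the load average is in the hundreds while almost nothing is in state R, the count of D-state tasks is the first figure to compare against it.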


Comment 4 Jose Traver 2005-03-02 07:21:40 EST
Created attachment 111564
Capture file with "vmstat 1" and top

Hello Larry,
I caught the server in this state again and ran both "vmstat 1" and "top".
The capture is attached.

Looking through the capture, there are many crond processes, grouped in
parent-child pairs, which could be causing the reported problem. According to
"strace", the child process does nothing while the parent process is waiting
on a read, so both are idle. I tried to kill these processes, but only the
parent processes died.

Running "lsof" on one of the remaining child processes showed that it was
using the "audit" feature, so I stopped the audit service and all the crond
child processes died. I then restarted crond, and the load average returned
to normal.

I have now disabled the audit service and restarted the system so I can test
whether audit is responsible for the high load average. If so, I guess I
should report it as an audit package bug, shouldn't I?
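
The parent-child crond pairs described above could also be listed from /proc rather than picked out of "top" by eye. This is a minimal sketch, not the method used in this report. It assumes Python 3 and the standard /proc/<pid>/stat layout (pid, command name in parentheses, state, parent pid); the helper names are hypothetical.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state, ppid) from a /proc/<pid>/stat line."""
    open_paren = stat_line.index("(")
    close_paren = stat_line.rfind(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def crond_pairs(stat_lines, name="crond"):
    """Return sorted (parent_pid, child_pid) pairs where both are `name`."""
    procs = {}
    for line in stat_lines:
        pid, comm, _state, ppid = parse_stat(line)
        procs[pid] = (comm, ppid)
    pairs = []
    for pid, (comm, ppid) in procs.items():
        parent = procs.get(ppid)
        if comm == name and parent is not None and parent[0] == name:
            pairs.append((ppid, pid))
    return sorted(pairs)


def read_stat_lines(proc_root="/proc"):
    """Collect the stat line of every process that can still be read."""
    lines = []
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            try:
                with open(os.path.join(proc_root, entry, "stat")) as f:
                    lines.append(f.read())
            except (IOError, OSError):
                continue  # process exited while scanning
    return lines
```

On a live system, `crond_pairs(read_stat_lines())` returns the pairs. The main crond daemon also appears as the parent of each job process it forked, so a long list that keeps growing between runs is the pattern to look for.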
Comment 5 Larry Woodman 2005-04-07 10:06:31 EDT
Please turn auditing off unless you want to run in a CAPP EAL3 environment.
Auditing is enabled by default and it will impact system performance.

Larry Woodman
