Bug 146017 - high load average unresponsive server
Summary: high load average unresponsive server
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Larry Woodman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-01-24 19:03 UTC by Jose Traver
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-04-07 14:06:31 UTC
Target Upstream Version:
Embargoed:


Attachments
SysRq log for M, W, T (22.97 KB, application/x-gzip)
2005-01-24 19:07 UTC, Jose Traver
Capture file with "vmstat 1" and top (8.13 KB, application/x-gzip)
2005-03-02 12:21 UTC, Jose Traver

Description Jose Traver 2005-01-24 19:03:14 UTC
Description of problem:
We have a *production* server running RHEL AS3 Update4 with
all errata packages installed. Kernel running was 2.4.21-27.0.1.ELsmp.
 Same problem with other previous kernels.

Periodically, every 10 to 15 days, the load average climbs to 200-300
for no apparent reason, blocking connections. From a session opened before
the load average rises, top shows no process that would account for
such a load.

This server runs a web server with a remote ODBC connection. Dynamically
generated web pages are still served in these situations, but no other
kind of connection is available (ssh, telnet, ftp).

I am attaching the SysRq M, W and T logs for this situation. After reboot,
the server is running kernel 2.4.21-27.0.2.ELsmp.

Version-Release number of selected component (if applicable):
kernel-2.4.21-27.0.2.ELsmp

How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Fujitsu-Siemens PRIMERGY RX300 
Intel(R) Xeon(TM) CPU 3.06GHz 
4 GB RAM
2 x 73GB internal, SW RAID (md)
7 x 146GB external, HW RAID on an aacraid

Comment 1 Jose Traver 2005-01-24 19:07:21 UTC
Created attachment 110141 [details]
SysRq log for M, W, T

Comment 3 Larry Woodman 2005-01-25 15:50:41 UTC
Jose, unfortunately the above attachment does not show a system with a high load
average.  In this case both CPUs were running the idle loop and all other
processes were blocked.  In addition there was no memory deficit.  Can you get
the system in this state and get a "vmstat 1" and "top" output so I can see if
they agree?

Thanks, Larry Woodman
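
For readers following along: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable tasks (state R), which is how a system can report a very high load while its CPUs sit in the idle loop. The following is only a minimal sketch of how one might cross-check that, not the tooling used in this bug. The function names are hypothetical; the file formats are those of /proc/loadavg and /proc/<pid>/stat.

```python
import glob
from collections import Counter


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1min, 5min, 15min) floats."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def state_from_stat(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may contain spaces,
    so split on the last ')' before reading the state field.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(stat_lines):
    """Count tasks per state letter (R = runnable, D = uninterruptible, ...)."""
    return Counter(state_from_stat(line) for line in stat_lines)


def read_live_stat_lines():
    """Read every /proc/<pid>/stat on the running system (Linux only)."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            pass  # the process exited between glob() and open()
    return lines
```

On an affected machine, comparing `parse_loadavg(open("/proc/loadavg").read())` with `count_states(read_live_stat_lines())` would indicate whether the load figure is made up mostly of R or of D tasks.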




Comment 4 Jose Traver 2005-03-02 12:21:40 UTC
Created attachment 111564 [details]
Capture file with "vmstat 1" and top

Hello Larry,
I've caught the server in this state again and ran both "vmstat 1" and "top".
I am including the capture as an attachment.

Looking through the capture, there are a lot of crond processes, grouped in
parent-child pairs, which could be causing the reported problem. According to
"strace", the child process does nothing while the parent process is waiting on
a read, so both are idle. I tried to kill these processes, but only the
parent processes died.
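
The parent-child grouping described above can be checked mechanically rather than by reading through top output. This is a hedged sketch only: it assumes a capture taken with `ps -eo pid,ppid,stat,comm`, and the helper names `parse_ps` and `parent_child_pairs` are made up for illustration.

```python
def parse_ps(text):
    """Parse `ps -eo pid,ppid,stat,comm` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, stat, comm = line.split(None, 3)
        rows.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "stat": stat,
            "comm": comm.strip(),
        })
    return rows


def parent_child_pairs(rows, name="crond"):
    """Return (parent_pid, child_pid) pairs where both processes are `name`.

    The main daemon also appears as a parent here, since each job is
    forked from it.
    """
    by_pid = {r["pid"]: r for r in rows}
    pairs = []
    for r in rows:
        parent = by_pid.get(r["ppid"])
        if r["comm"] == name and parent is not None and parent["comm"] == name:
            pairs.append((parent["pid"], r["pid"]))
    return pairs
```

A large number of pairs, all sharing the same state letter in the `stat` column, would match the pile-up of crond processes seen in the capture.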

Running "lsof" on one of the remaining child processes showed that it was
using the "audit" feature, so I stopped the audit service, and all the
crond child processes died. I then restarted crond, which brought the
system back to a normal load average.

I have now disabled the audit service and restarted the system so I can test
whether audit is responsible for the high load average. If so, I guess I
should report it as an audit package bug, shouldn't I?

Comment 5 Larry Woodman 2005-04-07 14:06:31 UTC
Please turn auditing off unless you want to run in a CAPP EAL3 environment.
Auditing is enabled by default and it will impact system performance.

Larry Woodman


