Description of problem:
We have a *production* server running RHEL AS3 Update4 with all errata packages installed. The running kernel was 2.4.21-27.0.1.ELsmp; the same problem occurred with previous kernels. Periodically, every 10 to 15 days, the load average climbs to 200-300 for no apparent reason, blocking connections. From a session opened before the load average rises, top shows no process generating that load. The server runs a web server with a remote ODBC connection. In these situations dynamically generated web pages are still served, but no other kind of connection is available (ssh, telnet, ftp). I attach the SysRq log (M, W, T) for this situation. After reboot, the server is running kernel 2.4.21-27.0.2.ELsmp.
Version-Release number of selected component (if applicable): kernel-2.4.21-27.0.2.ELsmp
How reproducible: n/a
Steps to Reproduce: 1. 2. 3.
Actual results:
Expected results:
Additional info:
Fujitsu-Siemens PRIMERGY RX300
Intel(R) Xeon(TM) CPU 3.06GHz
4 GB RAM
2 x 73GB internal, SW RAID (md)
7 x 146GB external, HW RAID on an aacraid
Created attachment 110141 [details] SysRq log for M, W, T
Jose, unfortunately the above attachment does not show a system with a high load average. In this case both CPUs were running the idle loop and all other processes were blocked. In addition there was no memory deficit. Can you get the system in this state and get a "vmstat 1" and "top" output so I can see if they agree? Thanks, Larry Woodman
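For context on why idle CPUs and a high load average can coexist: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones. The following is a minimal sketch, not part of the original report, that prints /proc/loadavg next to a tally of process states so the two can be compared. It assumes Python 3 and the usual /proc/<pid>/stat layout; the function names are hypothetical.

```python
import glob
from collections import Counter


def parse_stat_line(line):
    """Return (pid, comm, state, ppid) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def count_states(stat_lines):
    """Tally process states (R, S, D, Z, ...) over a list of stat lines."""
    return Counter(parse_stat_line(line)[2] for line in stat_lines)


def read_stat_lines():
    """Read every /proc/<pid>/stat that is still present."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as f:
                lines.append(f.read())
        except OSError:
            continue  # the process exited between glob and open
    return lines


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        print("loadavg:", f.read().strip())
    print("states: ", dict(count_states(read_stat_lines())))
```

A large D count next to a small R count would be consistent with a load average driven by blocked tasks rather than by CPU work.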
Created attachment 111564 [details] Capture file with "vmstat 1" and top
Hello Larry, I caught the server in this state again and ran both "vmstat 1" and "top"; the capture is attached. Looking through the capture, there are many crond processes, grouped in parent-child pairs, which could be causing the reported problem. According to strace, the child process does nothing while the parent is waiting on a read, so both are idle. I tried to kill these processes, but only the parents died. Running lsof on one of the remaining children showed that it was using the audit facility, so I stopped the audit service and all the crond children died. I then restarted crond, and the load average returned to normal. I have now disabled the audit service and rebooted the system so I can test whether audit is responsible for the high load average. If it is, I guess I should report it as a bug against the audit package, shouldn't I?
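The parent-child grouping described above can be pulled out of a process listing automatically. This is a rough sketch under stated assumptions, not the method used in the report: it assumes Python 3.7+ and a procps `ps` that accepts `-eo pid,ppid,stat,comm`, and the helper names `parse_ps` and `children_of` are hypothetical.

```python
import subprocess


def parse_ps(ps_output):
    """Parse `ps -eo pid,ppid,stat,comm` text into {pid: (ppid, stat, comm)}."""
    procs = {}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        procs[int(pid)] = (int(ppid), stat, comm.strip())
    return procs


def children_of(procs, name):
    """Return sorted (parent_pid, child_pid, child_stat) tuples for every
    process whose parent's command name equals `name`."""
    pairs = []
    for pid, (ppid, stat, _comm) in procs.items():
        parent = procs.get(ppid)
        if parent is not None and parent[2] == name:
            pairs.append((ppid, pid, stat))
    return sorted(pairs)


if __name__ == "__main__":
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for parent, child, stat in children_of(parse_ps(out), "crond"):
        print("crond parent %d -> child %d (state %s)" % (parent, child, stat))
```

Running it before and after stopping the audit service would show whether the stuck crond children disappear, which is the check described in the comment above.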
Please turn auditing off unless you want to run in a CAPP EAL3 environment. Auditing is enabled by default and it will impact system performance. Larry Woodman