Description of problem:
We have a group of 10 systems that we use to distribute builds (gcc). When these systems are under heavy load, the builds (make processes) hang in a blocked, deadlocked state. The NFS servers do not show excessive load or any other problems.
NFS gcc source: NetApp Release 6.4.5P2
NFS build source: NetApp Release 6.5.2R1P9
The 10 build clients have exactly the same hardware and configuration. I'll be attaching the following from a system that is exhibiting this behavior:
# lspci -vv
# lsmod
# cat /proc/meminfo
# cat /proc/cpuinfo
# uname -a
I will also attach as much console output as I can capture. Unfortunately, we cannot recreate the hang on demand. Please let me know if there is any other information I can provide that would be helpful.
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 3 Update 4 x86_64
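A minimal sketch of gathering the diagnostics listed in the report into a single file, assuming a POSIX shell. The function name `collect_sysinfo` and the output file argument are hypothetical and not part of the original report; only the five commands come from it.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): run the five diagnostic
# commands listed in the report and write their output to one file.

collect_sysinfo() {
    out="$1"
    # Create or truncate the output file; give up if it is not writable.
    : > "$out" || return 1
    for cmd in "lspci -vv" "lsmod" "cat /proc/meminfo" "cat /proc/cpuinfo" "uname -a"; do
        # Header line so each command's output can be found later.
        printf '===== # %s =====\n' "$cmd" >> "$out"
        # Record a failure instead of aborting, so one missing tool
        # does not lose the rest of the data.
        if ! sh -c "$cmd" >> "$out" 2>&1; then
            printf '(command failed or unavailable: %s)\n' "$cmd" >> "$out"
        fi
    done
}

# Only collect when an output file is named on the command line.
if [ -n "${1:-}" ]; then
    collect_sysinfo "$1"
fi
```

Saved under a name of your choosing (for example `collect_sysinfo.sh`, an assumed name), it could be run on an affected client as `sh collect_sysinfo.sh sysinfo.txt` and the resulting file attached to the report.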
Created attachment 109975 [details] Output from SysRq
Output from SysRq attached: SysRq : Show CPUs, SysRq : Show State, SysRq : Show Memory, SysRq : Crashing the kernel by request
Created attachment 109976 [details] Output from SysRq attached: SysRq : Show CPUs, SysRq : Show State, SysRq : Show Memory
Created attachment 109977 [details] Output from SysRq attached: SysRq : Show CPUs, SysRq : Show State, SysRq : Show Memory
Created attachment 109978 [details] Additional system information
From the system "sif029":
# lspci -vv
# lsmod
# cat /proc/meminfo
# cat /proc/cpuinfo
# uname -a
Perhaps related: we have a system that uses QLogic FC HBAs connected to a CX600. Locally it works fine, but over NFS with an SMP kernel the system crashes (multiple crash dumps uploaded; see Service Request 366184). We found that a UP kernel resolved the oops issue. The current crash dump and logs are included in that service request. This happens with all kernels up to and including the latest errata. For the longest time the server would only hang, but the latest SMP kernels actually oops and produce a netdump.
Could this be related? https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=138182
Created attachment 110124 [details] Revertion of Sillydelete patch Yes, it appears rpciod is hanging in the same places as in bz# 138182. Please try the attached patch, which should eliminate the hang.
Two days, no hangs... looks good so far
Cool... thanks for the update!
Created attachment 113735 [details] perl script & C++ combo to hang nfs, plus systrace We have several rooms full of Dell Optiplex GX260s running RHEL3. I grabbed the latest kernel (2.4.21-27.0.4.ELsmp #1 SMP), and I can get NFS to hang reliably by compiling the attached C++ and using the attached Perl script to run the C++ binary. I've also attached a systrace of the hung machine. Steve Dickson's attachment above (id=110124) seems to fix the problem, but I sort of expected the patch to have been incorporated into this release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html Please see bug 138182 comment #47 for more details. *** This bug has been marked as a duplicate of 138182 ***