From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Description of problem:
Since the "upgrade" to RHEL 3 two weeks ago, we have had to reboot the login nodes every few days. They hang, user sessions stall, and new users may or may not get a password prompt. "top" shows low usage, except for iowait near 100% on all processors. Typically, both login nodes stall and are rebooted. The Red Hat 7.3 fileserver node does not require rebooting. Neither do the RHEL 3 computational nodes. Other cluster sys admins we've talked to report a similar problem, but less frequently.

We do not know how to identify which user job (if any) is generating a large number of I/O requests. Recently we added a few more mounted file systems, which are accessible to all 130 or so computational nodes. Our file systems are NFS mounted, connected by GigE switches. What other information would be helpful to you?

Version-Release number of selected component (if applicable):

How reproducible:
Didn't try

Actual Results:
We are currently running some IOzone tests on the file system(s) to see if we can reproduce the error. Note, though, that the stalling nodes are not the fileservers.

Additional info:
I'm assuming you're using NFS to serve a large computational cluster and some of the clients are hanging? If this is the case, please post system traces (i.e. echo t > /proc/sysrq-trigger) from both the server and the clients (assuming they are Linux clients). Note: To enable system traces, either edit /etc/sysctl.conf and set kernel.sysrq to 1, or run 'echo 1 > /proc/sys/kernel/sysrq'. If configured correctly, a system trace of every process will appear in /var/log/messages.
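
For convenience, the two steps above could be wrapped in a small shell function along these lines. This is only a sketch: the function name dump_tasks and the SYSRQ_ENABLE / SYSRQ_TRIGGER variables are made up for illustration (they exist only so the function can be tried against ordinary files first), and it assumes a root shell on a node where syslog writes kernel messages to /var/log/messages.

```shell
#!/bin/sh
# Sketch only: enable SysRq and request a dump of every task's state.
# SYSRQ_ENABLE and SYSRQ_TRIGGER are hypothetical override variables;
# by default they point at the real /proc files named in the comment above.
SYSRQ_ENABLE="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

dump_tasks() {
    # Turn the SysRq interface on for the running kernel
    # (the runtime equivalent of kernel.sysrq = 1 in /etc/sysctl.conf).
    echo 1 > "$SYSRQ_ENABLE" || return 1

    # 't' asks the kernel to log the state and stack of every task.
    echo t > "$SYSRQ_TRIGGER" || return 1
}
```

Running dump_tasks as root on a hung login node (and again on the fileserver) while the stall is in progress should leave the traces in /var/log/messages, assuming the default syslog setup; that section of the log can then be attached to this bug.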