From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; WinNT4.0; en-US; rv:0.9.1) Gecko/20010607
Description of problem:
I think this is actually a deadlock of some sort in the kernel, but the actual effect is that 'ps aux' hangs up and can't be kill -9'ed. 'strace ps aux' runs through a load of the processes and then gives:
stat64("/proc/30945", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/30945/stat", O_RDONLY) = 7
read(7,
(Interestingly, I could Ctrl-C out of this. I can't tell if the 'ps' process died, or just the 'strace', though, because 'ps' doesn't work. ;-)
'xosview' shows that one CPU spends almost 100% of its time doing 'system' work. (Our machine has 4 CPUs.)
I don't know what triggers this; it's happened twice now, the first time after 2 weeks of uptime, this time after only about 3 days. I don't know if this occurs on uniprocessor machines or non-x86 machines. We have a quad-PIII-Xeon machine and that's what we're seeing it on.
How reproducible: Sometimes
Steps to Reproduce: When it happens, running 'ps aux' will hang. I haven't found a way to make it happen yet.
Additional info:
Which kernel version is this? 2.4.2-2, or the recently released 2.4.3-12 update kernel?
bob:~: $ uname -a
Linux bob 2.4.2-2smp #1 SMP Sun Apr 8 20:21:34 EDT 2001 i686 unknown
bob:~: $ /sbin/lsmod
Module                  Size  Used by
nfs                    82816   1 (autoclean)
nfsd                   70976   8 (autoclean)
lockd                  53232   1 (autoclean) [nfs nfsd]
sunrpc                 66352   1 (autoclean) [nfs nfsd lockd]
eepro100               17232   1 (autoclean)
ipchains               41632   0 (unused)
usbcore                52416   1
aic7xxx               136336   0
megaraid               21712  10
sd_mod                 11744  10
scsi_mod               98624   3 [aic7xxx megaraid sd_mod]
(usbcore doesn't seem to load successfully, but we don't use USB on the system anyway. The bus in most use is the megaraid.)
I know it's a lame answer, but could you try upgrading to the 2.4.3-12 kernel?
Now I know it exists, we probably will. The main problem with this is that the revision histories I've read so far don't make any mention of the sort of lockups I'm seeing, which leads me to suspect it won't be fixed... I'll report back if we have the problems again.
'uname' now gives: Linux bob 2.4.3-12smp #1 SMP Fri Jun 8 14:38:50 EDT 2001 i686 unknown ...and it's happened again. Taken a while for it to occur, admittedly, but it always was a bit random.
It's crashed 3 times this week, which is annoying since it's a machine used for compilation and work by about 25 people. Any more bright ideas?
I've put an experimental 2.4.3-15 up at http://people.redhat.com/arjanv/testkernels which has a reiserfs bugfix and a "top crashes" bugfix in it. You could give that a shot, to see if the top bugfix is actually a fix for your top/ps bug too.
I have experienced the same behavior on a single CPU machine. top and ps will hang and some of the running processes are extremely slow (at least 30x slower than normal).
Linux version 2.4.2-2 (root.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-79)) #1 Sun Apr 8 20:41:30 EDT 2001
Initializing CPU#0
Detected 1002.291 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1998.84 BogoMIPS
Memory: 1028572k/1048512k available (1365k kernel code, 19552k reserved, 92k data, 236k init, 131008k highmem)
I've now used the new kernel (2.4.3-15). Unsurprisingly, it's doing just the same thing. We've monitored the /proc filesystem through the last few lockups and we've noticed that it always seems to be the directory of an Oracle communications process (the process that runs when you use the bequeath protocol and communicates with the server processes). This is an educated guess made by examining the processes that we can still identify on the running system; since we can't read the /proc information for the affected entry, it's not possible to make a direct identification of the process. This is pretty obviously only a trigger, though: no process should be hanging up on a /proc read. We've also seen the machine recover from this state, exactly once, with the 2.4.3-12 kernel. We don't know why. Is there any easy way of getting debugging output from a stock Red Hat kernel? Up until now, we've only tried an unfocused "upgrade and cross your fingers" approach, and it sounds to me like we need to identify the problem more specifically so that we can get to the root cause and be certain that we have a fix for it.
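For anyone wanting to narrow down which /proc entry is stuck without wedging their own shell, here is a minimal sketch, assuming Python 3 is available on the machine. It reads each /proc/<pid>/stat from a throwaway child process and gives up after a timeout, so a read that blocks in the kernel ties up the child rather than the monitor. The names probe_stat and scan, the proc_root parameter, and the 2 second timeout are all hypothetical choices for illustration, not part of any existing tool.

```python
import os
import subprocess
import sys


def probe_stat(pid, timeout=2.0, proc_root="/proc"):
    """Return 'ok', 'gone', or 'hung' for a read of <proc_root>/<pid>/stat."""
    path = os.path.join(proc_root, str(pid), "stat")
    # Do the read in a child process: if the read blocks inside the
    # kernel, only the child gets stuck, not this monitor.
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import sys; open(sys.argv[1], 'rb').read()", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The kill may have no effect on a child in uninterruptible
        # sleep, so deliberately do not wait for it afterwards.
        child.kill()
        return "hung"
    # A non-zero exit usually means the process exited before the read.
    return "ok" if rc == 0 else "gone"


def scan(timeout=2.0, proc_root="/proc"):
    """Probe every numeric directory under proc_root; return {pid: status}."""
    pids = sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())
    return {pid: probe_stat(pid, timeout, proc_root) for pid in pids}
```

Calling scan() and printing the entries whose status is 'hung' would give the PID to look up by other means (for example in the SysRq task list). Spawning one interpreter per PID is slow, but it keeps the sketch simple; a stuck child may linger until reboot, just as the hung 'ps' does.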
I answered some of my own questions, and I'll attach some information on the last lockup garnered with SysRq and System.map. In this instance, the machine was basically idle and the only processors active were 0 (running swapper, occasionally) and 3 (running the oracle process; 100% system time according to xosview).
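For reference, resolving the raw addresses in SysRq output against System.map comes down to a nearest-symbol lookup. The following is a minimal sketch in Python, assuming the usual "address type name" layout of System.map lines; the function names load_system_map and resolve are hypothetical, and this is not how klogd or ksymoops are implemented.

```python
import bisect


def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        try:
            symbols.append((int(parts[0], 16), parts[2]))
        except ValueError:
            continue  # first field was not a hex address
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the last symbol at or below address, else None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[index]
    return name, address - base
```

Typical use would be load_system_map(open("/boot/System.map")) followed by resolve(symbols, int(addr, 16)) for each address in the trace. The System.map path is an assumption, and the file has to match the running kernel exactly or the names will be wrong.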
Created attachment 27693: SysRq output from the kernel when in a locked state.
I'm seeing this on a Red Hat 7.3 system with 2.4.19 (latest RH kernel). Usually this happens after running a Java app (using IBM's JDK 1.3.1 for Linux). The process will spawn a couple of threads and take all the available CPU. After I kill the processes, the ps command will hang (as will top). The only way to get around it is to do a hard reboot (shutdown and reboot do not work).
My kernel is actually 2.4.18-18.7.xcustom (2.4.18-7.x kernel source distributed by RH plus NTFS support compiled as a module)
That's likely to be a separate problem -- open a new bug. Please collect the sysrq output from the kernel, as the backtraces will tell which process is stuck holding the mm semaphore. Also, check dmesg for any kernel messages during the run. The original bug this message is referring to is presumed fixed during the 7.x cycle, as several problems of this nature were corrected.
I opened bug 80960 as requested.