46937 – 'ps' and 'top' hang, unkillably, on read of /proc/xxx/stat

Summary: 'ps' and 'top' hang, unkillably, on read of /proc/xxx/stat

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Aaron Brown
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-07-02 10:02 UTC by Need Real Name
Modified:	2007-04-18 16:34 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-01-02 16:58:01 UTC
Embargoed:

Attachments
SysRq output from the kernel when in a locked state. (4.10 KB, text/plain) 2001-08-14 12:44 UTC, Need Real Name

Description Need Real Name 2001-07-02 10:02:55 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; WinNT4.0; en-US; rv:0.9.1) Gecko/20010607

Description of problem:
I think this is actually a deadlock of some sort in the kernel, but the
actual effect is that 'ps aux' hangs up and can't be kill -9'ed.
'strace ps aux' runs through a load of the processes and then gives:

  stat64("/proc/30945", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
  open("/proc/30945/stat", O_RDONLY)      = 7
  read(7,

(Interestingly, I could Ctrl-C out of this.  I can't tell if the 'ps'
process died, or just the 'strace', though, because 'ps' doesn't work. ;-)
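For reference, the file that read() is stuck on holds a single line per process. A minimal sketch of how such a line can be split, assuming the usual "pid (comm) state ..." layout; the function name and the sample line are illustrative and are not taken from the hung system:

```python
def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into pid, command name and state."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    # The first field after the command name is the state letter.
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


if __name__ == "__main__":
    # Hypothetical sample line, shortened for illustration.
    sample = "30945 (example) D 1 30945 30945 0"
    print(parse_stat_line(sample))
```

A state of 'D' means uninterruptible sleep, which is what a reader stuck inside the kernel would be expected to show and would explain why kill -9 has no effect.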

'xosview' shows that one CPU spends almost 100% of its time doing 'system'
work.  (Our machine has 4 CPUs.)
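The per-CPU figure that xosview displays can also be approximated by hand from two snapshots of /proc/stat. The sketch below assumes the first four numeric fields of each "cpuN" line are user, nice, system and idle ticks; newer kernels append more fields, which this version ignores, so the result is only a rough share. The function names are illustrative.

```python
def cpu_times(stat_text):
    """Map 'cpu0', 'cpu1', ... to (user, nice, system, idle) tick counts."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-CPU lines are 'cpuN ...'; the bare 'cpu' line is the total.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            times[fields[0]] = tuple(int(v) for v in fields[1:5])
    return times


def system_share(before, after):
    """Fraction of elapsed ticks each CPU spent in system mode."""
    shares = {}
    for cpu, new in after.items():
        old = before.get(cpu, (0, 0, 0, 0))
        delta = [n - o for n, o in zip(new, old)]
        total = sum(delta)
        shares[cpu] = delta[2] / total if total else 0.0
    return shares
```

Taking one snapshot with cpu_times(open("/proc/stat").read()), waiting a few seconds, taking another and passing both to system_share() should show a value close to 1.0 for the CPU that is spinning in the kernel.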

I don't know what triggers this; it's happened twice now, the first time
after 2 weeks of uptime, this time after only about 3 days.

I don't know if this occurs on uniprocessor machines or non-x86 machines. 
We have a quad-PIII-Xeon machine and that's what we're seeing it on.

How reproducible:
Sometimes

Steps to Reproduce:
When it happens, running 'ps aux' will hang.  I haven't found a way to make
it happen yet.

Additional info:

Comment 1 Arjan van de Ven 2001-07-02 16:26:28 UTC

Which kernel version is this?
2.4.2-2 or the recently released 2.4.3-12 update kernel?

Comment 2 Need Real Name 2001-07-02 16:32:55 UTC

bob:~: $ uname -a
Linux bob 2.4.2-2smp #1 SMP Sun Apr 8 20:21:34 EDT 2001 i686 unknown
bob:~: $ /sbin/lsmod
Module                  Size  Used by
nfs                    82816   1 (autoclean)
nfsd                   70976   8 (autoclean)
lockd                  53232   1 (autoclean) [nfs nfsd]
sunrpc                 66352   1 (autoclean) [nfs nfsd lockd]
eepro100               17232   1 (autoclean)
ipchains               41632   0 (unused)
usbcore                52416   1
aic7xxx               136336   0
megaraid               21712  10
sd_mod                 11744  10
scsi_mod               98624   3 [aic7xxx megaraid sd_mod]

(usbcore doesn't seem to load successfully, but we don't use USB on the system
anyway.  The bus in most use is the megaraid.)

Comment 3 Arjan van de Ven 2001-07-02 16:37:03 UTC

I know it's a lame answer, but could you try upgrading to the 2.4.3-12 kernel?

Comment 4 Need Real Name 2001-07-02 16:46:33 UTC

Now I know it exists, we probably will.  The main problem with this is that the
revision histories I've read so far don't make any mention of the sort of
lockups I'm seeing, which leads me to suspect it won't be fixed...

I'll report back if we have the problems again.

Comment 5 Need Real Name 2001-07-20 11:13:48 UTC

'uname' now gives:

Linux bob 2.4.3-12smp #1 SMP Fri Jun 8 14:38:50 EDT 2001 i686 unknown

...and it's happened again.  It's taken a while for it to occur, admittedly, but it
always was a bit random.

Comment 6 Need Real Name 2001-08-03 14:29:32 UTC

It's crashed 3 times this week, which is annoying since it's a machine used for
compilation and work by about 25 people.  Any more bright ideas?

Comment 7 Arjan van de Ven 2001-08-03 14:31:59 UTC

I've put an experimental 2.4.3-15 up at
http://people.redhat.com/arjanv/testkernels which has a reiserfs bugfix and a
"top crashes" bugfix in it. You could give that a shot, to see if the top bugfix
is actually a fix for your top/ps bug too.

Comment 8 Need Real Name 2001-08-09 07:18:38 UTC

I have experienced the same behavior on a single CPU machine. top and
ps will hang and some of the running processes are extremely slow (at
least 30x slower than normal).

Linux version 2.4.2-2 (root.redhat.com) (gcc version 2.96 20000731 (
Red Hat Linux 7.1 2.96-79)) #1 Sun Apr 8 20:41:30 EDT 2001

Initializing CPU#0
Detected 1002.291 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1998.84 BogoMIPS
Memory: 1028572k/1048512k available (1365k kernel code, 19552k reserved, 92k data, 236k init, 131008k highmem)

Comment 9 Need Real Name 2001-08-14 11:03:42 UTC

I've now used the new kernel (2.4.3-15).  Unsurprisingly, it's doing just the
same thing.

We've monitored the /proc filesystem through the last few lockups and we've
noticed that it always seems to be the directory of an Oracle communications
process (the process that runs when you use the bequeath protocol and
communicates with the server processes).  This is an educated guess made by
examining the processes that we can still identify on the running system -
obviously, since we can't read the /proc information it's not possible to make a
direct identification of the process.

This is pretty obviously only a trigger, though - no process should be hanging
up on a /proc read.
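One way to narrow down which /proc entry is affected without hanging the shell is to do each read in a throwaway child process and give up on it after a timeout. This is only a sketch: probe_stat and find_blocked are made-up names, the two-second timeout is arbitrary, and it assumes that listing /proc itself still works, as it did in the strace output.

```python
import os
import subprocess
import sys
import time


def probe_stat(pid, timeout=2.0, proc_root="/proc"):
    """Return 'ok', 'failed' or 'blocked' for a read of <proc_root>/<pid>/stat."""
    path = os.path.join(proc_root, str(pid), "stat")
    # Do the read in a throwaway child so that a read stuck in the kernel
    # cannot take this script down with it.
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import sys; open(sys.argv[1], 'rb').read()", path],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = child.poll()
        if status is not None:
            # Non-zero exit: the file vanished or could not be opened.
            return "ok" if status == 0 else "failed"
        time.sleep(0.05)
    # Deliberately no kill() followed by wait(): a child stuck in
    # uninterruptible sleep ignores signals, and waiting for it would
    # hang this script as well.
    return "blocked"


def find_blocked(proc_root="/proc", timeout=2.0):
    """List the numeric PIDs whose stat file could not be read in time."""
    pids = sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())
    return [pid for pid in pids
            if probe_stat(pid, timeout, proc_root) == "blocked"]
```

Calling find_blocked() on an affected machine should return the PID or PIDs whose stat file never comes back. Each blocked probe leaves one stuck child behind, which is the price of not hanging the caller.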

We've also seen the machine recover from this state, exactly once, with the
2.4.3-12 kernel.  We don't know why.

Is there any easy way of getting debugging out of a stock Red Hat kernel?  Up
until now, we've only tried an unfocused "upgrade and cross your fingers"
approach, and it sounds to me like we need to try to identify the problem more
specifically so that we can get to the root cause and be certain that we have a
fix for it.

Comment 10 Need Real Name 2001-08-14 12:43:20 UTC

I answered some of my own questions, and I'll attach some information on the
last lockup garnered with SysRq and System.map.  In this instance, the machine
was basically idle and the only processors active were 0 (running swapper,
occasionally) and 3 (running the oracle process; 100% system time according to
xosview).
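For anyone repeating this, turning the raw addresses in a SysRq dump into function names comes down to finding the nearest symbol at or below each address in System.map. A rough sketch, assuming the usual "address type name" line format; the function names are illustrative:

```python
import bisect


def load_system_map(lines):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[index]
    return "%s+0x%x" % (name, address - base)
```

With symbols = load_system_map(open(path_to_system_map)), each address from the dump can be passed to resolve(symbols, int(text, 16)). The System.map must belong to the exact kernel build that produced the dump, otherwise the names will be wrong.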

Comment 11 Need Real Name 2001-08-14 12:44:25 UTC

Created attachment 27693
SysRq output from the kernel when in a locked state.

Comment 12 Thomas J. Philpot 2002-12-30 16:23:24 UTC

I'm seeing this on a Red Hat 7.3 system with 2.4.19 (latest RH kernel).  Usually
this happens after running a java app (using IBM's JDK 1.3.1 for Linux).  The
process will spawn a couple of threads and take all the available CPU.  After I
kill the processes, the ps command will hang (as will top).  The only way to get
around it is to do a hard reboot (shutdown and reboot do not work).

Comment 13 Thomas J. Philpot 2002-12-30 16:38:25 UTC

My kernel is actually 2.4.18-18.7.xcustom (2.4.18-7.x kernel source distributed
by RH plus NTFS support compiled as a module)

Comment 14 Ben LaHaise 2003-01-02 16:58:01 UTC

That's likely to be a separate problem -- open a new bug.  Please collect the
sysrq output from the kernel, as the backtraces will tell which process is stuck
holding the mm semaphore.  Also, check dmesg for any kernel messages during the
run.  The original bug this message is referring to is presumed fixed during the
7.x cycle, as several problems of this nature were corrected.
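On kernels that provide /proc/sysrq-trigger, the task dump can be requested without console access. The sketch below assumes that file exists and that the script runs as root; older kernels may lack it, in which case the Alt+SysRq+T key combination on the console is the usual route. The function name is illustrative.

```python
def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log a backtrace of every task (SysRq 't')."""
    # Needs root. The dump goes to the kernel log (dmesg or the console),
    # not to this script's standard output.
    with open(trigger_path, "w") as trigger:
        trigger.write("t")
```

After calling request_task_dump(), the backtraces should appear in dmesg, where they can be saved and attached to the new bug.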

Comment 15 Thomas J. Philpot 2003-01-02 21:01:30 UTC

I opened bug 80960 as requested.
