Bug 56753

Summary:	ps is hung, forever
Product:	[Retired] Red Hat Linux	Reporter:	Frank Hirtz <fhirtz>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED ERRATA	QA Contact:	Brock Organ <borgan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.1
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2003-03-26 18:39:53 UTC	Type:	---

Description Frank Hirtz 2001-11-26 23:02:22 UTC

Description of Problem:

ps was stuck on a specific process.

$ cd /proc/14985
$ cat cmdline    (or status, stat, or maps)
   - hangs!

$ strace cat maps
  
  ......
   
open("maps", O_RDONLY|O_LARGEFILE)      = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
brk(0x8051000)                          = 0x8051000
read(3, 
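
One way to find which /proc entries block, without wedging the shell that does the checking, is to perform each read in a background helper and poll for its completion. The sketch below assumes a POSIX-style shell plus `mktemp` and a `sleep` that accepts fractional seconds; the function names `probe_read` and `scan_proc` are made up for illustration and are not part of any tool mentioned in this report. A helper that blocks in the kernel is simply left behind, since it may not respond to signals.

```shell
#!/bin/sh
# Sketch only: locate /proc/<pid> entries whose reads never return.
# probe_read and scan_proc are hypothetical helper names.

# probe_read FILE [POLLS]
# Reads FILE in a background subshell and polls up to POLLS times,
# 0.1 s apart. Returns 0 if the read finished, 1 if it is still blocked,
# 2 if no flag file could be created.
probe_read() {
    pr_file=$1
    pr_polls=${2:-20}
    pr_flag=$(mktemp) || return 2
    # The helper removes the flag file once cat returns. If cat blocks in
    # the kernel, the flag stays and only the helper is stuck, not us.
    ( cat "$pr_file"; rm -f "$pr_flag" ) >/dev/null 2>&1 &
    while [ -e "$pr_flag" ] && [ "$pr_polls" -gt 0 ]; do
        sleep 0.1   # assumption: sleep accepts fractions (GNU, busybox)
        pr_polls=$((pr_polls - 1))
    done
    if [ -e "$pr_flag" ]; then
        rm -f "$pr_flag"
        return 1
    fi
    return 0
}

# scan_proc [PROC_ROOT]
# Prints "blocked: PID" for every numeric directory under PROC_ROOT
# (default /proc) whose cmdline read did not finish in time.
scan_proc() {
    sp_root=${1:-/proc}
    for sp_dir in "$sp_root"/[0-9]*; do
        sp_status=0
        probe_read "$sp_dir/cmdline" || sp_status=$?
        if [ "$sp_status" -eq 1 ]; then
            echo "blocked: ${sp_dir##*/}"
        fi
    done
}
```

Calling `scan_proc` with no arguments walks the live /proc. Each healthy entry costs roughly one 0.1 s poll, so a scan of a few hundred processes takes tens of seconds; that is the price of never reading /proc directly from the scanning shell.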

The process is a user process (in a case I have happening right now).
I can't say what the process is, because I can't even do an "ls -l" on any
file in /proc/${pid} for the process in question (it hangs). It may be
related to processes spawned by running "man", as that's the window where I
first experienced a hang. I assume the process still exists, as its
directory still exists under /proc, but again I can't look at it because of
the issue already mentioned. Attaching strace somehow released whatever was
blocking, so the problem has now gone away. Here's the strace:

$ strace -p 13181
close(4) = 0
open("/lib/i686/libm.so.6", O_RDONLY) = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320H\0"..., 1024) = 1024
fstat64(4, {st_mode=S_IFREG|0755, st_size=624962, ...}) = 0
old_mmap(NULL, 142580, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x4006b000
mprotect(0x4008d000, 3316, PROT_NONE) = 0
old_mmap(0x4008d000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x21000) = 0x4008d000
close(4) = 0
open("/lib/i686/libc.so.6", O_RDONLY) = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@\307\1"..., 1024) = 1024
fstat64(4, {st_mode=S_IFREG|0755, st_size=5780406, ...}) = 0
old_mmap(NULL, 1291080, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x4008e000
mprotect(0x401c0000, 37704, PROT_NONE) = 0
old_mmap(0x401c0000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x131000) = 0x401c0000
old_mmap(0x401c6000, 13128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x401c6000
close(4) = 0
munmap(0x40018000, 62314) = 0
brk(0) = 0x805fd8c
brk(0x805fe2c) = 0x805fe2c
brk(0x8060000) = 0x8060000
brk(0x8061000) = 0x8061000
open("./devlatin1/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/groff/font/devlatin1/DESC", O_RDONLY) = 4
fstat64(4, {st_mode=S_IFREG|0644, st_size=88, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
read(4, "res 240\nhor 24\nvert 40\nunitwidth"..., 4096) = 88
read(4, "", 4096) = 0
close(4) = 0
munmap(0x40018000, 4096) = 0
pipe([4, 5]) = 0
fork() = 13320
close(5) = 0
pipe([5, 6]) = 0
fork() = 13321
close(4) = 0
close(6) = 0
fork() = 13322
close(5) = 0
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], 0, NULL) = 13320
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], 0, NULL) = 13321
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, [WIFEXITED(s) && WEXITSTATUS(s) == 0], 0, NULL) = 13322
--- SIGCHLD (Child exited) @ 0 (0) ---
_exit(0) = ?

Version-Release number of selected component (if applicable):

2.4.9-12smp

How Reproducible:
It occurs irregularly, but often enough in our environment that we can
collect whatever information would help in correcting it.

Additional Information:
	
It seems as if it may be similar to bugs #46937, #49208, and #51236.

Comment 1 Arjan van de Ven 2001-11-27 08:44:19 UTC

There's a known deadlock opportunity between the coredump code and a semaphore
that "ps" or "top" takes. Is it possible the task in question was about to dump
core?

Comment 2 Frank Hirtz 2001-11-27 17:39:45 UTC

I doubt this is the problem, unless the subprocesses spawned by "man lockd" often
coredump, especially with nobody else on the system... and neither ps nor top
was running at the time. I ran ps after the "man" command seemed to hang, not
before.

Comment 3 Arjan van de Ven 2001-11-27 19:29:29 UTC

Which filesystem(s) are in use? In particular, where are the man pages located?

Comment 4 Frank Hirtz 2001-11-27 19:58:42 UTC

AFS (openafs 1.2.2) mounted, like most of the installed and running software in
our environment.

There were no AFS errors on the console or in the syslog messages before,
during, or after the time of the /proc hang.

Comment 5 Arjan van de Ven 2001-11-27 21:38:28 UTC

There's no oops in dmesg, I assume?

Comment 6 Frank Hirtz 2001-11-27 22:46:52 UTC

No oops or any interesting messages anywhere.

Comment 7 Arjan van de Ven 2001-11-28 08:34:49 UTC

OK, can you enable sysrq and get Alt-SysRq-T output? That should give a list
of processes and where in the kernel they are; hopefully that will shed light on
where things are stuck.
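
A rough sketch of that request in shell, assuming root access and a kernel built with magic SysRq support. The function name `request_task_dump` and its optional root-directory argument are illustrative only. `/proc/sysrq-trigger` may not exist on a kernel as old as the one in this report, in which case the keyboard combination is the only route.

```shell
#!/bin/sh
# Sketch only: enable magic SysRq and ask the kernel for a task dump.
# request_task_dump is a hypothetical helper name.

# request_task_dump [PROC_ROOT]
# PROC_ROOT defaults to /proc; the argument exists so the function can be
# tried against a scratch directory instead of the live system.
request_task_dump() {
    td_root=${1:-/proc}
    # Turn the SysRq key on (root only on a real system).
    echo 1 > "$td_root/sys/kernel/sysrq" || return 1
    if [ -e "$td_root/sysrq-trigger" ]; then
        # Writing "t" here requests the same dump as Alt-SysRq-T.
        echo t > "$td_root/sysrq-trigger" || return 1
        echo "task dump requested; read it with dmesg"
    else
        echo "no sysrq-trigger file; press Alt-SysRq-T on the console"
    fi
}
```

On a live system this would be run as root as `request_task_dump`; the dump goes to the kernel log, so `dmesg` or the console is where the per-task state and kernel stack information appear.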

Comment 8 Frank Hirtz 2003-03-26 18:39:53 UTC

Corrected in 2.4.9-e.8 kernel.