From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Description of problem:
The bug as I understand it is this:
- a process calls fork()
- the parent does some NFS I/O that blocks
- someone hits ^C
- the parent goes away but leaves the request for the page outstanding
- the child exits

There is no one left to reap them, and so they stay in the D state.

We found this patch on Trond's page, linux-2.4.20-16-waitq.dif. It fixes our problem and has been working in production for several days now.

http://www.fys.uio.no/~trondmy/src/2.4.20/linux-2.4.20-16-waitq.dif
from: http://www.fys.uio.no/~trondmy/src/2.4.20/

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Have a process that does NFS I/O fork a child.
2. killall the process by name.
3. Notice that the child is sometimes left in the D state.

Actual Results: process left in the D state

Expected Results: process is inherited by init

Additional info:
Created attachment 89427 [details] patch which fixed the problem for us
Note that the key thing to understand here is that the parent exits through the error path when you hit ^c.
From: Jim Garlick <garlick>
To: Ben Woodard <bwoodard>
Cc: Mike Haskell <haskell5>
Subject: Re: Kernel/131
Date: Fri, 17 Jan 2003 14:37:11 -0800 (PST)

I actually seemed to need two processes reading from two separate files to get it to happen. I attached a "sure fire" reproducer to our gnat which starts two threads reading at the same time. Run the reproducer once on an SMP system, hit ^C, then try to umount the filesystem. umount will hang, and crash will show it is waiting uninterruptibly for a lock on the page.

Jim
Sorry, I got two similar bugs confused. This is the original text:

On MCR we have a problem that seems to occur when processes doing I/O to bluearcs over NFS are interrupted with a SIGINT. Processes will block in an unkillable state, stuck in a read system call on a file on a remote bluearc. Crash says the stack for the hung process looks like this:

system_call -> sys_read -> nfs_file_read -> generic_file_read -> do_generic_file_read -> lock_page -> __lock_page -> schedule

Following the page referenced in the __lock_page argument (and correlating with pages from hung processes on several other nodes), I see the pages have the PG_locked, PG_dirty, and PG_error flags set. PG_locked explains why the system call went to sleep in TASK_UNINTERRUPTIBLE on the page's wait queue. Why the lock is never released is another question! There is no NFS client retry activity on the eip0 interface; however, the NFS server and even the file remain responsive from the node with the hung process (located by tracing the page to its inode). I guess it is no surprise that dd'ing the entire file to /dev/null on the node with the hung process causes dd to hang on the same page - there is still a PG_locked bit on it and nobody is going to release the lock!

=========Added garlick 2003-01-16

Applied the trond patch described in the audit trail and was unable to reproduce using the single-node test. Will try the test on the whole cluster. Noted another failure mode on the unpatched kernel - after aborting the reproducer prun with ctrl-C, no c2d processes were left behind but a umount hung in __wait_on_page. I *was* able to reproduce both the zombie c2d's and the umounts on MDEV. A more "sure fire" reproducer (myzmb.tgz) that just reads some big files on two CPUs is attached. Instructions are in the .c file. This one so far has always created a zombie umount.
Created attachment 89428 [details] reproducer for the bug
From: Mike Haskell <haskell5>
Reply-To: haskell5
To: Ben Woodard <bwoodard>
Cc: Jim Garlick <garlick>
Subject: Re: Kernel/131
Date: Fri, 17 Jan 2003 15:45:51 -0800

Actually, the parent goes into a 'D' state waiting for a page that will never arrive. The children finish and are waiting in do_wait() via exit() as zombies for their parent to come along and reap them. The umount hangs because of the outstanding pages the parent holds against the NFS server via the client. Why we didn't get a "umount: filesystem busy" error is unknown. It obviously thought it could be let go.
Seems to be fixed in current RHEL3 kernels - LLNL has closed the IT ticket --> closing BZ