145121 – ccsd can get stuck on startup with zombie child

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	ccs
Sub Component:
Version:	3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jonathan Earl Brassow
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-01-14 16:03 UTC by Adam "mantis" Manthei
Modified:	2009-04-16 20:04 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-05-25 16:41:10 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2005:466	0	normal	SHIPPED_LIVE	GFS bug fix update	2005-05-25 04:00:00 UTC

Description Adam "mantis" Manthei 2005-01-14 16:03:16 UTC

Description of problem:
The startup process for ccsd leaves a small race where the child
process can die but the parent won't recognize it, resulting in a
zombie child and a parent process that never stops when trying to
daemonize.

Version-Release number of selected component (if applicable):
GFS-6.0.2-24

How reproducible:
always

Steps to Reproduce:
1. Start ccsd from an initrd where the loopback interface (lo) has not
been configured.

Actual results:
With lo not configured, the child will exit with the error "Unable to
bind socket: Cannot assign requested address" (although you will not
see this in daemon mode).  Meanwhile, the parent is running the
following code, expecting the child to send it SIGTERM:

    } else if (pid !=0){
      while(1){
        sleep(30);
      }
      /* close the parent */
      exit(EXIT_SUCCESS);
    }

It would probably be safer for the parent to do a quick check on the
child with `waitpid(pid, &status, WNOHANG)` instead of a while(1) loop
that may never finish if the child dies before sending SIGTERM.
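
To illustrate the suggestion, here is a minimal sketch of a non-blocking
child check. The helper name `child_has_exited` is hypothetical and is not
taken from the ccsd source; it only shows how `waitpid()` with `WNOHANG`
distinguishes a running child from a dead one.

```c
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper, not the ccsd source: poll the child once.
 * Returns 1 if the child has terminated (it is reaped by this call, so
 * no zombie is left behind), 0 if it is still running, and -1 if
 * waitpid() itself failed (for example, no such child).
 */
static int child_has_exited(pid_t pid, int *status)
{
    pid_t r = waitpid(pid, status, WNOHANG);

    if (r == pid)
        return 1;   /* child is gone; *status holds how it ended */
    if (r == 0)
        return 0;   /* child exists and has not changed state */
    return -1;      /* error; errno is set by waitpid() */
}
```

The parent's loop could call such a helper on every iteration, with a
short sleep in between, and leave the loop with a failure status as soon
as it returns nonzero.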

Expected results:
ccsd should have a nonzero exit status

Additional info:

Comment 1 Jonathan Earl Brassow 2005-01-14 17:15:46 UTC

This bug has brought two issues to my attention.  The first is the issue 
as stated.  The second is the way the parent exits.

The parent calls exit from a signal handler.  Some versions of gcc will not 
allow exit to be called from within a signal handler.  Although RHEL3 should 
have the right compilers, we have moved away from doing it this way.  Now, 
we set a variable in the signal handler that is checked when we return from it.

The first issue has been addressed by calling waitpid.  An additional benefit of 
this is that, if the child fails, the parent can now check its exit code, 
determine how it failed, and print an appropriate error message.
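
As a sketch of what that status check might look like, the following
decodes a `waitpid()` status into a message and an exit code. The helper
name `describe_child_status` and the message wording are assumptions for
illustration; the actual ccsd exit codes and messages are not shown in
this report.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper, not the ccsd source: write a description of the
 * child's fate into buf and return a code the parent could exit with.
 * A child that exited normally yields its own exit code; anything else
 * yields EXIT_FAILURE.
 */
static int describe_child_status(int status, char *buf, size_t len)
{
    if (WIFEXITED(status)) {
        int code = WEXITSTATUS(status);
        snprintf(buf, len, "child exited with status %d", code);
        return code;
    }
    if (WIFSIGNALED(status)) {
        snprintf(buf, len, "child killed by signal %d", WTERMSIG(status));
        return EXIT_FAILURE;
    }
    snprintf(buf, len, "child in unexpected state");
    return EXIT_FAILURE;
}
```

The caller decides what to do with a zero exit code from a child that
never reported a successful start; this sketch passes it through as is.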

This fix also has the side benefit of fixing the ugly hack I was using to check 
the lockfile.  The hack also left open a race where we could get a zombie 
process if two ccsd instances were started at the same time.

Comment 2 Jay Turner 2005-05-25 16:41:11 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-466.html
