Bug 145121 - ccsd can get stuck on startup with zombie child
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: ccs
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2005-01-14 11:03 EST by Adam "mantis" Manthei
Modified: 2009-04-16 16:04 EDT
Doc Type: Bug Fix
Last Closed: 2005-05-25 12:41:10 EDT
Description Adam "mantis" Manthei 2005-01-14 11:03:16 EST
Description of problem:
The startup process for ccsd leaves a small race where the child
process can die but the parent won't recognize it, resulting in a
zombie child and a parent process that never stops when trying to

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Start ccsd from an initrd where the loopback device (lo) has not been
   configured.

Actual results:
With lo not configured, the child exits with the error "Unable to
bind socket: Cannot assign requested address" (although you will not
see this in daemon mode).  Meanwhile, the parent is running the
following code, expecting the child to send it SIGTERM:

    } else if (pid !=0){
      /* close the parent */

It would probably be safer for the parent to do a quick check on the
child with `waitpid(pid, &status, WNOHANG)` instead of a while(1) loop
that may never finish if the child dies before sending SIGTERM.
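
A minimal sketch of that suggestion, not the actual ccsd code: the parent
polls for either a readiness flag or the child's death. The helper name
`wait_for_child`, the `ready` flag (assumed to be set by the parent's
SIGTERM handler) and the 10 ms poll interval are all hypothetical.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/*
 * Hypothetical helper: wait until the child reports readiness or dies.
 * `ready` is assumed to be set to nonzero by the parent's SIGTERM handler.
 * Returns 0 if the child signalled readiness, the child's exit code if it
 * exited with a nonzero status, and 1 for any other failure.
 */
static int wait_for_child(pid_t pid, const volatile sig_atomic_t *ready)
{
    const struct timespec poll_delay = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    int status;

    for (;;) {
        if (*ready)
            return 0;

        /* WNOHANG: return immediately if the child is still running. */
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            /* The child has been reaped here, so no zombie is left behind. */
            if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
                return WEXITSTATUS(status);
            return 1; /* exited 0 without signalling, or killed by a signal */
        }
        if (r < 0)
            return 1; /* waitpid itself failed */

        nanosleep(&poll_delay, NULL);
    }
}
```

With something like this, the parent could pass the return value straight
to exit(), which would give the nonzero exit status asked for under
"Expected results".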

Expected results:
ccsd should exit with a nonzero status.

Additional info:
Comment 1 Jonathan Earl Brassow 2005-01-14 12:15:46 EST
This bug has brought two issues to my attention.  The first is the issue
as stated.  The second is the way the parent exits.

The parent calls exit from a signal handler.  Some versions of gcc will not 
allow exit to be called from within a signal handler.  Although RHEL3 should 
have the right compilers, we have moved away from doing it this way.  Now, 
we set a variable in the signal handler that is checked when we return from it.

The first issue has been addressed by calling waitpid.  An additional benefit
is that, if the child fails, the parent can now check its exit code,
determine how it failed, and print an appropriate error message.
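
As a rough sketch of what checking the child's exit code could look like
(the real ccsd exit codes and messages are not given in this report, so
the helper name `describe_child_status` and the message strings below are
made up):

```c
#include <stdio.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: turn a wait status from waitpid() into a message
 * and return the exit code the parent should use.
 */
static int describe_child_status(int status, char *buf, size_t len)
{
    if (WIFEXITED(status)) {
        int code = WEXITSTATUS(status);
        if (code == 0) {
            /* A clean exit before signalling the parent is still a failure. */
            snprintf(buf, len, "child exited before signalling the parent");
            return 1;
        }
        snprintf(buf, len, "child exited with status %d", code);
        return code;
    }
    if (WIFSIGNALED(status)) {
        snprintf(buf, len, "child killed by signal %d", WTERMSIG(status));
        return 1;
    }
    snprintf(buf, len, "child ended in an unexpected state");
    return 1;
}
```

The parent could print `buf` to stderr and exit with the returned value.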

This fix also has the side benefit of replacing the ugly hack I was using to
check the lockfile.  The hack also left open a race where we could get a
zombie process if two ccsd instances were started at the same time.
Comment 2 Jay Turner 2005-05-25 12:41:11 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

