Bug 145121

Summary:	ccsd can get stuck on startup with zombie child
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Adam "mantis" Manthei <amanthei>
Component:	ccs	Assignee:	Jonathan Earl Brassow <jbrassow>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3	CC:	cluster-maint
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-05-25 16:41:10 UTC	Type:	---

Description Adam "mantis" Manthei 2005-01-14 16:03:16 UTC

Description of problem:
The startup process for ccsd leaves a small race where the child
process can die but the parent won't recognize it, resulting in a
zombie child and a parent process that never stops when trying to
daemonize.

Version-Release number of selected component (if applicable):
GFS-6.0.2-24

How reproducible:
always

Steps to Reproduce:
1. Start ccsd from an initrd where the loopback device (lo) has not been
configured.
  
Actual results:
With lo not configured, the child will exit with the error "Unable to
bind socket: Cannot assign requested address" (although you will not
see this in daemon mode).  Meanwhile, the parent is running the
following code, expecting the child to send it SIGTERM:

    } else if (pid !=0){
      while(1){
        sleep(30);
      }
      /* close the parent */
      exit(EXIT_SUCCESS);
    }

It would probably be safer for the parent to do a quick check on the
child with `waitpid(pid, &status, WNOHANG)` instead of a while(1) loop
that may never finish if the child dies before sending SIGTERM.
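
A minimal sketch of that suggestion, assuming the parent keeps its existing
SIGTERM handling for the success path. The helper name `watch_child` is
hypothetical and is not taken from the ccsd source; it only shows the
non-blocking `waitpid` poll replacing the unconditional sleep loop.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from ccsd: watch the child while the
 * parent waits to be told that startup succeeded.
 * Returns 0 once the child has terminated (it is reaped, so no zombie
 * remains) with its wait status stored in *status, or -1 if waitpid()
 * fails.
 */
static int watch_child(pid_t pid, int *status)
{
    for (;;) {
        pid_t r = waitpid(pid, status, WNOHANG);

        if (r == pid)
            return 0;            /* child is gone: stop waiting */
        if (r < 0 && errno != EINTR)
            return -1;           /* e.g. ECHILD: nothing to wait for */
        sleep(1);                /* child still running: poll again */
    }
}
```

In the parent branch this would stand in for the `while(1)` loop: if
`watch_child` returns, the child died before sending SIGTERM, and the parent
can exit with a nonzero status instead of sleeping forever.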

Expected results:
ccsd should exit with a nonzero status.

Additional info:

Comment 1 Jonathan Earl Brassow 2005-01-14 17:15:46 UTC

This bug has brought two issues to my attention.  The first is the issue
stated above.  The second is the way the parent exits.

The parent calls exit from a signal handler.  Some versions of gcc will not 
allow exit to be called from within a signal handler.  Although RHEL3 should 
have the right compilers, we have moved away from doing it this way.  Now, 
we set a variable in the signal handler that is checked when we return from it.
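
For illustration, the flag-based approach might look like the sketch below.
The names `got_sigterm`, `sigterm_handler`, and `install_sigterm_handler` are
assumptions and do not come from the ccsd source. One general reason to prefer
this pattern is that POSIX does not list `exit()` among the async-signal-safe
functions, whereas writing to a `volatile sig_atomic_t` variable is safe.

```c
/* Request POSIX declarations (sigaction) under strict -std= modes. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the handler, checked by the parent after the handler returns. */
static volatile sig_atomic_t got_sigterm = 0;

/* The handler only records that the signal arrived; it does not exit. */
static void sigterm_handler(int signo)
{
    (void)signo;
    got_sigterm = 1;
}

/* Hypothetical setup helper: returns 0 on success, -1 on failure. */
static int install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigterm_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}
```

The parent's wait loop would then test `got_sigterm` on each pass and call
`exit(EXIT_SUCCESS)` from normal code once it is set.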

The first issue has been addressed by calling waitpid.  An additional benefit
of this is that, if the child fails, the parent can now check its exit code,
determine how it failed, and print an appropriate error message.
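
As a rough sketch of that check, the parent could decode the wait status with
the standard `WIFEXITED`/`WEXITSTATUS`/`WIFSIGNALED` macros. The function name
`report_child_status` and the message wording are illustrative assumptions;
the exit codes ccsd actually uses are not given in this report.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: turn a wait status from waitpid() into a message
 * on stderr and an exit code for the parent.
 */
static int report_child_status(int status)
{
    if (WIFEXITED(status)) {
        int code = WEXITSTATUS(status);

        if (code == 0)
            return EXIT_SUCCESS;
        fprintf(stderr, "ccsd: startup failed, child exited with status %d\n",
                code);
        return EXIT_FAILURE;
    }
    if (WIFSIGNALED(status)) {
        fprintf(stderr, "ccsd: startup failed, child killed by signal %d\n",
                WTERMSIG(status));
        return EXIT_FAILURE;
    }
    fprintf(stderr, "ccsd: startup failed, unexpected wait status 0x%x\n",
            (unsigned int)status);
    return EXIT_FAILURE;
}
```

A real implementation could map specific child exit codes to more specific
messages (for example, a bind failure), which is the benefit described above.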

This fix also has the side benefit of fixing the ugly hack I was using to check
the lockfile.  The hack also left open a race where we could get a zombie
process if two ccsd instances were started at the same time.
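
This report does not show how the lockfile check was changed. Purely as an
illustration of one race-free way to let only a single instance start, the
sketch below takes a non-blocking exclusive `flock()` on the lockfile. The
helper name `acquire_lockfile` and the choice of `flock()` are assumptions,
not the method used in the fix.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: open (creating if needed) the lockfile and take an
 * exclusive, non-blocking lock on it.
 * Returns the open descriptor on success; the caller keeps it open for the
 * lifetime of the daemon. Returns -1 if the file cannot be opened or if
 * another holder already has the lock.
 */
static int acquire_lockfile(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        close(fd);               /* someone else holds the lock */
        return -1;
    }
    return fd;
}
```

Because the kernel grants the lock atomically, two instances started at the
same time cannot both succeed, and the lock is released automatically when
the holder exits.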

Comment 2 Jay Turner 2005-05-25 16:41:11 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-466.html