Description of problem:
The startup process for ccsd leaves a small race: the child process can die without the parent recognizing it, resulting in a zombie child and a parent process that never exits when trying to daemonize.

Version-Release number of selected component (if applicable):
GFS-6.0.2-24

How reproducible:
Always

Steps to Reproduce:
1. Start ccsd from an initrd where the loopback device (lo) has not been configured.

Actual results:
With lo not configured, the child exits with the error "Unable to bind socket: Cannot assign requested address" (although you will not see this in daemon mode). Meanwhile, the parent is running the following code, expecting the child to send it SIGTERM:

    } else if (pid != 0) {
        while (1) {
            sleep(30);
        }
        /* close the parent */
        exit(EXIT_SUCCESS);
    }

It would probably be safer for the parent to do a quick check on the child with `waitpid(pid, &status, WNOHANG)` instead of a while(1) loop that may never finish if the child dies before sending SIGTERM.

Expected results:
ccsd should exit with a nonzero exit status.
This bug has brought two issues to my attention. The first is the issue as stated. The second is the way the parent exits: it calls exit from a signal handler. Some versions of gcc will not allow exit to be called from within a signal handler. Although RHEL3 should have the right compilers, we have moved away from doing it this way. Now we set a variable in the signal handler, and that variable is checked when we return from the handler.

The first issue has been addressed by calling waitpid. An additional benefit is that, if the child fails, the parent can now check its exit code, determine how it failed, and print an appropriate error message.

This fix also has the side benefit of removing the ugly hack I was using to check the lockfile. That hack also left open a race where we could get a zombie process if two ccsd instances were started at the same time.
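For illustration only, here is a minimal sketch of the pattern described in this comment: the SIGTERM handler only sets a flag, and the parent polls the child with waitpid(..., WNOHANG) instead of sleeping forever. This is not the actual ccsd patch. The names start_daemon, setup_ok, setup_bind_fails, and EXIT_BIND_FAILED are hypothetical, and the 100 ms polling interval is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

enum { EXIT_BIND_FAILED = 2 };  /* hypothetical exit code for a bind failure */

/* Set by the handler; the parent checks it after the handler returns. */
static volatile sig_atomic_t child_ready = 0;

static void sigterm_handler(int sig)
{
    (void)sig;
    child_ready = 1;  /* only set a flag; do not call exit() here */
}

/* Hypothetical stand-ins for the child's startup work (bind socket, etc.). */
int setup_ok(void) { return 0; }
int setup_bind_fails(void) { return EXIT_BIND_FAILED; }

/* Forks, runs child_setup() in the child, and returns the exit status
 * the parent should use. */
int start_daemon(int (*child_setup)(void))
{
    struct sigaction sa;
    struct timespec pause_time = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    pid_t parent = getpid();
    pid_t pid, reaped;
    int status, rc;

    child_ready = 0;
    sa.sa_handler = sigterm_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGTERM, &sa, NULL) < 0)
        return EXIT_FAILURE;

    pid = fork();
    if (pid < 0)
        return EXIT_FAILURE;

    if (pid == 0) {
        /* Child: restore default SIGTERM handling, then do startup work. */
        signal(SIGTERM, SIG_DFL);
        rc = child_setup();
        if (rc != 0)
            _exit(rc);          /* dies without signalling the parent */
        kill(parent, SIGTERM);  /* tell the parent that startup succeeded */
        _exit(EXIT_SUCCESS);    /* a real daemon would enter its main loop */
    }

    /* Parent: wait for the flag, but notice if the child dies first. */
    for (;;) {
        if (child_ready)
            return EXIT_SUCCESS;
        reaped = waitpid(pid, &status, WNOHANG);
        if (reaped == pid) {
            if (child_ready)    /* signalled just before exiting */
                return EXIT_SUCCESS;
            if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
                fprintf(stderr, "child failed during startup (exit code %d)\n",
                        WEXITSTATUS(status));
                return WEXITSTATUS(status);
            }
            fprintf(stderr, "child died before signalling readiness\n");
            return EXIT_FAILURE;
        }
        if (reaped < 0 && errno != EINTR)
            return EXIT_FAILURE;
        nanosleep(&pause_time, NULL);  /* interrupted early by SIGTERM */
    }
}
```

With this shape, a caller could end with `exit(start_daemon(setup_fn))`, so a failed bind in the child surfaces as a nonzero exit status from the parent. Polling keeps the sketch short; blocking SIGTERM and SIGCHLD and waiting in sigsuspend would avoid the polling delay at the cost of more code.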
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-466.html