Description of problem:
According to https://fedoraproject.org/wiki/Packaging/SysVInitScript, the condor init script has LSB compliance issues.

Version-Release number of selected component (if applicable):
RH-7.4.5-0.8.el5

How reproducible:
always

Steps to Reproduce:
1. Install condor.
2. Restart it a few times: service condor restart
3. Wait for the "Warning: condor_master may not have exited, start/restart may fail" message.
4. Check the init script exit code: echo $?

Actual results:
The init script doesn't wait for the condor daemons to stop and still returns 0 as its exit code after 'service condor restart' is called.

Expected results:
The init script must either wait for the condor daemons to stop or exit with a non-zero code.

Additional info:
# service condor restart
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
# service condor restart
Stopping Condor daemons: [ OK ]
Warning: condor_master may not have exited, start/restart may fail
Starting Condor daemons:
# echo $?
0
Affected code in /etc/init.d/condor:

    stop() {
        echo -n $"Stopping Condor daemons: "
        killproc -p $pidfile $prog -QUIT
        RETVAL=$?
        echo
        wait_pid $pidfile 15
        if [ $? -ne 0 ]; then
            # If this happens during a restart the start is likely to see
            # condor still running and just return 0, which means when
            # condor exits it won't be restarted
            echo $"Warning: $prog may not have exited, start/restart may fail"
            RETVAL=1
        fi
        [ $RETVAL -eq 0 ] && rm -f $lockfile
        return $RETVAL
    }

Could the wait_pid value be changed to something bigger, for example 60 seconds? That should be enough for all daemons to shut down, so we could be sure they are really down. 'service condor restart' may take longer to finish, but it will work. As for the return value, restart needs to return a non-zero RETVAL if it doesn't finish within the threshold. All scripts built on top of our init scripts would benefit from that. Right now we don't know whether the daemons are really down or not.
Skipping discussion of 15 sec vs 30 sec vs 300 sec for now. The reason the return value is 0 is because restart = stop;start and the result of stop is discarded. This is also an issue found in the example script from the Fedora link.
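To illustrate the point about the discarded result, here is a minimal sketch with stub functions. The stubs and the name restart_discarding_stop are made up for this example and are not part of the condor init script; they only mimic a failing stop followed by a successful start.

```shell
# Stubs standing in for the real functions: stop fails, start succeeds.
stop()  { return 1; }
start() { return 0; }

# Same shape as "stop; start": the status of stop is never inspected,
# so the function returns whatever start returned.
restart_discarding_stop() {
    stop
    start
}

if restart_discarding_stop; then
    echo "restart reported success although stop failed"
fi
```

Because a shell function returns the status of its last command, the failure from stop is lost as soon as start runs.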
I imagine the underlying issue you have is that, although it makes sense for restart to ignore the result of stop, the stop is initiated but only completes after the start has completed. The start does not reverse the stop operation, and likely should not. The warning is there to tell you that your restart attempt may have failed. In fact, the restart may have operated as a stop. Are you trying to program something that relies on the init script return codes?
Yes. We are using init scripts in almost all of our automated tests. I understand that restart is stop() followed by start(). The main issue is that RETVAL=0 even when stop() failed. The following table shows what can be expected in each case:

           | stop() | start() | restart()
    -------+--------+---------+----------
    RETVAL |   0    |    0    |    0
           |   1    |    -    |    1
           |   0    |    1    |    1

As for the timer discussion, there should be no need to keep changing the value again and again :) I created the following script to enhance the stop() function (proof of concept). Basically, the script remembers the number of running daemons in each iteration and resets the counter if the shutdown is still making progress.

    stop() {
        echo -n $"Stopping Condor daemons: "
        killproc -p $pidfile $prog -QUIT
        RETVAL=$?
        echo
        DAEMON_TIMER=15
        DAEMONS_COUNT=`ps ax | grep condor | wc -l`
        while [ "$DAEMON_TIMER" -gt "0" ]; do
            echo "waitpid"
            wait_pid $pidfile $DAEMON_TIMER
            if [ $? -ne 0 ]; then
                # If this happens during a restart the start is likely to see
                # condor still running and just return 0, which means when
                # condor exits it won't be restarted
                #echo $"Warning: $prog may not have exited, start/restart may fail"
                DAEMONS_ALIVE=`ps ax | grep condor | wc -l`
                echo "Still $DAEMONS_ALIVE/$DAEMONS_COUNT daemons alive..."
                # we can check if some daemons are down
                if [ $DAEMONS_ALIVE -lt $DAEMONS_COUNT ]; then
                    echo "Resetting counter..."
                    DAEMONS_COUNT=$DAEMONS_ALIVE
                else
                    DAEMON_TIMER=0
                fi
                RETVAL=1
            else
                DAEMON_TIMER=0
            fi
        done
        [ $RETVAL -eq 0 ] && rm -f $lockfile
        return $RETVAL
    }

To reproduce, let condor run about 100 jobs with NUM_CPUS=100 and try a restart. The slower the machine, the easier it is to reproduce. Does this sound reasonable to you?
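The reset-on-progress idea in the proof of concept can be isolated into a small sketch. The helpers count_alive and wait_once are hypothetical stand-ins (for the process count and for one bounded wait such as wait_pid with a timeout); they are stubbed here so the loop runs without condor installed. This is only one possible shape of the loop, not the script that was shipped.

```shell
# Hypothetical stub: number of daemons still alive (a real script would count processes).
remaining=3
count_alive() { echo "$remaining"; }

# Hypothetical stub for one bounded wait: simulates one daemon exiting per wait
# and succeeds only when nothing is left.
wait_once() {
    [ "$remaining" -gt 0 ] && remaining=$((remaining - 1))
    [ "$remaining" -eq 0 ]
}

# Keep waiting as long as each wait period shows progress; give up otherwise.
wait_with_progress() {
    local last alive
    last=$(count_alive)
    while :; do
        if wait_once; then
            return 0        # everything exited within this wait period
        fi
        alive=$(count_alive)
        if [ "$alive" -lt "$last" ]; then
            last=$alive     # fewer daemons than before: grant another wait period
        else
            return 1        # no daemon exited during a full wait period: report failure
        fi
    done
}

if wait_with_progress; then
    echo "all daemons stopped"
else
    echo "shutdown stalled"
fi
```

The total wait is bounded by the number of daemons times the per-wait timeout, since every extra wait period requires at least one daemon to have exited.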
http://refspecs.freestandards.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

    The restart, try-restart, reload and force-reload actions may be atomic;
    that is if a service is known not to be operational after a restart or
    reload, the script may return an error without any further action.

--
This sounds like it allows for your "1 - 1" row.
commit b99b97a78fa6c5a312b126467fc55bff42f7f90e
Author: Matthew Farrellee <matt@>
Date:   Tue Feb 22 16:33:40 2011 -0500

    Abort restart attempt with error if stop step fails

diff --git a/src/condor_examples/condor.init b/src/condor_examples/condor.init
index db565a9..6d36110 100644
--- a/src/condor_examples/condor.init
+++ b/src/condor_examples/condor.init
@@ -186,14 +186,18 @@ case "$1" in
         RETVAL=$?
         ;;
     restart)
-        [ $running -eq 0 ] && stop
-        start
+        RETVAL=0
+        if [ $running -eq 0 ]; then
+            stop
+            RETVAL=$?
+        fi
+        [ $RETVAL -eq 0 ] && start
         RETVAL=$?
         ;;
     condrestart|try-restart)
         [ $running -eq 0 ] || exit 0
         stop
-        start
+        [ $? -eq 0 ] && start
         RETVAL=$?
         ;;
     reload|force-reload)
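The patched restart branch can be exercised outside the init script with stubs. In the sketch below, STOP_RC, START_RC, STARTED and do_restart are names invented for illustration, and the service is assumed to be running so that the stop step always executes.

```shell
# Hypothetical knobs for this sketch: the statuses the stubbed stop/start return.
STOP_RC=0
START_RC=0
STARTED=no
stop()  { return "$STOP_RC"; }
start() { STARTED=yes; return "$START_RC"; }

# Mirrors the patched restart branch for a running service:
# start is attempted only if stop succeeded.
do_restart() {
    local RETVAL=0
    STARTED=no
    stop
    RETVAL=$?
    [ $RETVAL -eq 0 ] && start
    RETVAL=$?
    return $RETVAL
}

# Walk the three stop/start combinations from the table earlier in this report.
for rcs in "0 0" "1 0" "0 1"; do
    set -- $rcs
    STOP_RC=$1
    START_RC=$2
    if do_restart; then rc=0; else rc=$?; fi
    echo "stop=$STOP_RC start=$START_RC restart=$rc start_called=$STARTED"
done
```

When stop fails, the test `[ $RETVAL -eq 0 ]` itself returns non-zero, so the following `RETVAL=$?` still ends up non-zero even though start was never called.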
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: restart actions unconditionally start and ignore errors from the stop action.
C: It is difficult to script the init script
F: The init script skips a start attempt and reports an error if stopping fails during a restart.
R: The condor sysV init script now reports an error if stopping the service fails during a restart operation.
Retested on all supported architectures (x86, x86_64 / RHEL5, RHEL6) with:
condor-7.6.1-0.4

The restart action in the condor init script has changed:

    restart)
        RETVAL=0
        if [ $running -eq 0 ]; then
            stop
            RETVAL=$?
        fi
        [ $RETVAL -eq 0 ] && start
        RETVAL=$?
        ;;

# service condor restart
Stopping Condor daemons: [ OK ]
Warning: condor_master may not have exited, start/restart may fail
# echo $?
1

The return code is now correct.

>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0889.html