| Summary: | initscript lsb compliance | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Tomas Rusnak <trusnak> |
| Component: | condor | Assignee: | Matthew Farrellee <matt> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 1.3 | CC: | iboverma, ltoscano, matt, sgraf |
| Target Milestone: | 2.0 | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | condor-7.5.6-0.1 | Doc Type: | Bug Fix |
| Doc Text: |
C: restart actions unconditionally start and ignore errors from the stop action.
C: It is difficult to script the init script
F: The init script skips a start attempt and reports an error if stopping fails during a restart.
R: The condor sysV init script now reports an error if stopping the service fails during a restart operation.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-06-23 15:39:18 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 693778 | ||
|
Description
Tomas Rusnak
2011-02-07 14:27:58 UTC
Affected code in /etc/init.d/condor:
stop() {
    echo -n $"Stopping Condor daemons: "
    killproc -p $pidfile $prog -QUIT
    RETVAL=$?
    echo
    wait_pid $pidfile 15
    if [ $? -ne 0 ]; then
        # If this happens during a restart the start is likely to see
        # condor still running and just return 0, which means when
        # condor exits it won't be restarted
        echo $"Warning: $prog may not have exited, start/restart may fail"
        RETVAL=1
    fi
    [ $RETVAL -eq 0 ] && rm -f $lockfile
    return $RETVAL
}
Could the wait_pid timeout be raised? Can we change it to 60 seconds, for example? That should be long enough for all daemons to shut down, so we can be sure that they are really down. 'service condor restart' may take longer to finish, but it will work.
As for the return value, RETVAL needs to be non-zero if the restart does not finish within the threshold. All scripts built on top of our init scripts would then benefit from it. Right now we cannot tell whether the daemons are really down or not.
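To make the timeout question concrete, here is a minimal sketch of a "wait for the pidfile process to exit" helper. It is not the wait_pid from /etc/init.d/condor, whose implementation is not shown in this report; the name wait_for_exit and its argument order are assumptions for illustration only.

```shell
# Hypothetical helper, NOT the real wait_pid from the condor init script.
# Usage: wait_for_exit <pidfile> <timeout-seconds>
# Returns 0 if the process named in the pidfile is gone, 1 if it is
# still alive once the timeout has elapsed.
wait_for_exit() {
    wfe_pidfile=$1
    wfe_timeout=$2
    # No pidfile means there is nothing to wait for.
    [ -f "$wfe_pidfile" ] || return 0
    wfe_pid=$(cat "$wfe_pidfile")
    while [ "$wfe_timeout" -gt 0 ]; do
        # kill -0 delivers no signal; it only checks that the PID exists.
        kill -0 "$wfe_pid" 2>/dev/null || return 0
        sleep 1
        wfe_timeout=$((wfe_timeout - 1))
    done
    # One last check after the final sleep.
    kill -0 "$wfe_pid" 2>/dev/null || return 0
    return 1
}
```

With a helper of this shape, a larger second argument only lengthens the worst case; a service that stops quickly still returns as soon as the process is gone.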
Skipping discussion of 15 sec vs 30 sec vs 300 sec for now.
The reason the return value is 0 is that restart = stop; start, and the result of stop is discarded. This is also an issue found in the example script from the Fedora link.
I imagine the underlying issue you have is that, though it makes sense for restart to ignore the result of stop, the stop is initiated and completes after the start completes. The start does not reverse the stop operation, and likely should not.
The warning is there to warn you that your restart attempt may have failed. In fact, restart may have operated as a stop.
Are you trying to program something that relies on the init script return codes?
Yes. We are using init scripts in almost all of our automated tests.
I understand that restart is stop() followed by start(). The main issue is with RETVAL=0 when stop() failed.
The following table shows what should be expected in each case:
       | stop() | start() | restart()
--------------------------------------
RETVAL |   0    |    0    |    0
       |   1    |    -    |    1
       |   0    |    1    |    1
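The three rows can be sketched as a small restart wrapper. The stop and start functions below are stand-ins driven by the hypothetical variables STOP_RC and START_RC, so the sketch runs without Condor installed; it is not the code from the init script.

```shell
# Stand-in stop/start; STOP_RC, START_RC and STARTED are hypothetical
# names used only to drive and observe this sketch.
STOP_RC=0
START_RC=0
STARTED=no
stop()  { return "$STOP_RC"; }
start() { STARTED=yes; return "$START_RC"; }

restart() {
    STARTED=no
    stop  || return 1   # row "1 | - | 1": stop failed, start is skipped
    start || return 1   # row "0 | 1 | 1": stop worked, start failed
    return 0            # row "0 | 0 | 0": both steps succeeded
}
```

An automated test can then rely on the exit status of restart alone, without parsing the warning text.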
As for the timer discussion, there is no need to change the value again and again and again :) I created the following script to enhance the stop() function (proof of concept).
Basically, the script remembers the number of running daemons in each iteration and resets the counter if the shutdown is still making progress.
stop() {
    echo -n $"Stopping Condor daemons: "
    killproc -p $pidfile $prog -QUIT
    RETVAL=$?
    echo
    DAEMON_TIMER=15
    DAEMONS_COUNT=`ps ax | grep condor | wc -l`
    while [ "$DAEMON_TIMER" -gt "0" ]; do
        echo "waitpid"
        wait_pid $pidfile $DAEMON_TIMER
        if [ $? -ne 0 ]; then
            # If this happens during a restart the start is likely to see
            # condor still running and just return 0, which means when
            # condor exits it won't be restarted
            #echo $"Warning: $prog may not have exited, start/restart may fail"
            DAEMONS_ALIVE=`ps ax | grep condor | wc -l`
            echo "Still $DAEMONS_ALIVE/$DAEMONS_COUNT daemons alive..."
            # Check whether some daemons went down during this interval
            if [ $DAEMONS_ALIVE -lt $DAEMONS_COUNT ]; then
                echo "Resetting counter..."
                DAEMONS_COUNT=$DAEMONS_ALIVE
            else
                # No progress during this interval: give up and report failure.
                # RETVAL is set only here so that a later successful wait
                # is not reported as an error.
                DAEMON_TIMER=0
                RETVAL=1
            fi
        else
            DAEMON_TIMER=0
        fi
    done
    [ $RETVAL -eq 0 ] && rm -f $lockfile
    return $RETVAL
}
Let condor run about 100 jobs with NUM_CPUS=100 and then try a restart. The slower the machine, the better the results.
Does that sound reasonable to you?
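The "keep waiting while daemons are still going down" idea can be sketched on its own, without ps or wait_pid. In this sketch count_daemons is a hypothetical hook that replays a fixed list of counts from SAMPLES; a real script would count live processes and sleep between samples.

```shell
# Hypothetical hook: replays the space-separated counts in SAMPLES,
# repeating the last value once the list is exhausted. Sets COUNT.
SAMPLES="5 3 3"
count_daemons() {
    COUNT=${SAMPLES%% *}
    case $SAMPLES in
        *" "*) SAMPLES=${SAMPLES#* } ;;
    esac
}

# Returns 0 once no daemons are left, 1 as soon as one interval passes
# without the count going down.
wait_with_progress() {
    count_daemons
    wwp_previous=$COUNT
    while [ "$wwp_previous" -gt 0 ]; do
        # A real script would wait here (sleep or wait_pid) before sampling.
        count_daemons
        if [ "$COUNT" -ge "$wwp_previous" ]; then
            return 1    # no daemon exited during this interval
        fi
        wwp_previous=$COUNT
    done
    return 0
}
```

The total wait is then bounded by the number of daemons times the interval rather than by one fixed timeout, which is the trade-off the proof of concept is after.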
http://refspecs.freestandards.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
"The restart, try-restart, reload and force-reload actions may be atomic; that is if a service is known not to be operational after a restart or reload, the script may return an error without any further action."
-- This sounds like it allows for your "1 - 1" row.
commit b99b97a78fa6c5a312b126467fc55bff42f7f90e
Author: Matthew Farrellee <matt@>
Date:   Tue Feb 22 16:33:40 2011 -0500

    Abort restart attempt with error if stop step fails

diff --git a/src/condor_examples/condor.init b/src/condor_examples/condor.init
index db565a9..6d36110 100644
--- a/src/condor_examples/condor.init
+++ b/src/condor_examples/condor.init
@@ -186,14 +186,18 @@ case "$1" in
         RETVAL=$?
         ;;
     restart)
-        [ $running -eq 0 ] && stop
-        start
+        RETVAL=0
+        if [ $running -eq 0 ]; then
+            stop
+            RETVAL=$?
+        fi
+        [ $RETVAL -eq 0 ] && start
         RETVAL=$?
         ;;
     condrestart|try-restart)
         [ $running -eq 0 ] || exit 0
         stop
-        start
+        [ $? -eq 0 ] && start
         RETVAL=$?
         ;;
     reload|force-reload)
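One detail of the patched restart branch is worth spelling out: when stop fails, the final RETVAL comes from the failed `[ ... ]` test, not from stop itself, so the script reports 1 whatever non-zero code stop returned. A minimal sketch with stub functions follows; the exit code 7 and the function name restart_like_patch are arbitrary choices for illustration, and the `$running` check is left out.

```shell
# Stubs: stop fails with an arbitrary non-zero code, start would succeed.
stop()  { return 7; }
start() { echo "start called"; return 0; }

restart_like_patch() {
    RETVAL=0
    stop
    RETVAL=$?
    # RETVAL is 7 here, so the test fails and start is skipped.
    # $? is now the exit status of '[' (1), not the original 7.
    [ $RETVAL -eq 0 ] && start
    RETVAL=$?
    return $RETVAL
}
```

Callers should therefore test only for zero versus non-zero, which matches the `echo $?` check used in the verification.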
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
C: restart actions unconditionally start and ignore errors from the stop action.
C: It is difficult to script the init script
F: The init script skips a start attempt and reports an error if stopping fails during a restart.
R: The condor sysV init script now reports an error if stopping the service fails during a restart operation.
Retested on all supported architectures (x86, x86_64) on RHEL5 and RHEL6 with:
condor-7.6.1-0.4
Restart function in condor init script changed:
    restart)
        RETVAL=0
        if [ $running -eq 0 ]; then
            stop
            RETVAL=$?
        fi
        [ $RETVAL -eq 0 ] && start
        RETVAL=$?
        ;;
# service condor restart
Stopping Condor daemons: [ OK ]
Warning: condor_master may not have exited, start/restart may fail
# echo $?
1
Return code is now correct.
>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0889.html