Bug 675725 - initscript lsb compliance
Summary: initscript lsb compliance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.0
: ---
Assignee: Matthew Farrellee
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks: 693778
 
Reported: 2011-02-07 14:27 UTC by Tomas Rusnak
Modified: 2013-04-25 09:11 UTC (History)

Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
C: restart actions unconditionally start and ignore errors from the stop action.
C: It is difficult to script the init script
F: The init script skips a start attempt and reports an error if stopping fails during a restart.
R: The condor sysV init script now reports an error if stopping the service fails during a restart operation.
Clone Of:
Environment:
Last Closed: 2011-06-23 15:39:18 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0889 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Grid 2.0 Release 2011-06-23 15:35:53 UTC

Description Tomas Rusnak 2011-02-07 14:27:58 UTC
Description of problem:
According to https://fedoraproject.org/wiki/Packaging/SysVInitScript, the condor
init script has LSB compliance issues.

Version-Release number of selected component (if applicable):
RH-7.4.5-0.8.el5

How reproducible:
always

Steps to Reproduce:
1. install condor
2. restart it a few times: service condor restart
3. wait for "Warning: condor_master may not have exited, start/restart may fail" message
4. check init script exit code - echo $?
  
Actual results:
The init script doesn't wait for the condor daemons to stop and still returns 0 as its exit code after 'service condor restart' is called.


Expected results:
The init script must either wait for the condor daemons to stop or exit with a code other than 0.

Additional info:

# service condor restart
Stopping Condor daemons: [  OK  ]
Starting Condor daemons: [  OK  ]
# service condor restart
Stopping Condor daemons: [  OK  ]
Warning: condor_master may not have exited, start/restart may fail
Starting Condor daemons: 
# echo $?
0
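
The reproduction steps above can be scripted. The following is only a sketch and not part of the original report: restart_loop is a hypothetical helper that runs whatever command it is given a number of times and prints each exit code, stopping at the first non-zero one.

```shell
# restart_loop COUNT CMD [ARGS...]
# Hypothetical helper: run CMD up to COUNT times, print each exit code,
# and stop at the first non-zero one.
restart_loop() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        rc=0
        "$@" || rc=$?
        echo "attempt $i: exit code $rc"
        [ "$rc" -eq 0 ] || return "$rc"
        i=$((i + 1))
    done
    return 0
}
```

It would be invoked as `restart_loop 5 service condor restart`. With the behaviour described in this report, every attempt prints exit code 0 even when the warning has been shown.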

Comment 1 Tomas Rusnak 2011-02-07 15:41:18 UTC
Affected code in /etc/init.d/condor:

stop() {
    echo -n $"Stopping Condor daemons: "
    killproc -p $pidfile $prog -QUIT
    RETVAL=$?
    echo
    wait_pid $pidfile 15
    if [ $? -ne 0 ]; then
        # If this happens during a restart the start is likely to see
        # condor still running and just return 0, which means when
        # condor exits it won't be restarted
        echo $"Warning: $prog may not have exited, start/restart may fail"
        RETVAL=1
    fi
    [ $RETVAL -eq 0 ] && rm -f $lockfile
    return $RETVAL
}

Could the wait_pid value be changed to something bigger, for example 60 seconds? That should be enough for all daemons to go down, so we can be sure they have stopped successfully. 'service condor restart' may take longer to finish, but it will work.

As for the return value, RETVAL needs to be non-zero if the restart doesn't finish within the threshold. All scripts built on top of our init scripts would then benefit from it. Right now we don't know whether the daemons are really down or not.
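
The behaviour asked for here, a bounded wait that ends with a non-zero status on timeout, could look roughly like the sketch below. wait_for_exit is a hypothetical helper, not the init script's wait_pid (whose body is not shown in this report), and it assumes the caller is allowed to signal the process, as root is in an init script.

```shell
# wait_for_exit PID TIMEOUT
# Hypothetical helper (not the init script's wait_pid): poll once per second
# until PID is gone or TIMEOUT seconds have elapsed.
# Returns 0 if the process exited, 1 if it is still alive at the deadline.
wait_for_exit() {
    pid=$1
    remaining=$2
    # kill -0 sends no signal; it only checks that the process can be signalled
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example: a 3-second sleep outlives a 1-second timeout.
sleep 3 &
child=$!
wait_for_exit "$child" 1 || echo "pid $child still running, reporting failure"
kill "$child" 2>/dev/null || true
wait "$child" 2>/dev/null || true
```

A larger timeout only changes how long the caller waits. The point is that the non-zero status reaches whoever called stop.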

Comment 2 Matthew Farrellee 2011-02-09 13:58:40 UTC
Skipping discussion of 15 sec vs 30 sec vs 300 sec for now.

The return value is 0 because restart = stop;start and the result of stop is discarded. This is also an issue found in the example script from the Fedora link.
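
A minimal illustration of why the status is lost, using hypothetical stubs in place of the real stop and start functions:

```shell
# Hypothetical stubs standing in for the init script's functions.
stop()  { return 1; }   # stop timed out waiting for the daemons
start() { return 0; }   # start found condor still running and did nothing

restart_unchecked() {
    stop
    start               # the function's status is start's status
}

rc=0
restart_unchecked || rc=$?
echo "restart exit code: $rc"
```

The shell only keeps the status of the last command in the sequence, so the failure from stop never reaches the caller.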

Comment 3 Matthew Farrellee 2011-02-09 14:23:15 UTC
I imagine the underlying issue you have is that, although it makes sense for restart to ignore the result of stop, the stop is initiated but only completes after the start has completed. The start does not reverse the stop operation, and likely should not.

The warning is there to warn you that your restart attempt may have failed. In fact, restart may have operated as a stop.

Are you trying to program something that relies on the init script return codes?

Comment 4 Tomas Rusnak 2011-02-09 16:09:47 UTC
Yes. We are using init scripts in almost all of our automated tests. 

I understand that restart is stop() followed by start(). The main issue is that RETVAL is 0 even when stop() has failed.

The following table shows what should be expected in each case:

       | stop() | start() | restart()
-------------------------------------
RETVAL |   0    |   0     |    0
       |   1    |   -     |    1
       |   0    |   1     |    1
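
The three rows can be exercised with a small sketch. STOP_RC, START_RC and STARTED are hypothetical variables used only to drive the stubs; the restart function here illustrates the table and is not the code from the init script.

```shell
# Hypothetical stubs: STOP_RC and START_RC choose each function's status,
# STARTED records whether start was attempted at all.
stop()  { return "$STOP_RC"; }
start() { STARTED=yes; return "$START_RC"; }

restart() {
    stop || return $?   # row 2: stop failed, start is never attempted
    start               # rows 1 and 3: restart reports start's status
}

for pair in "0 0" "1 0" "0 1"; do
    set -- $pair
    STOP_RC=$1
    START_RC=$2
    STARTED=no
    rc=0
    restart || rc=$?
    echo "stop=$STOP_RC start=$START_RC started=$STARTED restart=$rc"
done
```

In the second row start is shown as "-" in the table because it is skipped, which the started=no field makes visible.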

As for the timer discussion, there is no need to keep changing the value again and again :) I wrote the following proof-of-concept script to enhance the stop() function.

Basically, the script remembers the number of running daemons in each iteration and resets the counter if the shutdown is still in progress.

stop() {
    echo -n $"Stopping Condor daemons: "
    killproc -p $pidfile $prog -QUIT
    RETVAL=$?
    echo
    DAEMON_TIMER=15
    DAEMONS_COUNT=`ps ax | grep condor | wc -l`
    while [ "$DAEMON_TIMER" -gt "0" ]; do
      echo "waitpid"
      wait_pid $pidfile $DAEMON_TIMER
      if [ $? -ne 0 ]; then
          # If this happens during a restart the start is likely to see
          # condor still running and just return 0, which means when
          # condor exits it won't be restarted
          #echo $"Warning: $prog may not have exited, start/restart may fail"
          DAEMONS_ALIVE=`ps ax | grep condor | wc -l`
          echo "Still $DAEMONS_ALIVE/$DAEMONS_COUNT daemons alive..."

          # keep waiting as long as some daemons went down in the last interval
          if [ $DAEMONS_ALIVE -lt $DAEMONS_COUNT ]; then
            echo "Resetting counter..."
            DAEMONS_COUNT=$DAEMONS_ALIVE
          else
            # no progress: give up and report failure; setting RETVAL only
            # here lets a later successful wait still return 0
            DAEMON_TIMER=0
            RETVAL=1
          fi
      else
        DAEMON_TIMER=0
      fi
    done
    [ $RETVAL -eq 0 ] && rm -f $lockfile
    return $RETVAL
}

Let condor run about 100 jobs with NUM_CPUS=100 and try a restart. The slower the machine, the better the results.

Does that sound reasonable to you?

Comment 5 Matthew Farrellee 2011-02-22 20:54:04 UTC
http://refspecs.freestandards.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

The restart, try-restart, reload and force-reload actions may be atomic; that is if a service is known not to be operational after a restart or reload, the script may return an error without any further action.

--

This sounds like it allows for your "1 - 1" row.

Comment 6 Matthew Farrellee 2011-02-22 21:35:32 UTC
commit b99b97a78fa6c5a312b126467fc55bff42f7f90e
Author: Matthew Farrellee <matt@>
Date:   Tue Feb 22 16:33:40 2011 -0500

    Abort restart attempt with error if stop step fails

diff --git a/src/condor_examples/condor.init b/src/condor_examples/condor.init
index db565a9..6d36110 100644
--- a/src/condor_examples/condor.init
+++ b/src/condor_examples/condor.init
@@ -186,14 +186,18 @@ case "$1" in
        RETVAL=$?
        ;;
     restart)
-       [ $running -eq 0 ] && stop
-       start
+       RETVAL=0
+       if [ $running -eq 0 ]; then
+           stop
+           RETVAL=$?
+       fi
+       [ $RETVAL -eq 0 ] && start
        RETVAL=$?
        ;;
     condrestart|try-restart)
        [ $running -eq 0 ] || exit 0
        stop
-       start
+       [ $? -eq 0 ] && start
        RETVAL=$?
        ;;
     reload|force-reload)
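
One detail of the patch worth spelling out: when the test in `[ $RETVAL -eq 0 ] && start` fails, start is skipped and `$?` holds the status of the failed test, so the following `RETVAL=$?` still ends up non-zero. A small sketch with a hypothetical stub for start:

```shell
# Hypothetical stub; STARTED records whether start was called.
STARTED=no
start() { STARTED=yes; return 0; }

RETVAL=1                       # as if stop had just failed
[ $RETVAL -eq 0 ] && start     # test fails, start is skipped
RETVAL=$?                      # picks up the failed test's status
echo "RETVAL=$RETVAL STARTED=$STARTED"
```

When stop succeeds instead, the test passes, start runs, and `RETVAL=$?` records start's own status, as before the patch.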

Comment 7 Matthew Farrellee 2011-02-23 15:20:37 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: restart actions unconditionally start and ignore errors from the stop action.
C: It is difficult to script the init script
F: The init script skips a start attempt and reports an error if stopping fails during a restart.
R: The condor sysV init script now reports an error if stopping the service fails during a restart operation.

Comment 9 Tomas Rusnak 2011-05-04 08:51:28 UTC
Retested on all supported architectures (x86, x86_64) on RHEL5 and RHEL6 with:

condor-7.6.1-0.4

The restart action in the condor init script has changed:

    restart)
        RETVAL=0
        if [ $running -eq 0 ]; then
            stop
            RETVAL=$?
        fi
        [ $RETVAL -eq 0 ] && start
        RETVAL=$?
        ;;

# service condor restart
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail
# echo $?
1

Return code is now correct.

>>> VERIFIED

Comment 10 errata-xmlrpc 2011-06-23 15:39:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

