Bug 800065 - Cumin processes sometimes do not exit and must be killed from master with SIGKILL
Summary: Cumin processes sometimes do not exit and must be killed from master with SIG...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: cumin
Version: Development
Hardware: Unspecified
OS: Unspecified
low
medium
Target Milestone: 2.3
: ---
Assignee: Trevor McKay
QA Contact: Daniel Horák
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-03-05 17:03 UTC by Trevor McKay
Modified: 2013-03-06 18:42 UTC (History)
5 users (show)

Fixed In Version: cumin-0.1.5658-1
Doc Type: Bug Fix
Doc Text:
Cause The cumin service script did not correctly detect the termination of the cumin master process, rather it assumed that the service exited almost immediately after a SIGTERM was issued to the cumin master process. This was not always true, since the cumin child processes might take seconds to shutdown gracefully. Consequence In the case of a restart, the cumin service would not always be restarted. When the 'start' command was issued too soon after the 'stop', the service might still be running from the first invocation and the init.d script would simply return "OK". This could be true in the case of the 'restart' command or in the case of 'stop' and 'start' issued in sequence. Fix The cumin service script now checks the status of the cumin master process after issuing a SIGTERM. If the service does not shutdown within a timeout (5 seconds), the service script will note that the service may not have shutdown completely and will not remove the pidfile. This allows the service 'status' command to be used to check the process status. Existence of a pidfile without a running service indicates a crash or a prolonged shutdown; check /var/log/cumin/master.log for shutdown messages. Failure to shutdown within the timeout is highly unlikely. The cumin service 'stop' command will not return OK until the service has been stopped or the timeout has expired. Additionally, the cumin child processes and the cumin master process should guarantee their own exits, making zombie processes extremely unlikely. Result Shutdown and restart of the cumin service should now be very reliable.
Clone Of:
Environment:
Last Closed: 2013-03-06 18:42:10 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Notes for testing (16.40 KB, application/vnd.oasis.opendocument.text)
2012-10-05 18:48 UTC, Trevor McKay
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2013:0564 0 normal SHIPPED_LIVE Low: Red Hat Enterprise MRG Grid 2.3 security update 2013-03-06 23:37:09 UTC

Description Trevor McKay 2012-03-05 17:03:31 UTC
Description of problem:

The shutdown code in Cumin attempts to let the child processes run a clean shutdown procedure, rather than just killing them.  The master will wait 5 seconds for all children to exit before sending them SIGKILL.

Sometimes, Cumin processes do not exit on their own and the SIGKILL is issued.  Find out why this is.

Cumin used to just kill children with SIGTERM, which gave them no opportunity for clean shutdown.

Version-Release number of selected component (if applicable):

revision 5236 and above have the updated shutdown code

How reproducible:

Pretty often

Steps to Reproduce:
1.  Start the cumin service
2.  Stop the cumin service
3.  Check /var/log/cumin/master.log for entries indicating that some children did not shut down cleanly and had to be SIGKILLed.
  
Actual results:

Cumin processes do not always exit cleanly.

Expected results:

Cumin processes should exit without SIGKILL

Additional info:

Issuing SIGKILL after a 5 second delay is not really any worse then former behavior (except for the delay).  In previous packages, the children were just given an unhandled SIGTERM signal anyway.  At least in the new code, they are shutting down cleanly most of the time.  Try to improve this so it is 100%.

Comment 1 Stanislav Graf 2012-03-06 07:21:26 UTC
Also we need to wait those 5-10 seconds in 'restart' case.

# rpm -q cumin
cumin-0.1.5233-1.el6.noarch
ecode=0
# service cumin start
Starting cumin:                                            [  OK  ]
ecode=0
# ps -C cumin
  PID TTY          TIME CMD
29559 ?        00:00:00 cumin
ecode=0
# service cumin restart
Stopping cumin:                                            [  OK  ]
Starting cumin:                                            [  OK  ]
ecode=0

But the same process is still running:

# ps -C cumin
  PID TTY          TIME CMD
29559 ?        00:00:00 cumin
ecode=0

wait 5 seconds

# ps -C cumin
  PID TTY          TIME CMD
ecode=1
# service cumin status
cumin is stopped
ecode=3

Comment 2 Luigi Toscano 2012-03-07 19:46:42 UTC
> Sometimes, Cumin processes do not exit on their own and the SIGKILL is issued. 
> Find out why this is.
Any news on the root cause of this?

Comment 3 Trevor McKay 2012-03-13 15:17:11 UTC
No information on the root cause as of yet.  I should have noted the following in the original report:

There was an attempt to fix this in an earlier revision with use of SIGINT.  The SIGINT implementation did not work, resulting in shutdown always ending with a SIGKILL (children never shut down cleanly, every).  This was never released.  The cause here seems to be handling of SIGINT by Python in spawned processes -- academic interest, but not a priority to understand.  

In the new implementation, SIGTERM is used and most of the children shut down cleanly most of the time, but not always.  We need to find out why.  I have a few theories, like signals received while the interpreter is executing C modules (like python-psycopg2).  I also have an idea for an alternate implementation where spawned children in console mode (ie, running as a service) read from stdin directed to a pipe, and wait for the master to write to the pipe to signal a shutdown.  This type of IO may be more reliable than a signal.

Comment 4 Trevor McKay 2012-04-26 14:03:52 UTC
Update, I've had initial success with switching the mechanism to use a stdin pipe from the master to the children.  This is what we'll probably go with.

Note on testing and verification:

Behavior of the master and the children can be seen in /var/log/cumin/master.log.  A failed shutdown will look like this, where one or more children failed to terminate and had to be forcefully killed.  The log notes which children have to be killed.

24313 2012-04-26 08:49:41,477 INFO Send SIGTERM to all children
24313 2012-04-26 08:49:42,743 INFO Subprocess (26095) exited
24313 2012-04-26 08:49:42,744 INFO Subprocess (24316) exited
24313 2012-04-26 08:49:42,744 INFO Subprocess (24318) exited
24313 2012-04-26 08:49:42,744 INFO Subprocess (24319) exited
24313 2012-04-26 08:49:42,744 INFO Subprocess (24320) exited
24313 2012-04-26 08:49:46,496 WARNING Timed out waiting for children, send SIGKILL
24313 2012-04-26 08:49:46,496 WARNING Send SIGKILL to subprocess (24317)


A successful shutdown looks like this:

4407 2012-04-26 10:02:59,815 INFO Send SIGTERM to all children
4407 2012-04-26 10:03:00,066 INFO Subprocess (4412) exited
4407 2012-04-26 10:03:00,067 INFO Subprocess (4413) exited
4407 2012-04-26 10:03:00,067 INFO Subprocess (4414) exited
4407 2012-04-26 10:03:00,067 INFO Subprocess (4415) exited
4407 2012-04-26 10:03:00,067 INFO All children exited

Comment 5 Trevor McKay 2012-05-15 00:17:45 UTC
Fixed in revision 5375.

New options for cumin-data, cumin-web, and cumin-report:

--es=whatever

If this option is set, the app will use select to read stdin.  If data shows up on stdin that matches 'whatever', the app will exit.  The master writes to stdin through a pipe to send the termination signal to children.  This replaces the use of signals.

If --es is not set, stdin is ignored.

--tm=timeout

These apps run the shutdown routine in a separate thread.  If that thread doesn't complete within 'timeout' seconds, the main thread will simply exit.  This is possible because the shutdown thread is launched as a daemon thread.  So, the child apps now guarantee their own exit without a KILL from the master.

New options for the cumin master script:

--timeout

The master script passes this value to the children as the --tm option.  It also will wait for timeout+1 seconds for children to exit after the termination string has been sent before exiting itself.  This should all the init.d script to reliably use the cumin master pid as an indicator of whether or not the service is running.

Worst case, if a child was delayed in exiting and a new service was started, they may share a log file until the first child exits.  Or, in the case of cumin-web, the socket would probably still be bound and the new cumin-web would experience restarts until the old exited.

But, with the shutdown routine in a separate thread in the children, this is highly unlikely.


Here is the commit comment:

Comprehensive changes to the shutdown mechanism using stdin inside of signals.
* Master communicates with children over stdin (redirected to a pipe).
* If a child sees input on stdin matching the --es option value, it will exit.
* Children launch shutdown routines in a daemon thread and call join() from the main thread 
*  with a timeout of --tm. This guarantees exit after --tm seconds.
* The master sets the --tm option for children based on its own --timeout option.
* The master will exit after --timeout seconds, even if children have not exited
*  since they are guaranteed to do so. This allows initd to to track the pid of
*  the master script with confidence.
* Initd script uses the killproc form that sends a SIGTERM and checks the pid
*  with a delay before issuing SIGKILL.  The delay is the same value that is
*  passed as the --timeout option to the master.
* Artificial 3 second delay between stop and start on restart has been removed.

Comment 6 Trevor McKay 2012-05-15 00:29:49 UTC
Note, contrary to comment 4,

there is no longer any case where the master will send SIGKILL to the children.  The children are expected to exit themselves (there is no reason why they should not).

If the master exits before the children, it will look like this:

28113 2012-05-14 18:48:30,194 INFO Write termination string to all children
28113 2012-05-14 18:48:30,445 INFO Subprocess (28122) exited
28113 2012-05-14 18:48:31,196 WARNING Timed out waiting for children, exiting
28113 2012-05-14 18:48:31,197 WARNING Subprocess (28118) failed to stop
28113 2012-05-14 18:48:31,197 WARNING Subprocess (28119) failed to stop
28113 2012-05-14 18:48:31,197 WARNING Subprocess (28120) failed to stop
28113 2012-05-14 18:48:31,197 WARNING Subprocess (28121) failed to stop

This is actually unlikely, the above was generated with a whitebox test where the master and children used a different timeout.  With them using the same timeout, this should be really, really unusual.

Default timeout is 5 seconds.  /etc/init.d/cumin also sets the --timeout argument to the master at 5 seconds.

Comment 7 Trevor McKay 2012-05-15 00:41:12 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    The cumin service script did not correctly detect the termination of the cumin master process, rather it assumed that the service exited almost immediately after a SIGTERM was issued to the cumin master process.  This was not always true, since the cumin child processes might take seconds to shutdown gracefully.

Consequence
    In the case of a restart, the cumin service would not always be restarted.  When the 'start' command was issued too soon after the 'stop', the service might still be running from the first invocation and the init.d script would simply return "OK".  This could be true in the case of the 'restart' command or in the case of 'stop' and 'start' issued in sequence. 

Fix
    The cumin service script now checks the status of the cumin master process after issuing a SIGTERM.  If the service does not shutdown after a certain timeout, it will issue a SIGKILL.  The cumin service 'stop' command will not return OK until the service has been stopped.  Additionally, the cumin child processes and the cumin master process should guarantee their own exits, making SIGKILL from the service script or zombie processes extremely unlikely.

Result
    Shutdown and restart of the cumin service should now be very reliably.

Comment 8 Trevor McKay 2012-05-18 14:46:14 UTC
Changes to note in comment 5

* Initd script uses the killproc form that sends a SIGTERM and checks the pid
*  with a delay before issuing SIGKILL.  The delay is the same value that is
*  passed as the --timeout option to the master.

revision 5383

note, this is no longer true in the latest revision.  With the addition of /usr/sbin/cumin-checkpid which takes a delay, there is no need to use the alternate KILLPROC form.  Initd calls /usr/sbin/cumin-checkpid --exit --timeout=$DELAY and waits for a result.

Comment 9 Trevor McKay 2012-05-18 14:55:21 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -5,7 +5,9 @@
     In the case of a restart, the cumin service would not always be restarted.  When the 'start' command was issued too soon after the 'stop', the service might still be running from the first invocation and the init.d script would simply return "OK".  This could be true in the case of the 'restart' command or in the case of 'stop' and 'start' issued in sequence. 
 
 Fix
-    The cumin service script now checks the status of the cumin master process after issuing a SIGTERM.  If the service does not shutdown after a certain timeout, it will issue a SIGKILL.  The cumin service 'stop' command will not return OK until the service has been stopped.  Additionally, the cumin child processes and the cumin master process should guarantee their own exits, making SIGKILL from the service script or zombie processes extremely unlikely.
+    The cumin service script now checks the status of the cumin master process after issuing a SIGTERM.  If the service does not shutdown within a timeout (5 seconds), the service script will note that the service may not have shutdown completely and will not remove the pidfile.  This allows the service 'status' command to be used to check the process status.  Existence of a pidfile without a running service indicates a crash or a prolonged shutdown; check /var/log/cumin/master.log for shutdown messages.  Failure to shutdown within the timeout is highly unlikely.  
 
+  The cumin service 'stop' command will not return OK until the service has been stopped or the timeout has expired.  Additionally, the cumin child processes and the cumin master process should guarantee their own exits, making zombie processes extremely unlikely.
+
 Result
-    Shutdown and restart of the cumin service should now be very reliably.+    Shutdown and restart of the cumin service should now be very reliable.

Comment 11 Trevor McKay 2012-10-05 17:46:01 UTC
Modified in revision 5495.

The master script waits for 0.5 seconds longer than the timeout value for the children before timing out itself on shutdown.  

This revision also makes the initd script wait for 1.0 seconds longer than the timeout for the children (or 0.5 seconds longer than the timeout for the master).

The padding between master/children and initd/master make it less likely that any kind of failure to stop conditions will be reported during shutdown.

Comment 12 Trevor McKay 2012-10-05 18:48:31 UTC
Created attachment 622383 [details]
Notes for testing

This attachment contains background information on how the mechanism works and information that hopefully will aid in testing.  It shows several startup/shutdown scenarios and expected output (commands, log files, .init.files).

This mechanism evolved as BZ comments were written, this attachment should pull it all together and give a final acount of how the shutdown mechanism is working.

Comment 14 Trevor McKay 2013-01-15 18:25:23 UTC
Assess whether or not to put SIGKILL after delay back in the master script if a child fails to exit, setting needinfo as a heads up.

There are still cases where a child may not exit.  New cases can potentially be added any time code is updated.  Additionally, empirically it seems that in Python sys.exit() does not always cause an exit unconditionally.  Check into whether this is expected, a known issue, etc.

This is particularly bad in the case of cumin-web.  When a previous instance fails to exit, it locks the port and the new web cannot start.

Hung children can be left, and sysvinit (and systemd later) can be fixed up to report hung children to the user, service restart can be fixed up to note the error, etc.  The service script doesn't currently do a good job of reporting this.

Or, the master can go back to a final SIGKILL upon failure and we can treat each failure to shutdown case as a separate, specific issue that should be ameliorated or eliminated.  I removed the SIGKILL originally because even a 5 second timeout seems arbitrary and I believed children would shutdown eventually, even if late.  But it seems that isn't true, depending on the error.  I believe the SIGKILL is the best defense against failure to start.

Comment 15 Trevor McKay 2013-01-15 20:57:57 UTC
See also Bug 895689

This is the situation that prompted Comment 14 and the desire to have the master issue SIGKILL in addition to the log message on a failed shutdown by a child.

Comment 17 Trevor McKay 2013-01-18 18:07:22 UTC
Fixed in 5658.

Included child exit status in log messages.
Issue SIGKILL after timeout to children who have still not stopped.
Wait up to an additional timeout to report child termination after SIGKILL.
Exit with warning if there are still child processes alive after all that.

Modified: trunk/cumin/bin/cumin
===================================================================
--- trunk/cumin/bin/cumin       2013-01-18 15:32:12 UTC (rev 5657)
+++ trunk/cumin/bin/cumin       2013-01-18 18:02:31 UTC (rev 5658)
@@ -320,17 +320,18 @@
                               % app[PROCESS].pid)
 
         # Wait for children to exit so we can record the result and so that
-        # the initd script can track the pid correctly.  Exit regardless after
+        # the initd script can track the pid correctly.  Issue SIGKILL after
         # timeout+0.5 seconds, even if the children are still running.  Each child
         # will have a hard exit after timeout seconds from its own main thread,
         # even if its shutdown activities have not completed, so in most
         # cases the children *should* exit within timeout+0.5 seconds
         then = datetime.now()
+        sent_kill = False
         while complete != 0:        
             for app in apps:
                 poll = app[PROCESS] and app[PROCESS].poll()
                 if poll is not None:
-                    log.info("Subprocess (%s) exited", app[PROCESS].pid)
+                    log.info("Subprocess (%s) exited (%s)" % (app[PROCESS].pid, poll))
                     app[PROCESS] = None
                     complete -= 1
 
@@ -340,12 +341,24 @@
         
             sleep(0.25)
             if datetime.now() - then > timedelta(seconds=options.timeout+0.5):
-                log.warn("Timed out waiting for children, exiting")
-                for app in apps:  # just to be paranoid
-                    if app[PROCESS]:
-                        log.warn("Subprocess (%s) failed to stop"\
-                                 % app[PROCESS].pid)
-                break
+                if sent_kill:
+                    log.warn("Timed out waiting for children, exiting")
+                    for app in apps:
+                        if app[PROCESS]:
+                            log.warn("Subprocess (%s) failed to stop"\
+                                     % app[PROCESS].pid)
+                    break
+                else:
+                    log.warn("Timed out waiting for children, sending SIGKILL")
+                    for app in apps:  # just to be paranoid
+                        if app[PROCESS]:
+                            os.kill(app[PROCESS].pid, signal.SIGKILL)
+                            log.warn("Subprocess (%s) failed to stop, "\
+                                     "sending SIGKILL" % (app[PROCESS].pid))
+
+                    # Okay, let's loop one more time and report
+                    sent_kill = True
+                    then = datetime.now()
     
         write_status_path(config.get_init_status_path(), "exit\n", append=True)
     return return_code

Comment 18 Trevor McKay 2013-01-18 19:48:29 UTC
Update on testing and log content.

1) Normal termination of the cumin service with a default configuration will produce log messages like the following in /var/log/cumin/master.log.  The only difference here is the addition of the exit status (all are (0) below):

service cumin start
(wait)
service cumin stop

look in /var/log/cumin/master.log

10728 2013-01-18 13:51:22,534 INFO Write termination string to all children
10728 2013-01-18 13:51:23,537 INFO Subprocess (10741) exited (0)
10728 2013-01-18 13:51:23,537 INFO Subprocess (10743) exited (0)
10728 2013-01-18 13:51:24,039 INFO Subprocess (10740) exited (0)
10728 2013-01-18 13:51:24,541 INFO Subprocess (10742) exited (0)
10728 2013-01-18 13:51:24,542 INFO Subprocess (10744) exited (0)
10728 2013-01-18 13:51:26,045 INFO Subprocess (10739) exited (0)
10728 2013-01-18 13:51:26,046 INFO All children exited

Different ways to cause an abnormal termination case:
-----------------------------------------------------

2) To produce a case with the default installation where the master actually times out and the child is killed, we need a real case with a "hung" child.  Bug 895689 would be a good candidate for this (seems to be relatively easy to reproduce on RHEL 6)

It seems to help when child processes are very busy.  For example, use a pool with a lot of activity and a bunch of wallaby nodes/features.  tail -f /var/log/cumin/web.log with debug level logging on and wait until WallabyOperations has started to refresh nodes, features, tags, etc.  Then 'service cumin stop'

3) The odds of an abnormal termination case may be increased by shortening the master timeout to 0.  This can be done by putting the following line in /etc/sysconfig/cumin:

CUMIN_OPTIONS="--timeout=0"

This will cause the master to wait for 0 seconds before issuing SIGKILL to children who have not stopped.  However, it will also cause the children to only wait 0 seconds before exiting themselves with sys.exit, so some children may complete with a normal termination status (0) and not exit do to the signal.

service cumin start
(wait)
service cumin stop

look in var/log/cumin/master.log

(in this example the master noticed that 11706 and 11711 had not exited and issued SIGKILL, then waited for exit status.  These two processes actually finished their normal termination and so exited with status 0).

11695 2013-01-18 14:06:07,316 INFO Write termination string to all children
11695 2013-01-18 14:06:07,567 INFO Subprocess (11707) exited (0)
11695 2013-01-18 14:06:07,567 INFO Subprocess (11708) exited (0)
11695 2013-01-18 14:06:07,567 INFO Subprocess (11709) exited (0)
11695 2013-01-18 14:06:07,568 INFO Subprocess (11710) exited (0)
11695 2013-01-18 14:06:07,818 WARNING Timed out waiting for children, sending SIGKILL
11695 2013-01-18 14:06:07,819 WARNING Subprocess (11706) failed to stop, sending SIGKILL
11695 2013-01-18 14:06:07,819 WARNING Subprocess (11711) failed to stop, sending SIGKILL
11695 2013-01-18 14:06:07,819 INFO Subprocess (11706) exited (0)
11695 2013-01-18 14:06:07,820 INFO Subprocess (11711) exited (0)
11695 2013-01-18 14:06:07,820 INFO All children exited


4) If a whitebox test is okay, the following is another option.  Modify the /usr/bin/cumin script to always pass a timeout value of 5 to the children.  Use the /etc/sysconfig/cumin file to change the master's timeout to 0.  This will cause the master to timeout immediately but the children will still be executing their shutdown procedures for up to 5 seconds.

a) This sed command will change the following line which sets the child timeout
    args = [app, "--section="+section.strip(), "--es="+es.strip(), "--tm="+str(timeout)]

# sed -i s/str\(timeout\)/str\(5\)/ /usr/bin/cumin

b) Set the following value in /etc/sysconfig/cumin:

CUMIN_OPTIONS="--timeout=0"

c) 

service cumin start
(wait)
service cumin stop

Look in /var/log/cumin/master.log for messages like the following.  I included the last line of child startup to verify that the children have a timeout value of 5 after the sed command.  Notice the status of each child is -9 which indicates termination from a signal.

...
12402 2013-01-18 14:34:47,438 INFO Started subprocess (pid 12418):  cumin-data --section=data.other --es=exit --tm=5 --daemon
12402 2013-01-18 14:34:54,739 INFO Write termination string to all children
12402 2013-01-18 14:34:55,242 WARNING Timed out waiting for children, sending SIGKILL
12402 2013-01-18 14:34:55,243 WARNING Subprocess (12413) failed to stop, sending SIGKILL
12402 2013-01-18 14:34:55,244 WARNING Subprocess (12414) failed to stop, sending SIGKILL
12402 2013-01-18 14:34:55,244 WARNING Subprocess (12415) failed to stop, sending SIGKILL
12402 2013-01-18 14:34:55,252 WARNING Subprocess (12416) failed to stop, sending SIGKILL
12402 2013-01-18 14:34:55,258 WARNING Subprocess (12417) failed to stop, sending SIGKILL
12402 2013-01-18 14:34:55,261 WARNING Subprocess (12418) failed to stop, sending SIGKILL
12402 2013-01-18 14:34:55,262 INFO Subprocess (12414) exited (-9)
12402 2013-01-18 14:34:55,262 INFO Subprocess (12415) exited (-9)
12402 2013-01-18 14:34:55,262 INFO Subprocess (12416) exited (-9)
12402 2013-01-18 14:34:55,262 INFO Subprocess (12417) exited (-9)
12402 2013-01-18 14:34:55,263 INFO Subprocess (12418) exited (-9)
12402 2013-01-18 14:34:55,513 INFO Subprocess (12413) exited (-9)
12402 2013-01-18 14:34:55,514 INFO All children exited

Comment 19 Trevor McKay 2013-01-18 20:29:12 UTC
fyi, here is a log of a shutdown with a real occurrence of Bug 895689 in the wild using a local build of the rpm:

14757 2013-01-18 15:15:16,902 INFO Started subprocess (pid 14768):  cumin-web --section=web --es=exit --tm=5 --daemon
14757 2013-01-18 15:15:16,905 INFO Started subprocess (pid 14769):  cumin-data --section=data.grid --es=exit --tm=5 --daemon
14757 2013-01-18 15:15:16,924 INFO Started subprocess (pid 14770):  cumin-data --section=data.grid-slots --es=exit --tm=5 --daemon
14757 2013-01-18 15:15:16,934 INFO Started subprocess (pid 14771):  cumin-data --section=data.grid-submissions --es=exit --tm=5 --daemon
14757 2013-01-18 15:15:16,944 INFO Started subprocess (pid 14772):  cumin-data --section=data.sesame --es=exit --tm=5 --daemon
14757 2013-01-18 15:15:16,953 INFO Started subprocess (pid 14773):  cumin-data --section=data.other --es=exit --tm=5 --daemon
14757 2013-01-18 15:18:49,604 INFO Write termination string to all children
14757 2013-01-18 15:18:50,607 INFO Subprocess (14770) exited (0)
14757 2013-01-18 15:18:50,738 INFO Subprocess (14771) exited (0)
14757 2013-01-18 15:18:50,990 INFO Subprocess (14772) exited (0)
14757 2013-01-18 15:18:51,241 INFO Subprocess (14769) exited (0)
14757 2013-01-18 15:18:54,750 INFO Subprocess (14773) exited (0)
14757 2013-01-18 15:18:55,252 WARNING Timed out waiting for children, sending SIGKILL
14757 2013-01-18 15:18:55,252 WARNING Subprocess (14768) failed to stop, sending SIGKILL
14757 2013-01-18 15:18:55,503 INFO Subprocess (14768) exited (-9)
14757 2013-01-18 15:18:55,504 INFO All children exited

Entries from the web.log showing failure to stop and the new process after the restart:

14768 2013-01-18 15:18:49,627 INFO Stopping Cumin
14768 2013-01-18 15:18:49,627 DEBUG WallabyOperations: waiting for cache maintenance thread to exit
14768 2013-01-18 15:18:50,387 INFO Request GET /index.update?session=index.html%3Fframe%3Dmain.grid.tag%3Bmain.grid.m%3Dtag%3Bmain.grid.id%3D24%3Bmain.grid.view.body.m%3Dtagi%3Bmain.grid.tag.m%3Dview%3Bmain.grid.tag.id%3Dfoobar;widget=main.notifications;widget=main.grid.tag.view.body.tagoverview.taggen
14768 2013-01-18 15:18:50,393 INFO Response 200 OK
14768 2013-01-18 15:18:50,394 DEBUG Response headers:
14768 2013-01-18 15:18:50,394 DEBUG   Content-Length            1173
14768 2013-01-18 15:18:50,394 DEBUG   Content-Type              text/xml
14768 2013-01-18 15:18:50,394 DEBUG   Cache-Control             no-cache
14768 2013-01-18 15:18:54,627 INFO Shutdown thread timed out, exiting
15107 2013-01-18 15:19:04,443 INFO Checking Cumin

Comment 21 Daniel Horák 2013-01-31 13:15:55 UTC
Tested along to comment 18 and 19 on RHEL 5.9/6.4 - i386/x86_64 with cumin-0.1.5658-1. It behaves as expected - service cumin stop doesn't finish with OK status until the service is properly stopped (or killed after timeout - see comment 14 and 15).

# tail -F /var/log/cumin/master.log
  10179 2013-01-31 13:57:43,257 INFO Write termination string to all children
  10179 2013-01-31 13:57:43,778 INFO Subprocess (10194) exited (0)
  10179 2013-01-31 13:57:44,282 INFO Subprocess (10190) exited (0)
  10179 2013-01-31 13:57:44,283 INFO Subprocess (10191) exited (0)
  10179 2013-01-31 13:57:44,542 INFO Subprocess (10192) exited (0)
  10179 2013-01-31 13:57:44,542 INFO Subprocess (10195) exited (0)
  10179 2013-01-31 13:57:44,797 INFO Subprocess (10193) exited (0)
  10179 2013-01-31 13:57:44,797 INFO All children exited


# tail -F /var/log/cumin/web.log 
  17026 2013-01-31 14:08:13,606 DEBUG WallabyOperations: refreshing features
  17026 2013-01-31 14:08:28,160 DEBUG Exiting on termination string from master
  17026 2013-01-31 14:08:28,180 INFO Stopping Cumin
  17026 2013-01-31 14:08:28,181 INFO Shutdown thread timed out, exiting

# tail -F /var/log/cumin/master.log
  17002 2013-01-31 14:07:45,717 INFO Started subprocess (pid 17026):  cumin-web --section=web --es=exit --tm=0 --daemon
  17002 2013-01-31 14:07:45,744 INFO Started subprocess (pid 17027):  cumin-data --section=data.grid --es=exit --tm=0 --daemon
  17002 2013-01-31 14:07:45,770 INFO Started subprocess (pid 17028):  cumin-data --section=data.grid-slots --es=exit --tm=0 --daemon
  17002 2013-01-31 14:07:45,846 INFO Started subprocess (pid 17029):  cumin-data --section=data.grid-submissions --es=exit --tm=0 --daemon
  17002 2013-01-31 14:07:45,960 INFO Started subprocess (pid 17030):  cumin-data --section=data.sesame --es=exit --tm=0 --daemon
  17002 2013-01-31 14:08:28,159 INFO Write termination string to all children
  17002 2013-01-31 14:08:28,410 INFO Subprocess (17027) exited (0)
  17002 2013-01-31 14:08:28,410 INFO Subprocess (17028) exited (0)
  17002 2013-01-31 14:08:28,411 INFO Subprocess (17030) exited (0)
  17002 2013-01-31 14:08:28,661 WARNING Timed out waiting for children, sending SIGKILL
  17002 2013-01-31 14:08:28,662 WARNING Subprocess (17026) failed to stop, sending SIGKILL
  17002 2013-01-31 14:08:28,662 WARNING Subprocess (17029) failed to stop, sending SIGKILL
  17002 2013-01-31 14:08:28,662 INFO Subprocess (17029) exited (0)
  17002 2013-01-31 14:08:28,920 INFO Subprocess (17026) exited (-9)
  17002 2013-01-31 14:08:28,920 INFO All children exited


# tail -F /var/log/cumin/web.log 
  21730 2013-01-31 13:36:09,375 DEBUG WallabyOperations: refreshing features
  21730 2013-01-31 13:36:25,651 DEBUG Exiting on termination string from master
  21730 2013-01-31 13:36:25,668 INFO Stopping Cumin
  21730 2013-01-31 13:36:25,668 DEBUG WallabyOperations: waiting for cache maintenance thread to exit
  21730 2013-01-31 13:36:30,690 INFO Shutdown thread timed out, exiting

# tail -F /var/log/cumin/master.log 
  21712 2013-01-31 13:12:47,751 INFO Started subprocess (pid 21730):  cumin-web --section=web --es=exit --tm=5 --daemon
  21712 2013-01-31 13:12:47,778 INFO Started subprocess (pid 21732):  cumin-data --section=data.grid --es=exit --tm=5 --daemon
  21712 2013-01-31 13:12:47,817 INFO Started subprocess (pid 21734):  cumin-data --section=data.grid-slots --es=exit --tm=5 --daemon
  21712 2013-01-31 13:12:47,984 INFO Started subprocess (pid 21735):  cumin-data --section=data.grid-submissions --es=exit --tm=5 --daemon
  21712 2013-01-31 13:12:48,164 INFO Started subprocess (pid 21739):  cumin-data --section=data.sesame --es=exit --tm=5 --daemon
  21712 2013-01-31 13:36:25,622 INFO Write termination string to all children
  21712 2013-01-31 13:36:26,139 INFO Subprocess (21739) exited (0)
  21712 2013-01-31 13:36:26,390 INFO Subprocess (21735) exited (0)
  21712 2013-01-31 13:36:26,640 INFO Subprocess (21732) exited (0)
  21712 2013-01-31 13:36:26,897 INFO Subprocess (21734) exited (0)
  21712 2013-01-31 13:36:31,199 WARNING Timed out waiting for children, sending SIGKILL
  21712 2013-01-31 13:36:31,199 WARNING Subprocess (21730) failed to stop, sending SIGKILL
  21712 2013-01-31 13:36:31,200 INFO Subprocess (21730) exited (0)
  21712 2013-01-31 13:36:31,200 INFO All children exited

>>> VERIFIED

Comment 23 errata-xmlrpc 2013-03-06 18:42:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html


Note You need to log in before you can comment on or make changes to this bug.