Description of problem: The shutdown code in Cumin attempts to let the child processes run a clean shutdown procedure, rather than just killing them. The master will wait 5 seconds for all children to exit before sending them SIGKILL. Sometimes, Cumin processes do not exit on their own and the SIGKILL is issued. Find out why this is. Cumin used to just kill children with SIGTERM, which gave them no opportunity for clean shutdown. Version-Release number of selected component (if applicable): revision 5236 and above have the updated shutdown code How reproducible: Pretty often Steps to Reproduce: 1. Start the cumin service 2. Stop the cumin service 3. Check /var/log/cumin/master.log for entries indicating that some children did not shut down cleanly and had to be SIGKILLed. Actual results: Cumin processes do not always exit cleanly. Expected results: Cumin processes should exit without SIGKILL Additional info: Issuing SIGKILL after a 5 second delay is not really any worse then former behavior (except for the delay). In previous packages, the children were just given an unhandled SIGTERM signal anyway. At least in the new code, they are shutting down cleanly most of the time. Try to improve this so it is 100%.
Also we need to wait those 5-10 seconds in 'restart' case. # rpm -q cumin cumin-0.1.5233-1.el6.noarch ecode=0 # service cumin start Starting cumin: [ OK ] ecode=0 # ps -C cumin PID TTY TIME CMD 29559 ? 00:00:00 cumin ecode=0 # service cumin restart Stopping cumin: [ OK ] Starting cumin: [ OK ] ecode=0 But the same process is still running: # ps -C cumin PID TTY TIME CMD 29559 ? 00:00:00 cumin ecode=0 wait 5 seconds # ps -C cumin PID TTY TIME CMD ecode=1 # service cumin status cumin is stopped ecode=3
> Sometimes, Cumin processes do not exit on their own and the SIGKILL is issued. > Find out why this is. Any news on the root cause of this?
No information on the root cause as of yet. I should have noted the following in the original report: There was an attempt to fix this in an earlier revision with use of SIGINT. The SIGINT implementation did not work, resulting in shutdown always ending with a SIGKILL (children never shut down cleanly, every). This was never released. The cause here seems to be handling of SIGINT by Python in spawned processes -- academic interest, but not a priority to understand. In the new implementation, SIGTERM is used and most of the children shut down cleanly most of the time, but not always. We need to find out why. I have a few theories, like signals received while the interpreter is executing C modules (like python-psycopg2). I also have an idea for an alternate implementation where spawned children in console mode (ie, running as a service) read from stdin directed to a pipe, and wait for the master to write to the pipe to signal a shutdown. This type of IO may be more reliable than a signal.
Update, I've had initial success with switching the mechanism to use a stdin pipe from the master to the children. This is what we'll probably go with. Note on testing and verification: Behavior of the master and the children can be seen in /var/log/cumin/master.log. A failed shutdown will look like this, where one or more children failed to terminate and had to be forcefully killed. The log notes which children have to be killed. 24313 2012-04-26 08:49:41,477 INFO Send SIGTERM to all children 24313 2012-04-26 08:49:42,743 INFO Subprocess (26095) exited 24313 2012-04-26 08:49:42,744 INFO Subprocess (24316) exited 24313 2012-04-26 08:49:42,744 INFO Subprocess (24318) exited 24313 2012-04-26 08:49:42,744 INFO Subprocess (24319) exited 24313 2012-04-26 08:49:42,744 INFO Subprocess (24320) exited 24313 2012-04-26 08:49:46,496 WARNING Timed out waiting for children, send SIGKILL 24313 2012-04-26 08:49:46,496 WARNING Send SIGKILL to subprocess (24317) A successful shutdown looks like this: 4407 2012-04-26 10:02:59,815 INFO Send SIGTERM to all children 4407 2012-04-26 10:03:00,066 INFO Subprocess (4412) exited 4407 2012-04-26 10:03:00,067 INFO Subprocess (4413) exited 4407 2012-04-26 10:03:00,067 INFO Subprocess (4414) exited 4407 2012-04-26 10:03:00,067 INFO Subprocess (4415) exited 4407 2012-04-26 10:03:00,067 INFO All children exited
Fixed in revision 5375. New options for cumin-data, cumin-web, and cumin-report: --es=whatever If this option is set, the app will use select to read stdin. If data shows up on stdin that matches 'whatever', the app will exit. The master writes to stdin through a pipe to send the termination signal to children. This replaces the use of signals. If --es is not set, stdin is ignored. --tm=timeout These apps run the shutdown routine in a separate thread. If that thread doesn't complete within 'timeout' seconds, the main thread will simply exit. This is possible because the shutdown thread is launched as a daemon thread. So, the child apps now guarantee their own exit without a KILL from the master. New options for the cumin master script: --timeout The master script passes this value to the children as the --tm option. It also will wait for timeout+1 seconds for children to exit after the termination string has been sent before exiting itself. This should all the init.d script to reliably use the cumin master pid as an indicator of whether or not the service is running. Worst case, if a child was delayed in exiting and a new service was started, they may share a log file until the first child exits. Or, in the case of cumin-web, the socket would probably still be bound and the new cumin-web would experience restarts until the old exited. But, with the shutdown routine in a separate thread in the children, this is highly unlikely. Here is the commit comment: Comprehensive changes to the shutdown mechanism using stdin inside of signals. * Master communicates with children over stdin (redirected to a pipe). * If a child sees input on stdin matching the --es option value, it will exit. * Children launch shutdown routines in a daemon thread and call join() from the main thread * with a timeout of --tm. This guarantees exit after --tm seconds. * The master sets the --tm option for children based on its own --timeout option. * The master will exit after --timeout seconds, even if children have not exited * since they are guaranteed to do so. This allows initd to to track the pid of * the master script with confidence. * Initd script uses the killproc form that sends a SIGTERM and checks the pid * with a delay before issuing SIGKILL. The delay is the same value that is * passed as the --timeout option to the master. * Artificial 3 second delay between stop and start on restart has been removed.
Note, contrary to comment 4, there is no longer any case where the master will send SIGKILL to the children. The children are expected to exit themselves (there is no reason why they should not). If the master exits before the children, it will look like this: 28113 2012-05-14 18:48:30,194 INFO Write termination string to all children 28113 2012-05-14 18:48:30,445 INFO Subprocess (28122) exited 28113 2012-05-14 18:48:31,196 WARNING Timed out waiting for children, exiting 28113 2012-05-14 18:48:31,197 WARNING Subprocess (28118) failed to stop 28113 2012-05-14 18:48:31,197 WARNING Subprocess (28119) failed to stop 28113 2012-05-14 18:48:31,197 WARNING Subprocess (28120) failed to stop 28113 2012-05-14 18:48:31,197 WARNING Subprocess (28121) failed to stop This is actually unlikely, the above was generated with a whitebox test where the master and children used a different timeout. With them using the same timeout, this should be really, really unusual. Default timeout is 5 seconds. /etc/init.d/cumin also sets the --timeout argument to the master at 5 seconds.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause The cumin service script did not correctly detect the termination of the cumin master process, rather it assumed that the service exited almost immediately after a SIGTERM was issued to the cumin master process. This was not always true, since the cumin child processes might take seconds to shutdown gracefully. Consequence In the case of a restart, the cumin service would not always be restarted. When the 'start' command was issued too soon after the 'stop', the service might still be running from the first invocation and the init.d script would simply return "OK". This could be true in the case of the 'restart' command or in the case of 'stop' and 'start' issued in sequence. Fix The cumin service script now checks the status of the cumin master process after issuing a SIGTERM. If the service does not shutdown after a certain timeout, it will issue a SIGKILL. The cumin service 'stop' command will not return OK until the service has been stopped. Additionally, the cumin child processes and the cumin master process should guarantee their own exits, making SIGKILL from the service script or zombie processes extremely unlikely. Result Shutdown and restart of the cumin service should now be very reliably.
Changes to note in comment 5 * Initd script uses the killproc form that sends a SIGTERM and checks the pid * with a delay before issuing SIGKILL. The delay is the same value that is * passed as the --timeout option to the master. revision 5383 note, this is no longer true in the latest revision. With the addition of /usr/sbin/cumin-checkpid which takes a delay, there is no need to use the alternate KILLPROC form. Initd calls /usr/sbin/cumin-checkpid --exit --timeout=$DELAY and waits for a result.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -5,7 +5,9 @@ In the case of a restart, the cumin service would not always be restarted. When the 'start' command was issued too soon after the 'stop', the service might still be running from the first invocation and the init.d script would simply return "OK". This could be true in the case of the 'restart' command or in the case of 'stop' and 'start' issued in sequence. Fix - The cumin service script now checks the status of the cumin master process after issuing a SIGTERM. If the service does not shutdown after a certain timeout, it will issue a SIGKILL. The cumin service 'stop' command will not return OK until the service has been stopped. Additionally, the cumin child processes and the cumin master process should guarantee their own exits, making SIGKILL from the service script or zombie processes extremely unlikely. + The cumin service script now checks the status of the cumin master process after issuing a SIGTERM. If the service does not shutdown within a timeout (5 seconds), the service script will note that the service may not have shutdown completely and will not remove the pidfile. This allows the service 'status' command to be used to check the process status. Existence of a pidfile without a running service indicates a crash or a prolonged shutdown; check /var/log/cumin/master.log for shutdown messages. Failure to shutdown within the timeout is highly unlikely. + The cumin service 'stop' command will not return OK until the service has been stopped or the timeout has expired. Additionally, the cumin child processes and the cumin master process should guarantee their own exits, making zombie processes extremely unlikely. + Result - Shutdown and restart of the cumin service should now be very reliably.+ Shutdown and restart of the cumin service should now be very reliable.
Modified in revision 5495. The master script waits for 0.5 seconds longer than the timeout value for the children before timing out itself on shutdown. This revision also makes the initd script wait for 1.0 seconds longer than the timeout for the children (or 0.5 seconds longer than the timeout for the master). The padding between master/children and initd/master make it less likely that any kind of failure to stop conditions will be reported during shutdown.
Created attachment 622383 [details] Notes for testing This attachment contains background information on how the mechanism works and information that hopefully will aid in testing. It shows several startup/shutdown scenarios and expected output (commands, log files, .init.files). This mechanism evolved as BZ comments were written, this attachment should pull it all together and give a final acount of how the shutdown mechanism is working.
Assess whether or not to put SIGKILL after delay back in the master script if a child fails to exit, setting needinfo as a heads up. There are still cases where a child may not exit. New cases can potentially be added any time code is updated. Additionally, empirically it seems that in Python sys.exit() does not always cause an exit unconditionally. Check into whether this is expected, a known issue, etc. This is particularly bad in the case of cumin-web. When a previous instance fails to exit, it locks the port and the new web cannot start. Hung children can be left, and sysvinit (and systemd later) can be fixed up to report hung children to the user, service restart can be fixed up to note the error, etc. The service script doesn't currently do a good job of reporting this. Or, the master can go back to a final SIGKILL upon failure and we can treat each failure to shutdown case as a separate, specific issue that should be ameliorated or eliminated. I removed the SIGKILL originally because even a 5 second timeout seems arbitrary and I believed children would shutdown eventually, even if late. But it seems that isn't true, depending on the error. I believe the SIGKILL is the best defense against failure to start.
See also Bug 895689 This is the situation that prompted Comment 14 and the desire to have the master issue SIGKILL in addition to the log message on a failed shutdown by a child.
Fixed in 5658. Included child exit status in log messages. Issue SIGKILL after timeout to children who have still not stopped. Wait up to an additional timeout to report child termination after SIGKILL. Exit with warning if there are still child processes alive after all that. Modified: trunk/cumin/bin/cumin =================================================================== --- trunk/cumin/bin/cumin 2013-01-18 15:32:12 UTC (rev 5657) +++ trunk/cumin/bin/cumin 2013-01-18 18:02:31 UTC (rev 5658) @@ -320,17 +320,18 @@ % app[PROCESS].pid) # Wait for children to exit so we can record the result and so that - # the initd script can track the pid correctly. Exit regardless after + # the initd script can track the pid correctly. Issue SIGKILL after # timeout+0.5 seconds, even if the children are still running. Each child # will have a hard exit after timeout seconds from its own main thread, # even if its shutdown activities have not completed, so in most # cases the children *should* exit within timeout+0.5 seconds then = datetime.now() + sent_kill = False while complete != 0: for app in apps: poll = app[PROCESS] and app[PROCESS].poll() if poll is not None: - log.info("Subprocess (%s) exited", app[PROCESS].pid) + log.info("Subprocess (%s) exited (%s)" % (app[PROCESS].pid, poll)) app[PROCESS] = None complete -= 1 @@ -340,12 +341,24 @@ sleep(0.25) if datetime.now() - then > timedelta(seconds=options.timeout+0.5): - log.warn("Timed out waiting for children, exiting") - for app in apps: # just to be paranoid - if app[PROCESS]: - log.warn("Subprocess (%s) failed to stop"\ - % app[PROCESS].pid) - break + if sent_kill: + log.warn("Timed out waiting for children, exiting") + for app in apps: + if app[PROCESS]: + log.warn("Subprocess (%s) failed to stop"\ + % app[PROCESS].pid) + break + else: + log.warn("Timed out waiting for children, sending SIGKILL") + for app in apps: # just to be paranoid + if app[PROCESS]: + os.kill(app[PROCESS].pid, signal.SIGKILL) + log.warn("Subprocess (%s) failed to stop, "\ + "sending SIGKILL" % (app[PROCESS].pid)) + + # Okay, let's loop one more time and report + sent_kill = True + then = datetime.now() write_status_path(config.get_init_status_path(), "exit\n", append=True) return return_code
Update on testing and log content. 1) Normal termination of the cumin service with a default configuration will produce log messages like the following in /var/log/cumin/master.log. The only difference here is the addition of the exit status (all are (0) below): service cumin start (wait) service cumin stop look in /var/log/cumin/master.log 10728 2013-01-18 13:51:22,534 INFO Write termination string to all children 10728 2013-01-18 13:51:23,537 INFO Subprocess (10741) exited (0) 10728 2013-01-18 13:51:23,537 INFO Subprocess (10743) exited (0) 10728 2013-01-18 13:51:24,039 INFO Subprocess (10740) exited (0) 10728 2013-01-18 13:51:24,541 INFO Subprocess (10742) exited (0) 10728 2013-01-18 13:51:24,542 INFO Subprocess (10744) exited (0) 10728 2013-01-18 13:51:26,045 INFO Subprocess (10739) exited (0) 10728 2013-01-18 13:51:26,046 INFO All children exited Different ways to cause an abnormal termination case: ----------------------------------------------------- 2) To produce a case with the default installation where the master actually times out and the child is killed, we need a real case with a "hung" child. Bug 895689 would be a good candidate for this (seems to be relatively easy to reproduce on RHEL 6) It seems to help when child processes are very busy. For example, use a pool with a lot of activity and a bunch of wallaby nodes/features. tail -f /var/log/cumin/web.log with debug level logging on and wait until WallabyOperations has started to refresh nodes, features, tags, etc. Then 'service cumin stop' 3) The odds of an abnormal termination case may be increased by shortening the master timeout to 0. This can be done by putting the following line in /etc/sysconfig/cumin: CUMIN_OPTIONS="--timeout=0" This will cause the master to wait for 0 seconds before issuing SIGKILL to children who have not stopped. However, it will also cause the children to only wait 0 seconds before exiting themselves with sys.exit, so some children may complete with a normal termination status (0) and not exit do to the signal. service cumin start (wait) service cumin stop look in var/log/cumin/master.log (in this example the master noticed that 11706 and 11711 had not exited and issued SIGKILL, then waited for exit status. These two processes actually finished their normal termination and so exited with status 0). 11695 2013-01-18 14:06:07,316 INFO Write termination string to all children 11695 2013-01-18 14:06:07,567 INFO Subprocess (11707) exited (0) 11695 2013-01-18 14:06:07,567 INFO Subprocess (11708) exited (0) 11695 2013-01-18 14:06:07,567 INFO Subprocess (11709) exited (0) 11695 2013-01-18 14:06:07,568 INFO Subprocess (11710) exited (0) 11695 2013-01-18 14:06:07,818 WARNING Timed out waiting for children, sending SIGKILL 11695 2013-01-18 14:06:07,819 WARNING Subprocess (11706) failed to stop, sending SIGKILL 11695 2013-01-18 14:06:07,819 WARNING Subprocess (11711) failed to stop, sending SIGKILL 11695 2013-01-18 14:06:07,819 INFO Subprocess (11706) exited (0) 11695 2013-01-18 14:06:07,820 INFO Subprocess (11711) exited (0) 11695 2013-01-18 14:06:07,820 INFO All children exited 4) If a whitebox test is okay, the following is another option. Modify the /usr/bin/cumin script to always pass a timeout value of 5 to the children. Use the /etc/sysconfig/cumin file to change the master's timeout to 0. This will cause the master to timeout immediately but the children will still be executing their shutdown procedures for up to 5 seconds. a) This sed command will change the following line which sets the child timeout args = [app, "--section="+section.strip(), "--es="+es.strip(), "--tm="+str(timeout)] # sed -i s/str\(timeout\)/str\(5\)/ /usr/bin/cumin b) Set the following value in /etc/sysconfig/cumin: CUMIN_OPTIONS="--timeout=0" c) service cumin start (wait) service cumin stop Look in /var/log/cumin/master.log for messages like the following. I included the last line of child startup to verify that the children have a timeout value of 5 after the sed command. Notice the status of each child is -9 which indicates termination from a signal. ... 12402 2013-01-18 14:34:47,438 INFO Started subprocess (pid 12418): cumin-data --section=data.other --es=exit --tm=5 --daemon 12402 2013-01-18 14:34:54,739 INFO Write termination string to all children 12402 2013-01-18 14:34:55,242 WARNING Timed out waiting for children, sending SIGKILL 12402 2013-01-18 14:34:55,243 WARNING Subprocess (12413) failed to stop, sending SIGKILL 12402 2013-01-18 14:34:55,244 WARNING Subprocess (12414) failed to stop, sending SIGKILL 12402 2013-01-18 14:34:55,244 WARNING Subprocess (12415) failed to stop, sending SIGKILL 12402 2013-01-18 14:34:55,252 WARNING Subprocess (12416) failed to stop, sending SIGKILL 12402 2013-01-18 14:34:55,258 WARNING Subprocess (12417) failed to stop, sending SIGKILL 12402 2013-01-18 14:34:55,261 WARNING Subprocess (12418) failed to stop, sending SIGKILL 12402 2013-01-18 14:34:55,262 INFO Subprocess (12414) exited (-9) 12402 2013-01-18 14:34:55,262 INFO Subprocess (12415) exited (-9) 12402 2013-01-18 14:34:55,262 INFO Subprocess (12416) exited (-9) 12402 2013-01-18 14:34:55,262 INFO Subprocess (12417) exited (-9) 12402 2013-01-18 14:34:55,263 INFO Subprocess (12418) exited (-9) 12402 2013-01-18 14:34:55,513 INFO Subprocess (12413) exited (-9) 12402 2013-01-18 14:34:55,514 INFO All children exited
fyi, here is a log of a shutdown with a real occurrence of Bug 895689 in the wild using a local build of the rpm: 14757 2013-01-18 15:15:16,902 INFO Started subprocess (pid 14768): cumin-web --section=web --es=exit --tm=5 --daemon 14757 2013-01-18 15:15:16,905 INFO Started subprocess (pid 14769): cumin-data --section=data.grid --es=exit --tm=5 --daemon 14757 2013-01-18 15:15:16,924 INFO Started subprocess (pid 14770): cumin-data --section=data.grid-slots --es=exit --tm=5 --daemon 14757 2013-01-18 15:15:16,934 INFO Started subprocess (pid 14771): cumin-data --section=data.grid-submissions --es=exit --tm=5 --daemon 14757 2013-01-18 15:15:16,944 INFO Started subprocess (pid 14772): cumin-data --section=data.sesame --es=exit --tm=5 --daemon 14757 2013-01-18 15:15:16,953 INFO Started subprocess (pid 14773): cumin-data --section=data.other --es=exit --tm=5 --daemon 14757 2013-01-18 15:18:49,604 INFO Write termination string to all children 14757 2013-01-18 15:18:50,607 INFO Subprocess (14770) exited (0) 14757 2013-01-18 15:18:50,738 INFO Subprocess (14771) exited (0) 14757 2013-01-18 15:18:50,990 INFO Subprocess (14772) exited (0) 14757 2013-01-18 15:18:51,241 INFO Subprocess (14769) exited (0) 14757 2013-01-18 15:18:54,750 INFO Subprocess (14773) exited (0) 14757 2013-01-18 15:18:55,252 WARNING Timed out waiting for children, sending SIGKILL 14757 2013-01-18 15:18:55,252 WARNING Subprocess (14768) failed to stop, sending SIGKILL 14757 2013-01-18 15:18:55,503 INFO Subprocess (14768) exited (-9) 14757 2013-01-18 15:18:55,504 INFO All children exited Entries from the web.log showing failure to stop and the new process after the restart: 14768 2013-01-18 15:18:49,627 INFO Stopping Cumin 14768 2013-01-18 15:18:49,627 DEBUG WallabyOperations: waiting for cache maintenance thread to exit 14768 2013-01-18 15:18:50,387 INFO Request GET /index.update?session=index.html%3Fframe%3Dmain.grid.tag%3Bmain.grid.m%3Dtag%3Bmain.grid.id%3D24%3Bmain.grid.view.body.m%3Dtagi%3Bmain.grid.tag.m%3Dview%3Bmain.grid.tag.id%3Dfoobar;widget=main.notifications;widget=main.grid.tag.view.body.tagoverview.taggen 14768 2013-01-18 15:18:50,393 INFO Response 200 OK 14768 2013-01-18 15:18:50,394 DEBUG Response headers: 14768 2013-01-18 15:18:50,394 DEBUG Content-Length 1173 14768 2013-01-18 15:18:50,394 DEBUG Content-Type text/xml 14768 2013-01-18 15:18:50,394 DEBUG Cache-Control no-cache 14768 2013-01-18 15:18:54,627 INFO Shutdown thread timed out, exiting 15107 2013-01-18 15:19:04,443 INFO Checking Cumin
Tested along to comment 18 and 19 on RHEL 5.9/6.4 - i386/x86_64 with cumin-0.1.5658-1. It behaves as expected - service cumin stop doesn't finish with OK status until the service is properly stopped (or killed after timeout - see comment 14 and 15). # tail -F /var/log/cumin/master.log 10179 2013-01-31 13:57:43,257 INFO Write termination string to all children 10179 2013-01-31 13:57:43,778 INFO Subprocess (10194) exited (0) 10179 2013-01-31 13:57:44,282 INFO Subprocess (10190) exited (0) 10179 2013-01-31 13:57:44,283 INFO Subprocess (10191) exited (0) 10179 2013-01-31 13:57:44,542 INFO Subprocess (10192) exited (0) 10179 2013-01-31 13:57:44,542 INFO Subprocess (10195) exited (0) 10179 2013-01-31 13:57:44,797 INFO Subprocess (10193) exited (0) 10179 2013-01-31 13:57:44,797 INFO All children exited # tail -F /var/log/cumin/web.log 17026 2013-01-31 14:08:13,606 DEBUG WallabyOperations: refreshing features 17026 2013-01-31 14:08:28,160 DEBUG Exiting on termination string from master 17026 2013-01-31 14:08:28,180 INFO Stopping Cumin 17026 2013-01-31 14:08:28,181 INFO Shutdown thread timed out, exiting # tail -F /var/log/cumin/master.log 17002 2013-01-31 14:07:45,717 INFO Started subprocess (pid 17026): cumin-web --section=web --es=exit --tm=0 --daemon 17002 2013-01-31 14:07:45,744 INFO Started subprocess (pid 17027): cumin-data --section=data.grid --es=exit --tm=0 --daemon 17002 2013-01-31 14:07:45,770 INFO Started subprocess (pid 17028): cumin-data --section=data.grid-slots --es=exit --tm=0 --daemon 17002 2013-01-31 14:07:45,846 INFO Started subprocess (pid 17029): cumin-data --section=data.grid-submissions --es=exit --tm=0 --daemon 17002 2013-01-31 14:07:45,960 INFO Started subprocess (pid 17030): cumin-data --section=data.sesame --es=exit --tm=0 --daemon 17002 2013-01-31 14:08:28,159 INFO Write termination string to all children 17002 2013-01-31 14:08:28,410 INFO Subprocess (17027) exited (0) 17002 2013-01-31 14:08:28,410 INFO Subprocess (17028) exited (0) 17002 2013-01-31 14:08:28,411 INFO Subprocess (17030) exited (0) 17002 2013-01-31 14:08:28,661 WARNING Timed out waiting for children, sending SIGKILL 17002 2013-01-31 14:08:28,662 WARNING Subprocess (17026) failed to stop, sending SIGKILL 17002 2013-01-31 14:08:28,662 WARNING Subprocess (17029) failed to stop, sending SIGKILL 17002 2013-01-31 14:08:28,662 INFO Subprocess (17029) exited (0) 17002 2013-01-31 14:08:28,920 INFO Subprocess (17026) exited (-9) 17002 2013-01-31 14:08:28,920 INFO All children exited # tail -F /var/log/cumin/web.log 21730 2013-01-31 13:36:09,375 DEBUG WallabyOperations: refreshing features 21730 2013-01-31 13:36:25,651 DEBUG Exiting on termination string from master 21730 2013-01-31 13:36:25,668 INFO Stopping Cumin 21730 2013-01-31 13:36:25,668 DEBUG WallabyOperations: waiting for cache maintenance thread to exit 21730 2013-01-31 13:36:30,690 INFO Shutdown thread timed out, exiting # tail -F /var/log/cumin/master.log 21712 2013-01-31 13:12:47,751 INFO Started subprocess (pid 21730): cumin-web --section=web --es=exit --tm=5 --daemon 21712 2013-01-31 13:12:47,778 INFO Started subprocess (pid 21732): cumin-data --section=data.grid --es=exit --tm=5 --daemon 21712 2013-01-31 13:12:47,817 INFO Started subprocess (pid 21734): cumin-data --section=data.grid-slots --es=exit --tm=5 --daemon 21712 2013-01-31 13:12:47,984 INFO Started subprocess (pid 21735): cumin-data --section=data.grid-submissions --es=exit --tm=5 --daemon 21712 2013-01-31 13:12:48,164 INFO Started subprocess (pid 21739): cumin-data --section=data.sesame --es=exit --tm=5 --daemon 21712 2013-01-31 13:36:25,622 INFO Write termination string to all children 21712 2013-01-31 13:36:26,139 INFO Subprocess (21739) exited (0) 21712 2013-01-31 13:36:26,390 INFO Subprocess (21735) exited (0) 21712 2013-01-31 13:36:26,640 INFO Subprocess (21732) exited (0) 21712 2013-01-31 13:36:26,897 INFO Subprocess (21734) exited (0) 21712 2013-01-31 13:36:31,199 WARNING Timed out waiting for children, sending SIGKILL 21712 2013-01-31 13:36:31,199 WARNING Subprocess (21730) failed to stop, sending SIGKILL 21712 2013-01-31 13:36:31,200 INFO Subprocess (21730) exited (0) 21712 2013-01-31 13:36:31,200 INFO All children exited >>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html