Bug 801287

Summary: service cumin start missing pid file
Product: Red Hat Enterprise MRG Reporter: Stanislav Graf <sgraf>
Component: cumin Assignee: Trevor McKay <tmckay>
Status: CLOSED ERRATA QA Contact: Peter Belanyi <pbelanyi>
Severity: unspecified Docs Contact:
Priority: low    
Version: Development CC: athomas, ltoscano, matt, mkudlej, tmckay
Target Milestone: 2.3   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: cumin-0.1.5388-2 Doc Type: Bug Fix
Doc Text:
Cause: Cumin did not make use of a pid file.
Consequence: The missing pid file makes determining the true status of the cumin service more difficult.
Fix: Cumin now uses the /var/run/cumin.pid file.
Result: The pid file is created when the service is started and deleted when the service is stopped by initd. If the cumin service is not running but /var/run/cumin.pid exists, it is evidence of a program crash.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-03-06 18:42:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Stanislav Graf 2012-03-08 07:58:51 UTC
Description of problem:
"service cumin start" does not create the /var/run/cumin.pid file.

Version-Release number of selected component (if applicable):
cumin-0.1.5233-1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. service cumin start
2. [[ ! -f /var/run/cumin.pid ]] && echo "missing /var/run/cumin.pid file"


Actual results:
missing /var/run/cumin.pid

Expected results:
When "service cumin {start|restart}" is issued, the file /var/run/cumin.pid
should be created.
On "service cumin stop", the file /var/run/cumin.pid should be removed gracefully.

Additional info:

Comment 1 Trevor McKay 2012-05-17 20:11:18 UTC
Fixed in revision 5382.

Cumin now uses a pidfile (/var/run/cumin.pid).  On service start, the initd script creates a blank pid file owned by the 'cumin' user.  The cumin master script fills in the pid value when it starts if the "--p" option is passed.  The pidfile is deleted by the initd script on 'service stop' after the cumin master script exits.

To handle synchronization between initd and /usr/bin/cumin, a new /usr/sbin/cumin-checkpid executable has been added along with $CUMIN_HOME/log/.*.init files for all of the cumin processes (named by config section).

Cumin processes write startup status to the $CUMIN_HOME/log/.*.init files.  The /usr/bin/cumin-checkpid script checks for the pid value in /var/run/cumin.pid and a status value in the $CUMIN_HOME/log/.master.init file.  The cumin master script writes its status value after all of the subprocesses have passed their init checks and written their own files, or there has been a failure.  
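The polling described above can be sketched roughly as follows. This is not the real cumin-checkpid: the function name wait_for_start, its arguments, and the one-second retry are assumptions made for illustration. The only things taken from this report are that a pid value appears in the pid file and a status value appears in the master .init file.

```shell
# Sketch only, NOT the real cumin-checkpid.
# Arguments: pid file, master init file, number of one-second tries.
# Prints "<pid> <status>" once both files have content; returns 1 on timeout.
wait_for_start() {
    wfs_pidfile=$1
    wfs_initfile=$2
    wfs_tries=$3
    while [ "$wfs_tries" -gt 0 ]; do
        # -s is true only for files that exist and are not empty
        if [ -s "$wfs_pidfile" ] && [ -s "$wfs_initfile" ]; then
            wfs_pid=$(head -n 1 "$wfs_pidfile")
            wfs_status=$(head -n 1 "$wfs_initfile")
            echo "$wfs_pid $wfs_status"
            return 0
        fi
        wfs_tries=$((wfs_tries - 1))
        sleep 1
    done
    return 1
}
```

For example, wait_for_start /var/run/cumin.pid "$CUMIN_HOME/log/.master.init" 30 would give up after about 30 seconds if either file stays empty.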

The initd script calls /usr/bin/cumin-checkpid to find out the status of the service start and to wait for the cumin process to end on a service stop.  The double-start of cumin for init checks has been eliminated.  The initd script will delete the pidfile on a failed startup after the cumin master script has exited.  Therefore, if the cumin service is not running and there is a pidfile left over, it is the result of an unexpected crash (and not normal startup checks).
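As a rough illustration of how the pidfile can be interpreted, here is a minimal sketch. It is not code from the cumin initd script. The function name pidfile_status and the "stopped", "starting" and "running" labels are hypothetical, and the pidfile path is passed as an argument so the function can be tried on a scratch file.

```shell
# Sketch only: classify the service state from a pid file.
pidfile_status() {
    ps_file=$1
    if [ ! -e "$ps_file" ]; then
        echo "stopped"
        return 0
    fi
    ps_pid=$(head -n 1 "$ps_file")
    if [ -z "$ps_pid" ]; then
        # initd creates the file blank; the master script fills in the pid later
        echo "starting"
        return 0
    fi
    # kill -0 sends no signal; it only tests whether the pid can be signalled.
    # It also fails without permission, so run this as root or the owning user.
    if kill -0 "$ps_pid" 2>/dev/null; then
        echo "running"
    else
        echo "dead but pid file exists"
    fi
}
```

Calling pidfile_status /var/run/cumin.pid after a crash would be expected to report the last case, since nothing removed the file.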

Comment 2 Trevor McKay 2012-05-17 20:18:32 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    Cumin did not make use of a pid file.

Consequence
    The missing pid file makes determining the true status of the cumin service more difficult.

Fix
    Cumin now uses the /var/run/cumin.pid file.

Result
    The pid file is created when the service is started and deleted when the service is stopped by initd. If the cumin service is not running but /var/run/cumin.pid exists, it is evidence of a program crash.

Comment 3 Trevor McKay 2012-05-17 20:39:42 UTC
A note on testing:

The /etc/sysconfig/cumin file can be used to create startup failures in the master script and in the cumin child processes (cumin-data, cumin-web, and cumin-report).  Setting bad options and arguments here will cause them to exit during init checks so that creation/deletion of the pidfile can be seen (and synchronization of initd and the service exit).

The /etc/sysconfig/cumin file may contain a line like this.  It defines the options passed to /usr/bin/cumin:

CUMIN_OPTIONS="--web-options='--section=web4'"

In this case, we are defining --web-options to be passed as extra options to every cumin-web instance.  Note the use of quotes, which is very important.  A line like the one above can be used to make the cumin-web instances fail init checks because of a missing config section.

Here are a few more:

# make the data instances error
CUMIN_OPTIONS="--data-options='--some_bad_option'"

# make the report instance fail (if running)
CUMIN_OPTIONS="--report-options='--some_bad_option'"

# make the master script itself error
CUMIN_OPTIONS="--some_bad_option"

CUMIN_OPTIONS="extra_args"
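To illustrate why the inner quotes matter, here is a small demonstration of general shell behavior. It is not taken from the cumin init script, and how that script expands CUMIN_OPTIONS is not shown in this report. The names count_args, OPTS, --opt-a and --opt-b are hypothetical.

```shell
# Count the arguments a command would receive.
count_args() { echo "$#"; }

# Hypothetical option names, for illustration only.
OPTS="--web-options='--opt-a --opt-b'"

# Plain expansion: word splitting ignores the inner quotes, so the value
# is broken at the space and arrives as two arguments.
split_count=$(count_args $OPTS)

# Re-parsing with eval honours the inner quotes, so the value arrives
# as a single argument.
eval_count=$(eval "count_args $OPTS")

echo "plain=$split_count eval=$eval_count"
```

Without the inner quotes, the second option would be seen as a separate argument no matter how the value is expanded.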

Comment 6 Trevor McKay 2012-10-03 18:13:30 UTC
FYI, here is an interesting case.  It took cumin-web too long to shut down, so eventually cumin-web did a sysexit and everything stopped.  However, the init script had already timed out and left the pidfile, so "service cumin status" indicates that the shutdown may not have completed correctly.  This is the expected behavior in this scenario.

# service cumin stop
Stopping cumin:                                            [  OK  ]
Timed out, cumin may not have stopped completely.
# more /var/run/cumin.pid 
6023
# service cumin status
cumin dead but pid file exists

from web.log:
-------------

6026 2012-10-03 14:04:28,223 INFO Shutdown thread timed out, exiting

from master.log (note the timestamps on the last two entries)
---------------

6023 2012-10-03 13:47:48,326 INFO Started subprocess (pid 6026):  cumin-web --section=web --es=exit --tm=5 --daemon
6023 2012-10-03 13:47:48,344 INFO Started subprocess (pid 6027):  cumin-data --section=data.grid --es=exit --tm=5 --daemon
6023 2012-10-03 13:47:48,357 INFO Started subprocess (pid 6028):  cumin-data --section=data.grid-slots --es=exit --tm=5 --daemon
6023 2012-10-03 13:47:48,363 INFO Started subprocess (pid 6029):  cumin-data --section=data.grid-submissions --es=exit --tm=5 --daemon
6023 2012-10-03 13:47:48,501 INFO Started subprocess (pid 6030):  cumin-data --section=data.sesame --es=exit --tm=5 --daemon
6023 2012-10-03 14:04:23,220 INFO Write termination string to all children
6023 2012-10-03 14:04:23,471 INFO Subprocess (6028) exited
6023 2012-10-03 14:04:24,223 INFO Subprocess (6030) exited
6023 2012-10-03 14:04:24,474 INFO Subprocess (6027) exited
6023 2012-10-03 14:04:24,474 INFO Subprocess (6029) exited
6023 2012-10-03 14:04:28,483 INFO Subprocess (6026) exited
6023 2012-10-03 14:04:28,483 INFO All children exited
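
The behavior in this scenario can be sketched as follows. This is an illustration only, not the actual init script: the function names wait_for_exit and stop_cleanup and the one-second polling interval are assumptions. The point is that the pidfile is removed only if the process is seen to exit before the timeout.

```shell
# Sketch only: wait up to $2 seconds for pid $1 to exit.
# Returns 0 if the process is gone, 1 on timeout.
wait_for_exit() {
    wfe_pid=$1
    wfe_left=$2
    while kill -0 "$wfe_pid" 2>/dev/null; do
        if [ "$wfe_left" -le 0 ]; then
            return 1
        fi
        sleep 1
        wfe_left=$((wfe_left - 1))
    done
    return 0
}

# Remove the pid file only when the process really exited in time.
# Arguments: pid file, timeout in seconds.
stop_cleanup() {
    sc_pidfile=$1
    sc_timeout=$2
    sc_pid=$(head -n 1 "$sc_pidfile")
    if wait_for_exit "$sc_pid" "$sc_timeout"; then
        rm -f "$sc_pidfile"
    else
        # The pid file is left in place on purpose
        echo "Timed out, cumin may not have stopped completely."
        return 1
    fi
}
```

If the process exits just after the timeout, as in the logs above, the pidfile stays behind and a later status check reports a dead process with a pid file.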

Comment 7 Trevor McKay 2012-10-03 18:42:04 UTC
Re Comment 3,

If /etc/sysconfig/cumin is used to create one of the startup failure scenarios listed there (one is enough), then a tight bash loop can be used to check for the existence of /var/run/cumin.pid.  The file will come and go.  I did it by hand with multiple windows and command history :)
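
A loop of this kind might look like the sketch below. The function name watch_pidfile and its arguments are hypothetical; it prints a line only when the file appears or disappears.

```shell
# Sketch only: poll a path and report each change between present and absent.
# Arguments: path, number of polls, delay between polls in seconds.
watch_pidfile() {
    wp_path=$1
    wp_polls=$2
    wp_delay=$3
    wp_prev=""
    while [ "$wp_polls" -gt 0 ]; do
        if [ -e "$wp_path" ]; then
            wp_now="present"
        else
            wp_now="absent"
        fi
        # Print only on a state change, including the first observation
        if [ "$wp_now" != "$wp_prev" ]; then
            echo "$wp_now"
            wp_prev=$wp_now
        fi
        wp_polls=$((wp_polls - 1))
        sleep "$wp_delay"
    done
}
```

Running watch_pidfile /var/run/cumin.pid 600 0.1 in one window while issuing "service cumin start" in another should show the file appearing and then disappearing on a failed start. A fractional delay assumes a sleep that accepts one, such as GNU sleep.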

By the way, here is expected output for an init-check failure of cumin-web (starting with an empty log directory)

(set this in /etc/sysconfig/cumin to make it fail)
CUMIN_OPTIONS="--web-options='--somebadoption'"

# service cumin start
Starting cumin:                                            [FAILED]

# more /var/log/cumin/master.log 
7277 2012-10-03 14:30:14,589 INFO Started subprocess (pid 7280):  cumin-web --section=web --es=exit --tm=5 --daemon --somebadoption
7277 2012-10-03 14:30:14,613 INFO Started subprocess (pid 7281):  cumin-data --section=data.grid --es=exit --tm=5 --daemon
7277 2012-10-03 14:30:14,714 INFO Started subprocess (pid 7282):  cumin-data --section=data.grid-slots --es=exit --tm=5 --daemon
7277 2012-10-03 14:30:14,720 INFO Started subprocess (pid 7283):  cumin-data --section=data.grid-submissions --es=exit --tm=5 --daemon
7277 2012-10-03 14:30:14,973 INFO Started subprocess (pid 7284):  cumin-data --section=data.sesame --es=exit --tm=5 --daemon
7277 2012-10-03 14:30:15,474 ERROR Subprocess (7280) failed init checks with status 3 (parse error), error in options, arguments, or config values
7277 2012-10-03 14:30:15,474 INFO Subprocess logs may contain more details.
7277 2012-10-03 14:30:15,474 INFO Stopping cumin
7277 2012-10-03 14:30:15,475 INFO Write termination string to all children
7277 2012-10-03 14:30:15,730 INFO Subprocess (7283) exited
7277 2012-10-03 14:30:15,981 INFO Subprocess (7281) exited
7277 2012-10-03 14:30:15,981 INFO Subprocess (7282) exited
7277 2012-10-03 14:30:15,982 INFO Subprocess (7284) exited
7277 2012-10-03 14:30:15,982 INFO All children exited

# more /var/log/cumin/web.stderr 
usage: cumin-web [options]

cumin-web: error: no such option: --somebadoption
7280 2012-10-03 14:30:15,183 ERROR Error in options

# more /var/log/cumin/.*.init
::::::::::::::
.data.grid.init
::::::::::::::
0
::::::::::::::
.data.grid-slots.init
::::::::::::::
0
::::::::::::::
.data.grid-submissions.init
::::::::::::::
0
::::::::::::::
.data.sesame.init
::::::::::::::
0
::::::::::::::
.master.init
::::::::::::::
3
exit
::::::::::::::
.web.init
::::::::::::::
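
For a more compact view of the same files, something like the following sketch could be used. The function name init_summary is hypothetical, and it assumes only what the listing above shows: each .*.init file holds a status value on its first line, or is empty.

```shell
# Sketch only: print "<file name>: <first line>" for every .*.init file
# in the given directory. Empty files are shown as <empty>.
init_summary() {
    for is_file in "$1"/.*.init; do
        # Skip the unexpanded pattern when nothing matches
        [ -f "$is_file" ] || continue
        is_status=$(head -n 1 "$is_file")
        echo "$(basename "$is_file"): ${is_status:-<empty>}"
    done
}
```

With the files shown above, init_summary /var/log/cumin would list 3 for .master.init and <empty> for .web.init.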

Comment 9 Peter Belanyi 2013-01-16 09:42:10 UTC
I was able to reproduce on cumin-0.1.5192-4

Verified on RHEL5 and RHEL6, both i386 and x86_64, with cumin-0.1.5648-1

--

During the verification I also tried the cases mentioned in comment 3.

When using this option:

  # make the report instance fail (if running)
  CUMIN_OPTIONS="--report-options='--some_bad_option'"

starting cumin failed on RHEL6, but it was successful on RHEL5. The reason is that this bad option should be parsed by the cumin-report subprocess, but on RHEL5 that subprocess is not started at all. As far as I know this is the expected behaviour, so I am setting this bz as verified.

Comment 12 errata-xmlrpc 2013-03-06 18:42:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html