Bug 681124 - Cumin does not start on system boot
Summary: Cumin does not start on system boot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: cumin
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.0
Target Release: ---
Assignee: Trevor McKay
QA Contact: Jan Sarenik
URL:
Whiteboard:
Depends On:
Blocks: 693778
Reported: 2011-03-01 07:45 UTC by Jan Sarenik
Modified: 2011-06-23 15:40 UTC

Fixed In Version: cumin-0.1.4672-1
Doc Type: Bug Fix
Doc Text:
Cause: During bootup on some systems, the cumin database server may not be fully functional by the time the cumin service is started and checks the state of the database.
Consequence: This can result in /sbin/service cumin start reporting that the database has not been created when in fact it has. In this case, starting cumin after a short delay (a few seconds) will work.
Fix: If cumin detects that the database server process is running but it cannot make a connection to the server, it will retry the connection for up to 30 seconds before it reports an error. The full thirty seconds should not be needed; a successful connection should be made within a few seconds. Normally the connection will be made on the first attempt and there will be no additional delay at all.
Result: Cumin should now correctly detect and handle the case where the database server is not fully functional at the time when the cumin service is started.
Clone Of:
Environment:
Last Closed: 2011-06-23 15:40:26 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHEA-2011:0889
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Grid 2.0 Release
Last Updated: 2011-06-23 15:35:53 UTC

Description Jan Sarenik 2011-03-01 07:45:19 UTC
When the services are set up with 'chkconfig <service> on'
to start on system boot with the default ordering values,
cumin starts right after PostgreSQL and does not notice
that the database server is still initializing.

Seen in cumin-0.1.4560-1.el5 and many earlier versions.

How reproducible: 100% on system boot

Steps to Reproduce:
1. Install and set up cumin along with PgSQL
2. Reboot the machine

Actual results: Cumin was not started during boot.

------------------------------------------------
  ...
  Starting ntpd:                  [  OK  ]
  Starting postgresql service:    [  OK  ]
  Cumin's database is not yet installed
  Run 'cumin-database install' as root
  Starting Sesame daemon:         [  OK  ]
  ...
------------------------------------------------

Expected results: Cumin started correctly during boot.

Additional info: Either the database check should be
  changed to tolerate an SQL server that has not finished
  starting, or Cumin's default position in the start
  order should be moved later.

Comment 2 Trevor McKay 2011-03-04 17:11:28 UTC
I am having trouble reproducing this, unless I skip the "cumin-dabase install" step after installing the packages.

So after the machine boots, and you discover that cumin is not running, what do you have to do to make it run?

Please change the following line and run again.  This is the call that is generating the error based on the text above; the output may be helpful:

line 23 in /etc/init.d/cumin, remove the "&> /dev/null", so it looks like

cumin-database check || {

Comment 3 Trevor McKay 2011-03-04 17:12:49 UTC
(In reply to comment #2)
> I am having trouble reproducing this, unless I skip the "cumin-dabase install"
> step after installing the packages.

that would be "cumin-database install", actually :)

Comment 4 Jan Sarenik 2011-03-07 16:38:51 UTC
I am sorry, the reproducibility is not 100%.
A while ago I rebooted five times and two of
them were positive (cumin did not start).

Comment 6 Jan Sarenik 2011-03-08 14:38:08 UTC
Trevor, would it be possible to simply change the default chkconfig
start order for cumin so that it starts sometime after qpidd? I am sure
that would clear out any PostgreSQL timing issues we are coping with ATM.

BTW the recipe (the export-users part) has nothing to do with it,
I can reproduce without it as well.

Comment 7 Trevor McKay 2011-03-09 13:42:22 UTC
(In reply to comment #6)
> Trevor, would it be possible to simply change the default chkconfig
> start order for cumin so that it starts sometime after qpidd? I am sure
> that would clear out any PostgreSQL timing issues we are coping with ATM.
> 

Jan, 

This is certainly possible but I would like to avoid this and try to find out what the root cause is.  Moving cumin further away in time from Postgres masks the underlying issue, and it could bite us again later.

However, if you can verify that this actually works, then we can use it as a fallback until we can find the root cause.  Will you run some tests with the cumin start pushed further toward the end of the init sequence, and verify that the issue goes away?

Comment 8 Jan Sarenik 2011-03-09 13:57:12 UTC
Yes, I will do that today and report back with results.

Comment 9 Jan Sarenik 2011-03-09 16:57:42 UTC
It seems everything behaves normally when cumin starts last (99).
A simple toggle script:
-------------------------------------------------------------------
if
  grep "chkconfig:" /etc/init.d/cumin | grep -q 80
then
  sed -i '/chkconfig:/ c# chkconfig: 2345 99 30' /etc/init.d/cumin
else
  sed -i '/chkconfig:/ c# chkconfig: 2345 80 30' /etc/init.d/cumin
fi
chkconfig cumin off
chkconfig cumin on
ls /etc/rc3.d/*cumin
-------------------------------------------------------------------
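
For context on the toggle above: in the header line "# chkconfig: 2345 80 30", the three numbers are the runlevels (2345), the start priority (80), and the stop priority (30), so switching 80 to 99 moves cumin to the end of the start sequence. A minimal sketch for reading the start priority back out of an init script follows. It assumes the header is written with a space after the "#", and the start_priority helper name is hypothetical.

```shell
# start_priority FILE
# Print the start priority from the first
# "# chkconfig: <runlevels> <start> <stop>" header found in FILE.
# Hypothetical helper; assumes a space between "#" and "chkconfig:".
start_priority() {
    awk '$1 == "#" && $2 == "chkconfig:" { print $4; exit }' "$1"
}

# Example call: start_priority /etc/init.d/cumin
```

Running it before and after the toggle script is a quick way to confirm which ordering is currently in effect without listing the rc3.d symlinks.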

Comment 14 Jan Sarenik 2011-03-29 07:40:09 UTC
This is what happens with cumin-0.1.4669-1.el5 on boot:
-----------------------------------------------------------------------
ntpd: Synchronizing with time server:                      [  OK  ]
Starting ntpd:                                             [  OK  ]
Starting postgresql service:                               [  OK  ]
Cumin's database is not yet installed
Run 'cumin-database install' as root
(detailed output from cumin-database check:)
Checking environment ........ OK
Checking initialization ..... OK
Checking configuration ...... OK
Checking server ............. OK
Checking database 'cumin' ... Error: The database is not created
Hint: Run 'cumin-database create'
-----------------------------------------------------------------------

When I start it a minute later without making any changes,
everything works.

Comment 15 Trevor McKay 2011-03-29 20:53:44 UTC
Possible fix in 4672.

Added check-created-wait function that retries once a second for up to 30 seconds if output from psql indicates that a connection to the server could not be established.  This function is called from 'cumin-database check' after an existing check to see if the postgres server is running.

Based on Jan's comments above, it must be the case that after the server starts there may be an amount of time on some systems before psql commands can establish a connection to the server.  A psql command is used to determine whether or not the database is created.  Retrying for up to 30 seconds hopefully will give postgres time enough to spin up.  If it still fails we can make the time longer.
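
The retry behavior described above can be sketched roughly as follows. This is an illustration only, not the actual check-created-wait code from cumin-database: the function name retry_while_unreachable, the MAX_TRIES and RETRY_DELAY variables, and the exit-code convention of the probe command are all assumptions made for the sketch (the real function inspects psql output rather than an exit code).

```shell
# retry_while_unreachable PROBE [ARGS...]
# Hypothetical sketch, not the actual cumin-database code.
# Assumed exit-code convention for PROBE (an assumption of this sketch):
#   0 = database exists
#   1 = server answered, but the database is missing
#   2 = could not connect to the server
# MAX_TRIES (default 30) and RETRY_DELAY (default 1 second) are illustrative knobs.
retry_while_unreachable() {
    local tries
    local rc
    tries=0
    while :; do
        if "$@"; then
            return 0
        else
            rc=$?
        fi
        # Only "could not connect" is worth retrying; any other failure is final.
        if [ "$rc" -ne 2 ]; then
            return "$rc"
        fi
        tries=$((tries + 1))
        if [ "$tries" -ge "${MAX_TRIES:-30}" ]; then
            return 2
        fi
        sleep "${RETRY_DELAY:-1}"
    done
}
```

The point of separating "cannot connect" from "database missing" is that only the first is a timing problem. A database that is truly missing is reported immediately, so a normal boot or a fresh install sees no extra delay.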

Comment 16 Jan Sarenik 2011-03-30 08:54:05 UTC
Verified in cumin-0.1.4672-1.el5

I think 30 seconds will never be reached as Cumin needs maybe less
than a second after postgresql is started to establish a connection.
I mentioned that "minute" in Comment #14 just as an example of "later".
Anyway, the 30-second limit can stay there; it shouldn't hurt anyone.

Comment 17 Trevor McKay 2011-03-30 19:43:49 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    During bootup on some systems the cumin database server may not be fully functional by the time the cumin service is started and checks the state of the database.

Consequence
    This can result in /sbin/service cumin start reporting that the database has not been created, when in fact it actually has.  In this case, starting cumin after a short delay (a few seconds) will actually work.

Fix
    If cumin detects that the database server process is actually running but it cannot make a connection to the server, it will retry the connection for up to 30 seconds before it reports an error.  The full thirty seconds should not be needed; a successful connection should be made within a few seconds.  Normally the connection will be made on the first attempt and there will be no additional delay at all.

Result
    Cumin should now correctly detect and handle the case where the database server is not fully functional at the time when the cumin service is started.

Comment 18 errata-xmlrpc 2011-06-23 15:40:26 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

