Bug 781561 - aeolus-conductor fails to start after system reboot -- Service Temporarily Unavailable
Summary: aeolus-conductor fails to start after system reboot -- Service Temporarily Un...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: iwhd
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: 1.0.2
Assignee: Pete Zaitcev
QA Contact: Brad P. Crochet
URL:
Whiteboard:
Depends On:
Blocks: 786295
TreeView+ depends on / blocked
 
Reported: 2012-01-13 18:07 UTC by James Laska
Modified: 2013-10-14 20:07 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 786295 (view as bug list)
Environment:
Last Closed: 2013-10-14 20:07:53 UTC


Attachments (Terms of Use)
boot log (8.22 KB, text/plain)
2012-01-13 20:01 UTC, James Laska
no flags Details

Description James Laska 2012-01-13 18:07:30 UTC
Description of problem:

After rebooting a properly configured aeolus-conductor system, conductor does not work.  

Aeolus-check-services shows all service are running as expected.  The problem can be corrected by running: aeolus-restart-services.  More fine grained, the problem is resolved by restarting httpd.

This implies that the service stop/start order used by aeolus-restart-services works, while the default initscript stop/start order does not.

Version-Release number of selected component (if applicable):
 * aeolus-all-0.8.0-5.el6.src.rpm
 * aeolus-conductor-0.8.0-5.el6.src.rpm
 * aeolus-configure-2.5.0-3.el6.src.rpm
 * rubygem-aeolus-cli-0.3.0-3.el6.src.rpm
 * rubygem-aeolus-image-0.3.0-2.el6.src.rpm

How reproducible:
 * 100%

Steps to Reproduce:
1. Install and configure a working conductor instance
2. Reboot the system
3. Attempt to access the conductor web-ui
  
Actual results:


Expected results:


Additional info:

 * The order aeolus-restart-services operates ...

    == aeolus-restart-services stop order ==
    ntpd
    imagefactory
    conductor-dbomatic
    aeolus-conductor
    libvirtd
    deltacloud-core
    qpidd
    httpd
    postgresql
    iwhd
    mongod

    == aeolus-restart-services start order ==
    mongod
    iwhd
    postgresql
    httpd
    qpidd
    deltacloud-core
    libvirtd
    aeolus-conductor
    conductor-dbomatic
    imagefactory
    ntpd

 * The order the each service stops and starts on a reboot ...

    == chkconfig start order ==
    20mongod
    40iwhd
    58ntpd
    64postgresql
    85httpd  <-- DIFFERENT -->
    85qpidd  <-- DIFFERENT -->
    90deltacloud-core
    97libvirtd
    99aeolus-conductor
    99conductor-dbomatic
    99imagefactory

    == chkconfig stop order ==
    01aeolus-conductor
    01conductor-dbomatic
    01imagefactory
    03libvirtd
    15httpd
    15qpidd
    36postgresql
    60deltacloud-core
    60iwhd
    74ntpd
    80mongod

Comment 1 James Laska 2012-01-13 18:24:00 UTC
Apologies, prematurely clicked <submit>

Actual results:

# aeolus-image list --images
ERROR:   => Failed.  Response code = 503.  Response message = Service Temporarily Unavailable.

Expected results:

# aeolus-image list --images
ID                                       Name             OS         OS Version     Arch       Description     
------------------------------------     ------------     ------     ----------     ------     -----------     
4da54731-eaa4-4b14-9452-df4e0b5f78f3     configserver     RHEL-6     2              x86_64                     
109bbb05-1685-4dba-8c71-39ab3a2b9c38     client           RHEL-6     2              x86_64                     
63028c1f-841f-46a6-975f-df4038f8370d     client           RHEL-6     2              x86_64

Comment 2 James Laska 2012-01-13 20:01:23 UTC
Created attachment 555137 [details]
boot log

I suspect this might be related to conductor (or related component) not shutting down properly?

Comment 3 John Eckersberg 2012-01-17 19:15:33 UTC
I'm not able to reproduce this on a fresh installation.

The first error I see in the boot log is:

waiting for mongod to listen on localhost:27017waiting for mongod to listen on localhost:27017[FAILED]

If mongod is not starting then that certainly can explain the issue on reboot.  Is there anything of note in /var/log/mongodb/mongodb.log ?

Comment 4 James Laska 2012-01-17 19:53:42 UTC
(In reply to comment #3)
> I'm not able to reproduce this on a fresh installation.

I'm having a hell of a time reproducing it now also.  I noticed I was getting a kernel panic on reboot for this system, which caused the system to improperly shutdown.  It's entirely possible that might have resulted in an incomplete service shutdown for conductor (or one of it's dependencies).

If the components are not properly shutdown, that might also explain why they failed to start up.

For example ...

 1. Start from a working setup where the conductor web-ui works fine
 2. echo b > /proc/sysrq-trigger
 3. After reboot, attempt to access conductor

I think this might point out that one (or more) of the initscripts may not properly handle their pid/lock files?

Comment 5 John Eckersberg 2012-01-17 20:40:41 UTC
OK, that helped a lot and let me reproduce it easily by just force halting my testing VM.

End result is that mongod fails to start on the next boot, here is the log:

Tue Jan 17 15:03:21 [initandlisten] MongoDB starting : pid=1426 port=27017 dbpath=/var/lib/mongodb 64-bit 
Tue Jan 17 15:03:21 [initandlisten] db version v1.8.0, pdfile version 4.5
Tue Jan 17 15:03:21 [initandlisten] git version: nogitversion
Tue Jan 17 15:03:21 [initandlisten] build sys info: Linux virtlab30.virt.bos.redhat.com 2.6.35.6-45.fc14.x86_64 #1 SMP Mon Oct 18 23:57:44 UTC 2010 x86_64 BOOST_LIB_VERSION=1_41
************** 
old lock file: /var/lib/mongodb/mongod.lock.  probably means unclean shutdown
recommend removing file and running --repair
see: http://dochub.mongodb.org/core/repair for more information
*************
Tue Jan 17 15:03:21 [initandlisten] exception in initAndListen std::exception: old lock file, terminating
Tue Jan 17 15:03:21 dbexit: 
Tue Jan 17 15:03:21 [initandlisten] shutdown: going to close listening sockets...
Tue Jan 17 15:03:21 [initandlisten] shutdown: going to flush diaglog...
Tue Jan 17 15:03:21 [initandlisten] shutdown: going to close sockets...
Tue Jan 17 15:03:21 [initandlisten] shutdown: waiting for fs preallocator...
Tue Jan 17 15:03:21 [initandlisten] shutdown: closing all files...
Tue Jan 17 15:03:21 closeAllFiles() finished
Tue Jan 17 15:03:21 dbexit: really exiting now

The reason aeolus-restart-services works is because it explicitly handles this case:

## Other checks

if perform(:status, ['mongod']) == [1]
  lockfile = '/var/lib/mongodb/mongod.lock'
  if File.exists?(lockfile)
    puts " \e[1;33mremoving\e[0m leftover #{lockfile}"
    File.delete(lockfile)
    perform :restart, %w(mongod iwhd)
  end
end


If you read the documentation at the given link (http://dochub.mongodb.org/core/repair) that behavior is frowned upon and may cause data loss.

One option is to configure mongod to use journaling, which should avoid the failed startup.  In mongod >= 1.9, this is the default; el6 has 1.8 so we would need to explicitly enable it.  The downside is that the first time mongod starts, it preallocates the journal which takes several minutes.  We would need to block iwhd startup for however long this takes.  This is less than ideal.

Another option would be to leave journaling off and introduce a script that runs early on in the init process (before mongod) that checks for the lockfile, and if it exists remove it and run a mongod --repair.  This is a total hack.

Comment 6 James Laska 2012-01-17 21:20:50 UTC
Thanks for the comments.  Just a quick note... as far as I can tell, RHEL6 has no mongodb yet (see bug#745171).  I believe we are carrying it in our CloudForms puddles at this time.  So if we wanted to update, that option may be available to us.

Some of this will all be fixed when using systemd properly anyway.  Unfortunately, that's not an option for RHEL6. :(

(In reply to comment #5)
> Another option would be to leave journaling off and introduce a script that
> runs early on in the init process (before mongod) that checks for the lockfile,
> and if it exists remove it and run a mongod --repair.  This is a total hack.

It seems appropriate to me that the mongodb initscript could be responsible for checking for the presence of a failed shutdown (no pid, but a lockfile exists) and automatically running --repair.  Would that be worth pursuing?

Comment 7 Mike Orazi 2012-01-17 22:10:37 UTC
per jrd's advice moving it to Jim for further comment and/or ideas for a final solution.

Comment 8 Jim Meyering 2012-01-24 11:37:43 UTC
So far, I like John Eckersberg's suggestion to turn on mongodb's journalling.
(add "journal = true" to /etc/mongodb.conf)

Assuming we use that approach, there are some issues:

What can we do about the ~3-minute (first-time-only) initialization
delay while mongodb actually writes 3GB of files?

    How about doing that as part of the already long-running
    vm-creation process?  It's hard to know how much that would
    add to the image size, but I suspect that the added 3GB would
    end up being very compressible.

Continue with mongodb-1.8.2, or upgrade to something newer?

    I am a little leery of enabling journalling in our out-of-date
    mongodb-1.8.2 (1.8.4 is latest in that series, with a week-old 1.8.5-rc0)
    even though searching for journalling-related problems in 1.8.x
    didn't find much.  IMHO, iwhd's use of mongodb is not likely
    to trigger many fringe bugs in mongodb, so simply using our
    tried and true 1.8.2 may be just fine.

    For the record, mongodb-2.0.2 is the "recommended" production release,
    per http://www.mongodb.org/downloads, and rawhide has mongodb-2.0.2-5.

Adjust iwhd's init script to wait longer for mongod to start?

    Currently it waits for only 2 seconds:

      # How many seconds to wait for mongod to become usable before giving up.
      MONGOD_N_SECONDS=2

    An alternative to the proposed off-line mongodb journal initialization
    would be to increase the maximum delay to say, 4 minutes, and to add
    checks ensuring that mongodb really is running in that case (just not
    yet responding) and to give reassuring diagnostics to help folks
    endure the long wait.  I do not like this.  2-4 minutes is interminable...

Finally, since we're considering rather fundamental changes
(enabling journalling in the relatively-old mongodb-1.8.2, and/or
upgrading to newer mongodb) to address the hard-to-hit problem
of an irregularly-terminated mongodb process, I have to wonder...
Should we consider another approach: simply documenting the
recovery procedure and make iwhd propose that via its log if/when
it finds that mongodb has not started in time?

Comment 9 jrd 2012-01-24 14:19:49 UTC
I vote we document the issue, including the fact that it's not easy to provoke, and let users recover manually.  I think at this point that's less risky than changing mogo versions or other big things.

Wes, Mike?

Comment 10 James Laska 2012-01-24 15:13:52 UTC
(In reply to comment #9)
> I vote we document the issue, including the fact that it's not easy to provoke,
> and let users recover manually.  I think at this point that's less risky than
> changing mogo versions or other big things.
> 
> Wes, Mike?

I defer to Wes, but that seems like an appropriate short-term solution.  We can document this for v1, and re-purpose this bug for a more appropriate long-term maintainable solution.

Comment 11 Jim Meyering 2012-02-02 16:40:59 UTC
This has been punted to 1.1, per jrd.

Comment 12 James Laska 2012-02-02 19:22:45 UTC
(In reply to comment #11)
> This has been punted to 1.1, per jrd.

I believe we want to keep the version as the version the bug was found.  We can adjust the cloudforms-X.Y.Z flag to target resolution for a different version.

To punt on this, you can supply a devel_ack- ... or clear the flags and re-propose for cloudforms-1.1.0?  The latter would just be easier, so I've done it.

Comment 13 Jim Meyering 2012-02-02 19:31:15 UTC
Thank you, James.

Comment 14 Jim Meyering 2012-02-06 17:30:02 UTC
FYI, mongod just got journalling enabled everywhere (F15,16, RHEL5,6)

Comment 15 Jim Meyering 2012-04-25 14:48:35 UTC
By virtue of the mongod upgrade+journalling change of two months ago, this should be much less of an issue.  There's still the possibility that someone using a pre-systemd system (rhel6) will hit the DB initialization problem on the very first service start, since that takes so much longer than the default 2-second timeout.  On second and subsequent starts, the DB should be created and the 2-second timeout should be fine.

Given the above and that no one has reported subsequent problems, I'm inclined to say we don't need a release note for this.

Comment 16 Hugh Brock 2012-05-02 19:29:46 UTC
removing relnote flag per Jim.

Comment 17 Mitch 2012-06-07 19:25:36 UTC
Let's revisit this and make a final decision on whether to make any changes or close.

Comment 18 Mike Orazi 2012-06-20 17:45:15 UTC
We can likely re-align this with next gen of imagefactory.

Comment 19 Mike Orazi 2012-08-21 20:43:25 UTC
Aligning to 2.0

Comment 20 Angus Thomas 2012-12-20 15:36:00 UTC
Dave,

Can you please arrange for this to be retested?

Comment 24 Pete Zaitcev 2013-10-14 20:07:53 UTC
This was ON_QA forever, closing.


Note You need to log in before you can comment on or make changes to this bug.