Description of problem:

After rebooting a properly configured aeolus-conductor system, conductor does
not work. aeolus-check-services shows all services running as expected. The
problem can be corrected by running aeolus-restart-services. More
specifically, the problem is resolved by restarting httpd alone. This implies
that the service stop/start order used by aeolus-restart-services works, while
the default initscript stop/start order does not.

Version-Release number of selected component (if applicable):
* aeolus-all-0.8.0-5.el6.src.rpm
* aeolus-conductor-0.8.0-5.el6.src.rpm
* aeolus-configure-2.5.0-3.el6.src.rpm
* rubygem-aeolus-cli-0.3.0-3.el6.src.rpm
* rubygem-aeolus-image-0.3.0-2.el6.src.rpm

How reproducible:
* 100%

Steps to Reproduce:
1. Install and configure a working conductor instance
2. Reboot the system
3. Attempt to access the conductor web-ui

Actual results:

Expected results:

Additional info:
* The order in which aeolus-restart-services operates ...

== aeolus-restart-services stop order ==
ntpd
imagefactory
conductor-dbomatic
aeolus-conductor
libvirtd
deltacloud-core
qpidd
httpd
postgresql
iwhd
mongod

== aeolus-restart-services start order ==
mongod
iwhd
postgresql
httpd
qpidd
deltacloud-core
libvirtd
aeolus-conductor
conductor-dbomatic
imagefactory
ntpd

* The order in which each service stops and starts on a reboot ...

== chkconfig start order ==
20mongod
40iwhd
58ntpd
64postgresql
85httpd              <-- DIFFERENT -->
85qpidd              <-- DIFFERENT -->
90deltacloud-core
97libvirtd
99aeolus-conductor
99conductor-dbomatic
99imagefactory

== chkconfig stop order ==
01aeolus-conductor
01conductor-dbomatic
01imagefactory
03libvirtd
15httpd
15qpidd
36postgresql
60deltacloud-core
60iwhd
74ntpd
80mongod
Apologies, prematurely clicked <submit>.

Actual results:

# aeolus-image list --images
ERROR:  => Failed. Response code = 503. Response message = Service Temporarily Unavailable.

Expected results:

# aeolus-image list --images
ID                                   Name         OS     OS Version Arch   Description
------------------------------------ ------------ ------ ---------- ------ -----------
4da54731-eaa4-4b14-9452-df4e0b5f78f3 configserver RHEL-6 2          x86_64
109bbb05-1685-4dba-8c71-39ab3a2b9c38 client       RHEL-6 2          x86_64
63028c1f-841f-46a6-975f-df4038f8370d client       RHEL-6 2          x86_64
Created attachment 555137 [details]
boot log

I suspect this might be related to conductor (or a related component) not shutting down properly?
I'm not able to reproduce this on a fresh installation.

The first error I see in the boot log is:

    waiting for mongod to listen on localhost:27017waiting for mongod to listen on localhost:27017[FAILED]

If mongod is not starting, that would certainly explain the issue on reboot.
Is there anything of note in /var/log/mongodb/mongodb.log?
(In reply to comment #3)
> I'm not able to reproduce this on a fresh installation.

I'm having a hell of a time reproducing it now also. I noticed I was getting a
kernel panic on reboot for this system, which caused the system to shut down
improperly. It's entirely possible that resulted in an incomplete service
shutdown for conductor (or one of its dependencies). If the components are not
properly shut down, that might also explain why they fail to start up. For
example ...

1. Start from a working setup where the conductor web-ui works fine
2. echo b > /proc/sysrq-trigger
3. After reboot, attempt to access conductor

I think this might point out that one (or more) of the initscripts may not
properly handle their pid/lock files.
OK, that helped a lot and let me reproduce it easily by just force-halting my
testing VM. The end result is that mongod fails to start on the next boot;
here is the log:

Tue Jan 17 15:03:21 [initandlisten] MongoDB starting : pid=1426 port=27017 dbpath=/var/lib/mongodb 64-bit
Tue Jan 17 15:03:21 [initandlisten] db version v1.8.0, pdfile version 4.5
Tue Jan 17 15:03:21 [initandlisten] git version: nogitversion
Tue Jan 17 15:03:21 [initandlisten] build sys info: Linux virtlab30.virt.bos.redhat.com 2.6.35.6-45.fc14.x86_64 #1 SMP Mon Oct 18 23:57:44 UTC 2010 x86_64 BOOST_LIB_VERSION=1_41
**************
old lock file: /var/lib/mongodb/mongod.lock. probably means unclean shutdown
recommend removing file and running --repair
see: http://dochub.mongodb.org/core/repair for more information
*************
Tue Jan 17 15:03:21 [initandlisten] exception in initAndListen std::exception: old lock file, terminating
Tue Jan 17 15:03:21 dbexit:
Tue Jan 17 15:03:21 [initandlisten] shutdown: going to close listening sockets...
Tue Jan 17 15:03:21 [initandlisten] shutdown: going to flush diaglog...
Tue Jan 17 15:03:21 [initandlisten] shutdown: going to close sockets...
Tue Jan 17 15:03:21 [initandlisten] shutdown: waiting for fs preallocator...
Tue Jan 17 15:03:21 [initandlisten] shutdown: closing all files...
Tue Jan 17 15:03:21 closeAllFiles() finished
Tue Jan 17 15:03:21 dbexit: really exiting now

The reason aeolus-restart-services works is that it explicitly handles this
case:

    ## Other checks
    if perform(:status, ['mongod']) == [1]
      lockfile = '/var/lib/mongodb/mongod.lock'
      if File.exists?(lockfile)
        puts " \e[1;33mremoving\e[0m leftover #{lockfile}"
        File.delete(lockfile)
        perform :restart, %w(mongod iwhd)
      end
    end

If you read the documentation at the given link
(http://dochub.mongodb.org/core/repair), that behavior is frowned upon and may
cause data loss.

One option is to configure mongod to use journaling, which should avoid the
failed startup. In mongod >= 1.9 this is the default; el6 has 1.8, so we would
need to enable it explicitly. The downside is that the first time mongod
starts, it preallocates the journal, which takes several minutes. We would
need to block iwhd startup for however long that takes. This is less than
ideal.

Another option would be to leave journaling off and introduce a script that
runs early in the init process (before mongod), checks for the lockfile, and
if it exists removes it and runs a mongod --repair. This is a total hack.
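For concreteness, a minimal sketch of what the journaling option could look
like, assuming the stock /etc/mongodb.conf location on el6; treat this as
illustration, not the final change:

    # /etc/mongodb.conf (excerpt, illustrative only)
    # Write-ahead journaling lets mongod recover from an unclean shutdown
    # instead of refusing to start over a stale mongod.lock.
    # Note: the first start after enabling this preallocates journal files,
    # which can take several minutes.
    journal = true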
Thanks for the comments. Just a quick note ... as far as I can tell, RHEL6 has
no mongodb yet (see bug#745171). I believe we are carrying it in our
CloudForms puddles at this time, so if we wanted to update, that option may be
available to us.

Some of this will all be fixed when using systemd properly anyway.
Unfortunately, that's not an option for RHEL6. :(

(In reply to comment #5)
> Another option would be to leave journaling off and introduce a script that
> runs early in the init process (before mongod), checks for the lockfile, and
> if it exists removes it and runs a mongod --repair. This is a total hack.

It seems appropriate to me that the mongodb initscript could be responsible
for checking for the presence of a failed shutdown (no pid, but a lockfile
exists) and automatically running --repair. Would that be worth pursuing?
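To make the idea concrete, a rough sketch of what such a check might look like
near the top of the initscript's start path; the lockfile path and dbpath come
from the log in comment #5, while the 'mongodb' service user and the use of
pgrep are assumptions, not the shipped script:

    # Illustrative fragment only, not the packaged mongod initscript.
    LOCKFILE=/var/lib/mongodb/mongod.lock
    # A non-empty lock file with no live mongod means the previous instance
    # died without cleaning up (the post-crash state shown in comment #5).
    if [ -s "$LOCKFILE" ] && ! pgrep -x mongod >/dev/null 2>&1; then
        echo "mongod: stale lock file found, running --repair before start"
        rm -f "$LOCKFILE"
        # Run the repair as the service user (assumed to be 'mongodb') so the
        # data files keep their ownership.
        su -s /bin/bash -c "mongod --repair --dbpath /var/lib/mongodb" mongodb
    fi
    # ... then continue with the normal daemon startup ...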
Per jrd's advice, moving this to Jim for further comment and/or ideas for a
final solution.
So far, I like John Eckersberg's suggestion to turn on mongodb's journalling
(add "journal = true" to /etc/mongodb.conf). Assuming we use that approach,
there are some issues:

* What can we do about the ~3-minute (first-time-only) initialization delay
  while mongodb actually writes 3GB of files? How about doing that as part of
  the already long-running vm-creation process? It's hard to know how much
  that would add to the image size, but I suspect that the added 3GB would end
  up being very compressible.

* Continue with mongodb-1.8.2, or upgrade to something newer? I am a little
  leery of enabling journalling in our out-of-date mongodb-1.8.2 (1.8.4 is the
  latest in that series, with a week-old 1.8.5-rc0), even though searching for
  journalling-related problems in 1.8.x didn't find much. IMHO, iwhd's use of
  mongodb is not likely to trigger many fringe bugs in mongodb, so simply
  using our tried-and-true 1.8.2 may be just fine. For the record,
  mongodb-2.0.2 is the "recommended" production release, per
  http://www.mongodb.org/downloads, and rawhide has mongodb-2.0.2-5.

* Adjust iwhd's init script to wait longer for mongod to start? Currently it
  waits for only 2 seconds:

      # How many seconds to wait for mongod to become usable before giving up.
      MONGOD_N_SECONDS=2

  An alternative to the proposed off-line mongodb journal initialization would
  be to increase the maximum delay to, say, 4 minutes, to add checks ensuring
  that mongodb really is running in that case (just not yet responding), and
  to give reassuring diagnostics to help folks endure the long wait (see the
  sketch after this comment). I do not like this; 2-4 minutes is interminable.

Finally, since we're considering rather fundamental changes (enabling
journalling in the relatively old mongodb-1.8.2, and/or upgrading to a newer
mongodb) to address the hard-to-hit problem of an irregularly-terminated
mongodb process, I have to wonder: should we consider another approach, namely
simply documenting the recovery procedure and making iwhd propose it via its
log if/when it finds that mongodb has not started in time?
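A hedged sketch of what that longer wait could look like in the iwhd
initscript; MONGOD_N_SECONDS is the real variable quoted above, but the loop
body (the pgrep liveness check and the bash /dev/tcp port probe) is an
assumption about one possible implementation, not the current script:

    # How many seconds to wait for mongod to become usable before giving up.
    MONGOD_N_SECONDS=240   # long enough to cover first-boot journal preallocation
    waited=0
    while [ "$waited" -lt "$MONGOD_N_SECONDS" ]; do
        # Give up early if mongod has died outright rather than just being slow.
        if ! pgrep -x mongod >/dev/null 2>&1; then
            echo "mongod is not running; giving up" >&2
            break
        fi
        # Stop waiting as soon as the port answers (bash-specific /dev/tcp probe).
        if (echo > /dev/tcp/localhost/27017) 2>/dev/null; then
            break
        fi
        echo "still waiting for mongod on localhost:27017 (journal preallocation can take minutes)"
        sleep 5
        waited=$((waited + 5))
    done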
I vote we document the issue, including the fact that it's not easy to
provoke, and let users recover manually. I think at this point that's less
risky than changing mongo versions or other big things.

Wes, Mike?
(In reply to comment #9)
> I vote we document the issue, including the fact that it's not easy to
> provoke, and let users recover manually. I think at this point that's less
> risky than changing mongo versions or other big things.
>
> Wes, Mike?

I defer to Wes, but that seems like an appropriate short-term solution. We can
document this for v1 and re-purpose this bug for a more appropriate long-term,
maintainable solution.
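If we go the documentation route, a rough sketch of the manual recovery an
administrator would perform; the dbpath and the mongod/iwhd service names
appear earlier in this bug, while the 'mongodb' service user is an assumption
that should be verified against the packaging before it goes into any docs:

    service mongod status                 # confirm mongod really is stopped
    rm -f /var/lib/mongodb/mongod.lock    # remove the stale lock left by the crash
    # Repair as the service user so the data files keep their ownership.
    su -s /bin/bash -c "mongod --repair --dbpath /var/lib/mongodb" mongodb
    service mongod start
    service iwhd restart                  # iwhd depends on mongod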
This has been punted to 1.1, per jrd.
(In reply to comment #11)
> This has been punted to 1.1, per jrd.

I believe we want to keep the version set to the version in which the bug was
found. We can adjust the cloudforms-X.Y.Z flag to target resolution for a
different version. To punt on this, you can supply a devel_ack- ... or clear
the flags and re-propose for cloudforms-1.1.0? The latter would just be
easier, so I've done it.
Thank you, James.
FYI, mongod just got journalling enabled everywhere (F15, F16, RHEL5, RHEL6).
By virtue of the mongod upgrade+journalling change of two months ago, this should be much less of an issue. There's still the possibility that someone using a pre-systemd system (rhel6) will hit the DB initialization problem on the very first service start, since that takes so much longer than the default 2-second timeout. On second and subsequent starts, the DB should be created and the 2-second timeout should be fine. Given the above and that no one has reported subsequent problems, I'm inclined to say we don't need a release note for this.
Removing the relnote flag, per Jim.
Let's revisit this and make a final decision on whether to make any changes or close.
We can likely re-align this with the next generation of imagefactory.
Aligning to 2.0
Dave,

Can you please arrange for this to be retested?
This was ON_QA forever, closing.