Description of problem: Creating this BZ for tracking ...... (From Aziza Karol) : I upgraded aeolus from 1.0 to 1.0.1 and kept it running for a week.. when I checked the services, I see "FAILURE: mongod dead but subsys locked.". Also, iwhd is stopped. Version-Release number of selected component (if applicable): rpms tested: >> rpm -qa |grep aeolus aeolus-configure-2.5.10-1.el6cf.noarch aeolus-all-0.8.34-1.el6cf.noarch aeolus-conductor-doc-0.8.34-1.el6cf.noarch aeolus-conductor-daemons-0.8.34-1.el6cf.noarch rubygem-aeolus-image-0.3.0-12.el6.noarch rubygem-aeolus-cli-0.3.3-2.el6_2.noarch aeolus-conductor-0.8.34-1.el6cf.noarch >> rpm -qa |grep iwhd iwhd-1.5-2.el6.x86_64 How reproducible: retrying the test to see Steps to Reproduce: 1. Install aeolus 1.0 2. Launch instances 3. Upgrade to aeolus 1.0.1 4. Leave aeolus running for a wekk - or at leat 8 hours 5. Run aeolus-check-service and then aeolus-restart-services Actual results: >> aeolus-check-services Checking mongod ... FAILURE: mongod dead but subsys locked Checking iwhd ... FAILURE: iwhd is stopped Checking postgresql ... Success: postmaster (pid 12515) is running... Checking httpd ... Success: httpd (pid 12536) is running... Checking deltacloud-core ... Success: deltacloud-core (pid 12570) is running... Checking libvirtd ... Success: libvirtd (pid 12588) is running... Checking aeolus-conductor ... Success: thin (pid 12648) is running... Checking conductor-dbomatic ... Success: dbomatic (pid 12692) is running... Checking imagefactory ... Success: imagefactory (pid 12709) is running... Checking ntpd ... Success: ntpd (pid 23188) is running... [root@dell-pe2950-1 ~]# aeolus-restart-services Stoping ntpd ... Success: Shutting down ntpd: [ OK ] Stoping imagefactory ... Success: Stopping imagefactory: [ OK ] Stoping conductor-dbomatic ... [ OK ]: Shutting down conductor-dbomatic: [ OK ] Stoping aeolus-conductor ... Success: Shutting down thin: [ OK ] Stoping libvirtd ... Success: Stopping libvirtd daemon: [ OK ] Stoping deltacloud-core ... Success: Shutting down deltacloud-core: [ OK ] Stoping httpd ... Success: Stopping httpd: [ OK ] Stoping postgresql ... Success: Stopping postgresql service: [ OK ] Stoping iwhd ... [ OK ]: Stopping iwhd daemon: [FAILED] Stoping mongod ... Success: Starting mongod ... Success: Starting mongod: [ OK ] Starting iwhd ... FAILURE: waiting for mongod to listen on localhost:27017[FAILED] Starting postgresql ... Success: Starting postgresql service: [ OK ] Starting httpd ... Success: Starting httpd: [ OK ] Starting deltacloud-core ... Success: Starting deltacloud-core-mock: [ OK ] Starting libvirtd ... Success: Starting libvirtd daemon: [ OK ] Starting aeolus-conductor ... Success: Starting thin: [ OK ] Starting conductor-dbomatic ... [ OK ]: Starting conductor-dbomatic: [ OK ] Starting imagefactory ... Success: Starting imagefactory: [ OK ] Starting ntpd ... Success: Starting ntpd: [ OK ] Statusing mongod ... FAILURE: mongod dead but subsys locked ------------ >> aeolus-restart-services Stoping ntpd ... Success: Shutting down ntpd: [ OK ] Stoping imagefactory ... Success: Stopping imagefactory: [ OK ] Stoping conductor-dbomatic ... [ OK ]: Shutting down conductor-dbomatic: [ OK ] Stoping aeolus-conductor ... Success: Shutting down thin: [ OK ] Stoping libvirtd ... Success: Stopping libvirtd daemon: [ OK ] Stoping deltacloud-core ... Success: Shutting down deltacloud-core: [ OK ] Stoping httpd ... Success: Stopping httpd: [ OK ] Stoping postgresql ... Success: Stopping postgresql service: [ OK ] Stoping iwhd ... [ OK ]: Stopping iwhd daemon: [FAILED] Stoping mongod ... Success: Starting mongod ... Success: Starting mongod: [ OK ] Starting iwhd ... FAILURE: waiting for mongod to listen on localhost:27017[FAILED] Starting postgresql ... Success: Starting postgresql service: [ OK ] Starting httpd ... Success: Starting httpd: [ OK ] Starting deltacloud-core ... Success: Starting deltacloud-core-mock: [ OK ] Starting libvirtd ... Success: Starting libvirtd daemon: [ OK ] Starting aeolus-conductor ... Success: Starting thin: [ OK ] Starting conductor-dbomatic ... [ OK ]: Starting conductor-dbomatic: [ OK ] Starting imagefactory ... Success: Starting imagefactory: [ OK ] Starting ntpd ... Success: Starting ntpd: [ OK ] Statusing mongod ... FAILURE: mongod dead but subsys locked Expected results: mongod and iwhd should not fail to start Additional info: Copying chat comments for tracking .... <jim> rlandy: do you think you or akarol can reproduce the problem? If so, I'd like to attach to the mongod process with strace or gdb, in case it chokes again. <jim> rlandy: the iwhd log suggests that it ran for only 8 hours, starting on July 2
Created attachment 599171 [details] iwhd.log repro'd. iwhd.log attached
Created attachment 599172 [details] mongodb.log repro'd. mongodb.log attached
Oops, sorry, I missed this with being on PTO. Pete/Jim, thoughts?
Repro'd again. Seems to occur only when virt guest running conductor is not shut down cleanly. If guest is powered down and back up cleanly issue does not occur. I would suggest it's a low priority issue but the user is unable to perform key functionality.
I still think that the right place to fix this is in Mongo, and this is why. The root cause of the problem is that we (the Warehouse, and thus Conductor) cannot restart cleanly from a crash, because Mongo won't start. The problem is that operator is not in position to do fancy recovery stuff, if he/she exists at all. And our scripts should not be too specific to particularities of Mongo, e.g. renaming locks behind its back. For our application, Mongo should have a mode (a command line switch), which tells it to execute an unattended restart to the best of its abilities.
Reading the log with a closer attention and after looking at the source of aeolus-restart-services, it appears the following occurs: - aeolus-restart-services invokes perform(":start","mongod") - perform() succeeds! - aeolus-restart-services invokes iwhd, which reports that mongod failed So, mongod fails, but tricks aeolus-restart-services into thinking that succeeded, so the recovery is not attempted. But wait, there's more! The aeolus-restart-services removes the lock file in an attempt to correct the problem. Well, that part never works. Whoever wrote it forgot to test it. But even if we fix up perform(), or special-code a reliable perform() for Mongo, this is still incorrect. Instead the script should run "mongod --repair". See: http://www.mongodb.org/display/DOCS/Durability+and+Repair It's in Ruby, but looks pretty clear: https://github.com/aeolusproject/aeolus-configure/blob/master/bin/aeolus-restart-services Ronelle or Aaron, please try a couple of things, when the problem arises. First, run "service mongod start" by itself and let us know what happens. According to the log above, it should report success, but then mongod exits with log in comment #2. Assuming it transpires as I foresee, run "mongod --repair" naked, without service script. If it prints anything, copy it to the bug (but please no tar.gz this time). After that the system should continue to work. However, it might not, if a database incompatibility is introduced with the move from 1.0 to 1.0.1. I doubt it's the case (because otherwise it would not take you guys a week to find out), but let's verify.
Created attachment 600042 [details] 'service mongod start' results Here are the results: [root@rhel6 ~]# service mongod start Starting mongod: [ OK ] ### #[note] mongodb sees old lock file # attempts --repair but fails with exception. # mongodb-2012-07-24.log attached # Attempt manual repair. Output: ### [root@rhel6 mongodb]# mongod --repair Tue Jul 24 10:31:57 [initandlisten] MongoDB starting : pid=2929 port=27017 dbpath=/data/db/ 64-bit Tue Jul 24 10:31:57 [initandlisten] db version v1.8.2, pdfile version 4.5 Tue Jul 24 10:31:57 [initandlisten] git version: nogitversion Tue Jul 24 10:31:57 [initandlisten] build sys info: Linux x86-001.build.bos.redhat.com 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 BOOST_LIB_VERSION=1_41 Tue Jul 24 10:31:57 [initandlisten] finished checking dbs Tue Jul 24 10:31:57 dbexit: Tue Jul 24 10:31:57 [initandlisten] shutdown: going to close listening sockets... Tue Jul 24 10:31:57 [initandlisten] shutdown: going to flush diaglog... Tue Jul 24 10:31:57 [initandlisten] shutdown: going to close sockets... Tue Jul 24 10:31:57 [initandlisten] shutdown: waiting for fs preallocator... Tue Jul 24 10:31:57 [initandlisten] shutdown: closing all files... Tue Jul 24 10:31:57 closeAllFiles() finished Tue Jul 24 10:31:57 [initandlisten] shutdown: removing fs lock... Tue Jul 24 10:31:57 dbexit: really exiting now [root@rhel6 mongodb]#
Thanks a lot, Aaron, exactly what we needed. So basically the root issue is that Mongo's service lies to us about the successful start. I could throw this bug over to Nathan, see what he thinks. Maybe he can fix it. It isn't trivial: server daemonizes and exists, now what? The only way to make sure is to run that connection to DB and wait if it times out (the same way iwhd does the testing, so the fragment of the start-up script may be reused).
Instructions for recovering mongodb from a crash http://documentation-stage.bne.redhat.com/docs/en-US/CloudForms/1.0/html/Cloud_Engine_User_Guide/appe-User_Documentation-iwhd.html Recovery from database failure If CloudForms Cloud Engine fails to load after a host reboot, check the MongoDB service's status. # service mongod status MongoDB is a database service that stores attributes and metadata for Image Warehouse. In the event of an improper system shut down, MongoDB can fail to start on boot due to the presence of a database lockfile. Systems affected include the following information in the mongo log (/var/log/mongodb/mongodb.log) old lock file: /var/lib/mongodb/mongod.lock. probably means unclean shutdown recommend removing file and running --repair see: http://dochub.mongodb.org/core/repair for more information If MongoDB refuses to start on reboot, remove the database lock using the following command: # rm -f /var/lib/mongodb/mongod.lock # mongod --repair Reboot the host and access CloudForms Cloud Engine via the web interface.
Here's the relevant code from iwhd that I hope is useful to wait for the Mongo daemon to come up: MONGOD_SERVER_SPEC=localhost:27017 wait_for_mongod() { # Wait for up to $1 seconds for mongod to begin listening. wait_for .1 $(($1 * 10)) 'mongo $MONGOD_SERVER_SPEC \ < /dev/null >/dev/null 2>&1' } printf %s $"waiting for mongod to listen on $MONGOD_SERVER_SPEC" wait_for_mongod $MONGOD_N_SECONDS && echo_success \ || { echo_failure; echo; return 1; } echo Remember that RHEL 6 is pre-systemd system. The above goes verbatim into /etc/rc.d script.
FYI: https://admin.fedoraproject.org/updates/FEDORA-EPEL-2012-6574/mongodb-2.0.6-2.el6
Please let me know if the above update fixes your problem.
Asking for a re-test with mongodb-2.0.6-2.el6 out of band. Nathaniel also found a problem with my proposal in comment #12: the startup scripts do not really know where to find the socket of the server they just started. The configuration may be arbitrarily moved out of the way by means of /etc/sysconfig/mongod, for one.
If I understand well, this problem was done because mongod reported it is started. But the preallocation was in progress and mongod wasn't able accept connections. This should be now fixed in el6. If this issue is still actual feel free to reopen.
mongodb-2.4.12-3.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/mongodb-2.4.12-3.el6
Package mongodb-2.4.12-3.el6: * should fix your issue, * was pushed to the Fedora EPEL 6 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=epel-testing mongodb-2.4.12-3.el6' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-EPEL-2015-0744/mongodb-2.4.12-3.el6 then log in and leave karma (feedback).
mongodb-2.4.12-3.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.