Bug 838633 - Upgraded aeolus from 1.0 to 1.0.1, kept running for a week, results in FAILURE: mongod dead but subsys locked
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: mongodb
Version: el6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Marek Skalický
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-07-09 16:23 UTC by Ronelle Landy
Modified: 2015-03-08 22:47 UTC
CC: 7 users

Fixed In Version: mongodb-2.4.12-3.el6
Clone Of:
Environment:
Last Closed: 2015-03-08 22:47:35 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
iwhd.log (303.62 KB, application/x-compressed-tar)
2012-07-19 14:41 UTC, Aaron Weitekamp
mongodb.log (634 bytes, application/x-compressed-tar)
2012-07-19 14:43 UTC, Aaron Weitekamp
'service mongod start' results (1.20 KB, text/x-log)
2012-07-24 14:41 UTC, Aaron Weitekamp

Description Ronelle Landy 2012-07-09 16:23:47 UTC
Description of problem:

Creating this BZ for tracking.

(From Aziza Karol) :
I upgraded aeolus from 1.0 to 1.0.1 and kept it running for a week. When I checked the services, I saw "FAILURE: mongod dead but subsys locked." iwhd is also stopped.


Version-Release number of selected component (if applicable):

rpms tested:

>> rpm -qa |grep aeolus
aeolus-configure-2.5.10-1.el6cf.noarch
aeolus-all-0.8.34-1.el6cf.noarch
aeolus-conductor-doc-0.8.34-1.el6cf.noarch
aeolus-conductor-daemons-0.8.34-1.el6cf.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
rubygem-aeolus-cli-0.3.3-2.el6_2.noarch
aeolus-conductor-0.8.34-1.el6cf.noarch

>> rpm -qa |grep iwhd
iwhd-1.5-2.el6.x86_64


How reproducible:
Retrying the test to see if it reproduces.

Steps to Reproduce:
1. Install aeolus 1.0
2. Launch instances
3. Upgrade to aeolus 1.0.1
4. Leave aeolus running for a week - or at least 8 hours
5. Run aeolus-check-services and then aeolus-restart-services

Actual results:

>>  aeolus-check-services

Checking mongod ...
 FAILURE: mongod dead but subsys locked

Checking iwhd ...
 FAILURE: iwhd is stopped

Checking postgresql ...
 Success: postmaster (pid  12515) is running...

Checking httpd ...
 Success: httpd (pid  12536) is running...

Checking deltacloud-core ...
 Success: deltacloud-core (pid  12570) is running...

Checking libvirtd ...
 Success: libvirtd (pid  12588) is running...

Checking aeolus-conductor ...
 Success: thin (pid  12648) is running...

Checking conductor-dbomatic ...
 Success: dbomatic (pid  12692) is running...

Checking imagefactory ...
 Success: imagefactory (pid  12709) is running...

Checking ntpd ...
 Success: ntpd (pid  23188) is running...
[root@dell-pe2950-1 ~]# aeolus-restart-services

Stoping ntpd ...
 Success: Shutting down ntpd: [  OK  ]

Stoping imagefactory ...
 Success: Stopping imagefactory: [  OK  ]

Stoping conductor-dbomatic ...
[  OK  ]: Shutting down conductor-dbomatic: [  OK  ]

Stoping aeolus-conductor ...
 Success: Shutting down thin: [  OK  ]

Stoping libvirtd ...
 Success: Stopping libvirtd daemon: [  OK  ]

Stoping deltacloud-core ...
 Success: Shutting down deltacloud-core: [  OK  ]

Stoping httpd ...
 Success: Stopping httpd: [  OK  ]

Stoping postgresql ...
 Success: Stopping postgresql service: [  OK  ]

Stoping iwhd ...
[  OK  ]: Stopping iwhd daemon:  [FAILED]

Stoping mongod ...
 Success: 

Starting mongod ...
 Success: Starting mongod: [  OK  ]

Starting iwhd ...
 FAILURE: waiting for mongod to listen on localhost:27017[FAILED]

Starting postgresql ...
 Success: Starting postgresql service: [  OK  ]

Starting httpd ...
 Success: Starting httpd: [  OK  ]

Starting deltacloud-core ...
 Success: Starting deltacloud-core-mock: [  OK  ]

Starting libvirtd ...
 Success: Starting libvirtd daemon: [  OK  ]

Starting aeolus-conductor ...
 Success: Starting thin: [  OK  ]

Starting conductor-dbomatic ...
[  OK  ]: Starting conductor-dbomatic: [  OK  ]

Starting imagefactory ...
 Success: Starting imagefactory: [  OK  ]

Starting ntpd ...
 Success: Starting ntpd: [  OK  ]

Statusing mongod ...
 FAILURE: mongod dead but subsys locked

------------

>> aeolus-restart-services

A second run produced output identical to the first, again ending with:

Statusing mongod ...
 FAILURE: mongod dead but subsys locked


Expected results:
mongod and iwhd should not fail to start

Additional info:

Copying chat comments for tracking ....

<jim> rlandy: do you think you or akarol can reproduce the problem?  If so, I'd like to attach to the mongod process with strace or gdb, in case it chokes again.

<jim> rlandy: the iwhd log suggests that it ran for only 8 hours, starting on July 2

Comment 1 Aaron Weitekamp 2012-07-19 14:41:30 UTC
Created attachment 599171 [details]
iwhd.log

repro'd. iwhd.log attached

Comment 2 Aaron Weitekamp 2012-07-19 14:43:01 UTC
Created attachment 599172 [details]
mongodb.log

repro'd. mongodb.log attached

Comment 3 jrd 2012-07-19 17:40:38 UTC
Oops, sorry, I missed this while on PTO.

Pete/Jim, thoughts?

Comment 4 Aaron Weitekamp 2012-07-20 15:58:14 UTC
Repro'd again. It seems to occur only when the virt guest running conductor is not shut down cleanly. If the guest is powered down and brought back up cleanly, the issue does not occur. I would suggest it's a low-priority issue, but the user is unable to perform key functionality.

Comment 5 Pete Zaitcev 2012-07-20 19:45:22 UTC
I still think that the right place to fix this is in Mongo, and this is why.
The root cause of the problem is that we (the Warehouse, and thus Conductor)
cannot restart cleanly from a crash, because Mongo won't start. The problem
is that the operator is not in a position to do fancy recovery stuff, if he/she
exists at all. And our scripts should not be too specific to the particularities
of Mongo, e.g. renaming locks behind its back. For our application, Mongo
should have a mode (a command-line switch) which tells it to execute an
unattended restart to the best of its abilities.

Comment 7 Pete Zaitcev 2012-07-24 01:04:04 UTC
Reading the log with closer attention and after looking at the source
of aeolus-restart-services, it appears the following occurs:
 - aeolus-restart-services invokes perform(":start","mongod")
 - perform() succeeds!
 - aeolus-restart-services invokes iwhd, which reports that mongod failed

So, mongod fails, but tricks aeolus-restart-services into thinking that it
succeeded, so the recovery is not attempted.

But wait, there's more! aeolus-restart-services removes the lock
file in an attempt to correct the problem. Well, that part never works.
Whoever wrote it forgot to test it. But even if we fix up perform(),
or special-case a reliable perform() for Mongo, this is still incorrect.
Instead the script should run "mongod --repair". See:
http://www.mongodb.org/display/DOCS/Durability+and+Repair

It's in Ruby, but looks pretty clear:
https://github.com/aeolusproject/aeolus-configure/blob/master/bin/aeolus-restart-services

Ronelle or Aaron, please try a couple of things, when the problem arises.

First, run "service mongod start" by itself and let us know what happens.
According to the log above, it should report success, but then mongod exits
with the log in comment #2.

Assuming it transpires as I foresee, run "mongod --repair" naked, without
the service script. If it prints anything, copy it into the bug (but please no
tar.gz this time). After that the system should continue to work. However,
it might not, if a database incompatibility was introduced with the move
from 1.0 to 1.0.1. I doubt that's the case (because otherwise it would not
have taken you guys a week to find out), but let's verify.

Comment 8 Aaron Weitekamp 2012-07-24 14:41:42 UTC
Created attachment 600042 [details]
'service mongod start' results

Here are the results:

[root@rhel6 ~]# service mongod start
Starting mongod:                                           [  OK  ]

###
#[note] mongodb sees old lock file
#        attempts --repair but fails with exception.
#        mongodb-2012-07-24.log attached
#        Attempt manual repair. Output:
###

[root@rhel6 mongodb]# mongod --repair
Tue Jul 24 10:31:57 [initandlisten] MongoDB starting : pid=2929 port=27017 dbpath=/data/db/ 64-bit 
Tue Jul 24 10:31:57 [initandlisten] db version v1.8.2, pdfile version 4.5
Tue Jul 24 10:31:57 [initandlisten] git version: nogitversion
Tue Jul 24 10:31:57 [initandlisten] build sys info: Linux x86-001.build.bos.redhat.com 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 BOOST_LIB_VERSION=1_41
Tue Jul 24 10:31:57 [initandlisten] finished checking dbs
Tue Jul 24 10:31:57 dbexit: 
Tue Jul 24 10:31:57 [initandlisten] shutdown: going to close listening sockets...
Tue Jul 24 10:31:57 [initandlisten] shutdown: going to flush diaglog...
Tue Jul 24 10:31:57 [initandlisten] shutdown: going to close sockets...
Tue Jul 24 10:31:57 [initandlisten] shutdown: waiting for fs preallocator...
Tue Jul 24 10:31:57 [initandlisten] shutdown: closing all files...
Tue Jul 24 10:31:57 closeAllFiles() finished
Tue Jul 24 10:31:57 [initandlisten] shutdown: removing fs lock...
Tue Jul 24 10:31:57 dbexit: really exiting now
[root@rhel6 mongodb]#

Comment 9 Pete Zaitcev 2012-07-24 21:57:50 UTC
Thanks a lot, Aaron, exactly what we needed.

So basically the root issue is that Mongo's service script lies to us about
a successful start.

I could throw this bug over to Nathan and see what he thinks. Maybe he
can fix it. It isn't trivial: the server daemonizes and exits, now what?
The only way to make sure is to try a connection to the DB and wait
to see if it times out (the same way iwhd does the testing, so that fragment
of the start-up script may be reused).

Comment 11 Aaron Weitekamp 2012-08-01 20:06:55 UTC
Instructions for recovering mongodb from a crash
http://documentation-stage.bne.redhat.com/docs/en-US/CloudForms/1.0/html/Cloud_Engine_User_Guide/appe-User_Documentation-iwhd.html

Recovery from database failure
If CloudForms Cloud Engine fails to load after a host reboot, check the MongoDB service's status.

# service mongod status

MongoDB is a database service that stores attributes and metadata for Image Warehouse. In the event of an improper system shutdown, MongoDB can fail to start on boot due to the presence of a database lockfile. Affected systems include the following information in the mongo log (/var/log/mongodb/mongodb.log):

old lock file: /var/lib/mongodb/mongod.lock.  probably means unclean shutdown
recommend removing file and running --repair
see: http://dochub.mongodb.org/core/repair for more information

If MongoDB refuses to start on reboot, remove the database lock and repair the database using the following commands:

# rm -f /var/lib/mongodb/mongod.lock
# mongod --repair

Reboot the host and access CloudForms Cloud Engine via the web interface.
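The recovery steps above can be wrapped in a small script. The following is a minimal sketch, assuming the default log and lock paths from the documentation excerpt; the helper names needs_repair and recover_mongod are hypothetical, not part of any shipped tool:

```shell
#!/bin/sh
# Hypothetical helper: check whether the mongo log contains the
# stale-lock message quoted above, which indicates an unclean shutdown.
needs_repair() {
    grep -q 'old lock file' "$1"
}

# Sketch of the documented recovery: remove the stale lock file, then
# run "mongod --repair". Paths are the defaults from the excerpt above.
recover_mongod() {
    log=${1:-/var/log/mongodb/mongodb.log}
    lock=${2:-/var/lib/mongodb/mongod.lock}
    if needs_repair "$log"; then
        rm -f "$lock"
        mongod --repair
    fi
}
```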

Comment 12 Pete Zaitcev 2012-08-01 23:23:09 UTC
Here's the relevant code from iwhd that I hope is useful to wait for
the Mongo daemon to come up:

MONGOD_SERVER_SPEC=localhost:27017

wait_for_mongod() {
    # Wait for up to $1 seconds for mongod to begin listening.
    wait_for .1 $(($1 * 10)) 'mongo $MONGOD_SERVER_SPEC \
      < /dev/null >/dev/null 2>&1'
}

    printf %s $"waiting for mongod to listen on $MONGOD_SERVER_SPEC"
    wait_for_mongod $MONGOD_N_SECONDS && echo_success \
        || { echo_failure; echo; return 1; }
    echo

Remember that RHEL 6 is a pre-systemd system. The above goes verbatim
into an /etc/rc.d script.
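The fragment above relies on a wait_for helper defined elsewhere in iwhd's init script. A self-contained sketch of such a polling loop, with the argument order inferred from the call site (delay, number of tries, command) rather than taken from iwhd's actual code, could look like:

```shell
#!/bin/sh
# Generic polling loop: retry CMD up to MAX_TRIES times, sleeping DELAY
# seconds between attempts, and succeed as soon as CMD does. This matches
# the call wait_for .1 $(($1 * 10)) '...' above, but the body here is an
# assumption, not iwhd's actual implementation.
wait_for() {
    delay=$1
    max_tries=$2
    cmd=$3
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        eval "$cmd" && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

With this in place, the fragment's `wait_for .1 $((N * 10)) 'mongo ...'` polls the mongo client every 0.1 seconds for up to N seconds.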

Comment 14 Nathaniel McCallum 2012-08-02 21:44:56 UTC
Please let me know if the above update fixes your problem.

Comment 15 Pete Zaitcev 2012-08-02 21:54:47 UTC
Asking for a re-test with mongodb-2.0.6-2.el6 out of band.

Nathaniel also found a problem with my proposal in comment #12:
the startup scripts do not really know where to find the socket of
the server they just started. The configuration may be arbitrarily
moved out of the way by means of /etc/sysconfig/mongod, for one.

Comment 16 Marek Skalický 2015-02-11 14:15:29 UTC
If I understand correctly, this problem occurred because mongod reported that it had started, but preallocation was still in progress and mongod wasn't yet able to accept connections.

This should be now fixed in el6.

If this issue is still present, feel free to reopen.

Comment 17 Fedora Update System 2015-02-11 14:43:38 UTC
mongodb-2.4.12-3.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/mongodb-2.4.12-3.el6

Comment 18 Fedora Update System 2015-02-14 02:52:46 UTC
Package mongodb-2.4.12-3.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing mongodb-2.4.12-3.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2015-0744/mongodb-2.4.12-3.el6
then log in and leave karma (feedback).

Comment 19 Fedora Update System 2015-03-08 22:47:35 UTC
mongodb-2.4.12-3.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

