Description of problem:
Occasionally during CI test runs, the mongodb-2.2 embedded cart erroneously spawns multiple processes during a call to the restart hook. The mongodb_ctl.sh script uses timing-sensitive logic for process detection, which leaves the potential for a false-negative check and stacked mongod invocations. The exact conditions that reproduce this are not yet known, but it has been observed in recent fork AMI test executions. An analysis of the logs revealed the call to the restart hook and the subsequent bogus processes (3 mongod procs where 1 was expected post-restart).

Version-Release number of selected component (if applicable):

How reproducible:
Unknown at this time.

Steps to Reproduce:
N/A

Actual results:
More than one mongod process following an app restart.

Expected results:
A single mongod process.

Additional info:
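The timing-sensitive check can be illustrated with a minimal sketch. This is hypothetical code, not the actual mongodb_ctl.sh: a restart that takes a single snapshot of the process table can see a false negative while the old daemon is still exiting and spawn a second one on top; polling until the old pid is really gone avoids the race. A `sleep` process stands in for mongod.

```shell
#!/bin/bash
# Minimal sketch of the restart race (names and logic are hypothetical,
# not the real mongodb_ctl.sh). A sleeping process plays mongod.

PIDFILE=$(mktemp)

start_daemon() {
    sleep 60 &
    echo $! > "$PIDFILE"
}

# Racy check: one snapshot of process state. If it runs in the window
# while the old daemon is mid-shutdown, callers can draw the wrong
# conclusion and stack a second daemon.
is_running_racy() {
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

# Safer restart: signal the old pid, then poll until it has actually
# exited before starting the replacement, so only one daemon exists.
restart_safe() {
    local pid
    pid=$(cat "$PIDFILE")
    kill "$pid" 2>/dev/null
    for _ in $(seq 1 50); do
        kill -0 "$pid" 2>/dev/null || break
        sleep 0.1
    done
    start_daemon
}

start_daemon
restart_safe
# Exactly one daemon should remain after the restart.
kill -0 "$(cat "$PIDFILE")" && echo "one daemon running"
```

The key difference from a bare `pgrep`-style check is that the safe path waits on the specific old pid rather than pattern-matching the process table at one instant.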
Created attachment 641921 [details]
Cucumber logs demonstrating the behavior

I've attached a portion of the cucumber logs showing the before-and-after state of the process table during a mongo restart that exhibits the problematic behavior.
Unable to reproduce the issue. One possible scenario is repair getting called. Even if three processes are spawned, only one will bind to the port.
Marking this low since we can't reproduce it.
Investigated further -

[pid 20261] 1354144886.641274 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.060418>
[pid 20261] 1354144886.641316 futex(0x11bb240, FUTEX_WAIT_PRIVATE, 2, {0, 124225760} <unfinished ...>
[pid 20263] 1354144886.683156 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.101931>
[pid 20263] 1354144886.683198 futex(0x11bb240, FUTEX_WAIT_PRIVATE, 2, {0, 50865920}) = -1 ETIMEDOUT (Connection timed out) <0.051009>
[pid 20263] 1354144886.734285 futex(0x11bb240, FUTEX_WAIT_PRIVATE, 2, {0, 71312592} <unfinished ...>
[pid 20260] 1354144886.736554 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.237626>
[pid 20260] 1354144886.736595 futex(0x11bb240, FUTEX_WAIT_PRIVATE, 2, {0, 258809056} <unfinished ...>
[pid 20261] 1354144886.765659 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.124319>
[pid 20261] 1354144886.765713 futex(0x11bb240, FUTEX_WAIT_PRIVATE, 2, {0, 181012160}^C <unfinished ...>

(gdb) bt
#0  0x000000000098e650 in sys_futex (w=0x11bb240, value=2, loop=<value optimized out>) at src/third_party/gperftools-2.0/src/base/linux_syscall_support.h:1789
#1  base::internal::SpinLockDelay (w=0x11bb240, value=2, loop=<value optimized out>) at src/third_party/gperftools-2.0/src/base/spinlock_linux-inl.h:87
#2  0x000000000098e43c in SpinLock::SlowLock (this=0x11bb240) at src/third_party/gperftools-2.0/src/base/spinlock.cc:132
#3  0x00000000009a37f3 in Lock () at src/third_party/gperftools-2.0/src/base/spinlock.h:75
#4  SpinLockHolder () at src/third_party/gperftools-2.0/src/base/spinlock.h:141
#5  tcmalloc::ThreadCache::CreateCacheIfNecessary () at src/third_party/gperftools-2.0/src/thread_cache.cc:341
#6  0x000000000099d9fa in GetCache (size=8) at src/third_party/gperftools-2.0/src/thread_cache.h:389
#7  (anonymous namespace)::do_malloc (size=8) at src/third_party/gperftools-2.0/src/tcmalloc.cc:1058
#8  0x0000000000a3de3c in do_malloc_or_cpp_alloc (size=8) at src/third_party/gperftools-2.0/src/tcmalloc.cc:1005
#9  tc_malloc (size=8) at src/third_party/gperftools-2.0/src/tcmalloc.cc:1492
#10 0x0000003c5a60ff2a in boost::detail::get_once_per_thread_epoch() () from /usr/lib64/libboost_thread-mt.so.5

There is a bug filed against mongo upstream: https://jira.mongodb.org/browse/SERVER-7434
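The backtrace points at an allocator-lock hang: boost's get_once_per_thread_epoch() calls malloc, tcmalloc's thread-cache creation takes a spinlock, and the threads end up spinning on the same futex address forever. A sketch of how one might confirm from the outside that a process is wedged like this; the `strace`/`gdb` invocations are standard usage, and a stand-in sleeping process is used here so the `/proc` check can actually run (against the real hang you would use the wedged mongod's pid, e.g. 20260 above).

```shell
# Stand-in for the wedged process: a sleeper blocked in the kernel.
sleep 30 &
PID=$!
sleep 0.2   # give it a moment to block

# Capturing the live signature seen above -- every thread timing out
# on the same futex address -- would look like:
#   strace -f -ttt -T -p "$PID"
# and the stack (SpinLock::SlowLock under tc_malloc) like:
#   gdb -p "$PID" -batch -ex 'thread apply all bt'

# Even without attaching a tracer, /proc shows the task is asleep in
# the kernel rather than making progress:
STATE=$(grep '^State:' "/proc/$PID/status")
echo "$STATE"

kill "$PID"
```

For a genuinely hung process the state alone is not conclusive (sleeping is normal for an idle daemon); it is the combination with the repeating futex timeouts in strace that makes the diagnosis.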
Will try to compile with a flag to avoid the issue. However, we have no definitive test to confirm that it is fixed.
The rpm compiled with the flag did not resolve the issue. We will have to wait for upstream to fix this.
*** Bug 906654 has been marked as a duplicate of this bug. ***
The bug has been fixed in mongo 2.2.3 https://jira.mongodb.org/browse/SERVER/fixforversion/12088
(In reply to comment #8)
> The bug has been fixed in mongo 2.2.3
> https://jira.mongodb.org/browse/SERVER/fixforversion/12088

A FutureFeature keyword was added to the bug on 2012/12/12. Will this still be a FutureFeature bug, given that we have already upgraded mongo to 2.2.2?

Thanks
-xiaoli
The version of mongodb in devenv_2772 is 2.2.2, and hence this bug was sometimes reproducible. Specifically, for "snapshot and restore" cases, most of them failed with the following error message during restoration:

couldn't connect to [127.0.250.129:27017] couldn't connect to server 127.0.250.129:27017
============================================================================
 WARNING: You may have possibly encountered the mongorestore bugs
 related to MongoDB JIRA issues 7181, 7262 and 7104. We tried
 working around some these issues. You will need to manually
 workaround the remaining problems prior to proceeding. For more
 details, see:
 https://jira.mongodb.org/browse/SERVER-7181
 https://jira.mongodb.org/browse/SERVER-7262
 https://jira.mongodb.org/browse/SERVER-7104
============================================================================
MongoDB already running
!WARNING! Could not restore MongoDB databases - mongorestore failed!

Moreover, when connecting to mongodb after the (failed) restoration, the error reported was:

Error: couldn't connect to server 127.0.252.1:27017 src/mongo/shell/mongo.js:93
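The "couldn't connect to server" failures suggest the restore step raced ahead of mongod accepting connections. A hedged sketch of gating the restore on the port being reachable; the helper name is hypothetical and this is not the actual cartridge code, just an illustration using bash's /dev/tcp pseudo-device.

```shell
# Hypothetical helper: poll until a TCP port accepts connections, or
# give up after a number of one-second attempts.
wait_for_port() {
    local host=$1 port=$2 tries=${3:-30}
    local i
    for ((i = 0; i < tries; i++)); do
        # bash's /dev/tcp redirection attempts a real TCP connect;
        # the subshell closes the socket again on exit
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Usage sketch (host/port taken from the log above):
#   wait_for_port 127.0.250.129 27017 \
#       || { echo "mongod never came up"; exit 1; }
#   mongorestore ...
```

Waiting on the listening socket rather than on the mongod process existing is the point: as noted earlier in this bug, several processes can exist while only one binds the port.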
I'm removing the FutureFeature keyword for now to request evaluation again since 2.2.2 is used in this sprint.
Since bug 906654 was marked as a duplicate of this bug, updating the latest comments for mongodb on devenv_2772 here: if we force-stop the app before adding mongodb, the cartridge is added successfully, but connecting to the mongo shell fails as below:

[app2-dev2772dmt1.dev.rhcloud.com ~]\> mongo
MongoDB shell version: 2.2.2
connecting to: 127.1.0.129:27017/admin
Tue Feb 5 00:56:05 Error: couldn't connect to server 127.1.0.129:27017 src/mongo/shell/mongo.js:93
exception: connect failed
Verified on devenv_2778

The multi-process issue is now gone; note the mongo shell version is 2.2.3:

[php1-2778t.dev.rhcloud.com ~]\> mongo
MongoDB shell version: 2.2.3
connecting to: 127.0.252.129:27017/admin
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
    http://docs.mongodb.org/
Questions? Try the support group
    http://groups.google.com/group/mongodb-user
>

Create an app and add mongodb to it, then ssh into the gear and run 'ps -ef'; there is only one mongod process running. Restart the app / restart mongodb and check again: this issue is no longer reproducible.

[php1-2778t.dev.rhcloud.com ~]\> ps -ef|grep mongo
505      19782     1  0 21:16 ?        00:00:00 /usr/bin/mongod --auth --nojournal --smallfiles --quiet -f /var/lib/openshift/5111ba9fdefdbcad6d000454//mongodb-2.2//etc/mongodb.conf run
505      20329 17959  0 21:16 pts/0    00:00:00 grep mongo

Moving to verified.