Bug 1478171

Summary: mongod cluster will not start
Product: Red Hat OpenStack
Component: mongodb
Version: 10.0 (Newton)
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Reporter: Jeremy <jmelvin>
Assignee: Flavio Percoco <fpercoco>
QA Contact: Shai Revivo <srevivo>
CC: asoni, fpercoco, jmelvin, pkilambi, srevivo
Keywords: Unconfirmed
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-10-03 15:23:30 UTC

Description Jeremy 2017-08-03 19:16:50 UTC
Description of problem: During an overcloud update, mongod will not start. Running systemctl restart mongod brings the service up for about 30 seconds, then it returns to the failed state. The logs complain of errors connecting to the other nodes...


Version-Release number of selected component (if applicable):
installed-rpms |grep mongo
mongodb-2.6.11-1.el7ost.x86_64                              Wed Feb  1 07:03:10 2017
mongodb-server-2.6.11-1.el7ost.x86_64                       Wed Feb  1 07:03:02 2017
puppet-mongodb-0.16.0-1.el7ost.noarch                       Wed Feb  1 07:09:09 2017
python-pymongo-3.0.3-1.el7ost.x86_64                        Wed Feb  1 06:47:53 2017



Additional info:

tail var/log/mongodb/mongodb.log  from controller 0.

2017-08-02T09:08:37.756-0700 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2017-08-02T09:08:38.756-0700 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2017-08-02T09:08:39.756-0700 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2017-08-02T09:08:40.756-0700 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2017-08-02T09:08:41.666-0700 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-08-02T09:08:41.666-0700 [signalProcessingThread] now exiting
2017-08-02T09:08:41.666-0700 [signalProcessingThread] dbexit: 
2017-08-02T09:08:41.666-0700 [signalProcessingThread] shutdown: going to close listening sockets...
2017-08-02T09:08:41.666-0700 [signalProcessingThread] closing listening socket: 11
2017-08-02T09:08:41.666-0700 [signalProcessingThread] shutdown: going to flush diaglog...
2017-08-02T09:08:41.666-0700 [signalProcessingThread] shutdown: going to close sockets...
2017-08-02T09:08:41.667-0700 [signalProcessingThread] shutdown: waiting for fs preallocator...
2017-08-02T09:08:41.667-0700 [signalProcessingThread] shutdown: lock for final commit...
2017-08-02T09:08:41.667-0700 [signalProcessingThread] shutdown: final commit...
2017-08-02T09:08:41.667-0700 [signalProcessingThread] shutdown: closing all files...
2017-08-02T09:08:41.667-0700 [signalProcessingThread] closeAllFiles() finished
2017-08-02T09:08:41.667-0700 [signalProcessingThread] journalCleanup...
2017-08-02T09:08:41.667-0700 [signalProcessingThread] removeJournalFiles
2017-08-02T09:08:41.668-0700 [signalProcessingThread] shutdown: removing fs lock...
2017-08-02T09:08:41.668-0700 [signalProcessingThread] dbexit: really exiting now
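
The excerpt above shows two things: repeated EMPTYCONFIG messages from the rsStart thread, and a shutdown triggered by signal 15 (SIGTERM), meaning the process was stopped from outside rather than crashing on its own. A minimal Python sketch for pulling those two symptoms out of a longer log, assuming the line layout seen here (timestamp, bracketed thread name, message); the function name `summarize_mongod_log` is hypothetical:

```python
import re
from collections import Counter

# Matches lines such as:
# 2017-08-02T09:08:37.756-0700 [rsStart] replSet can't get ... (EMPTYCONFIG)
LINE_RE = re.compile(r"^(?P<ts>\S+) \[(?P<thread>[^\]]+)\] ?(?P<msg>.*)$")


def summarize_mongod_log(lines):
    """Count messages per thread and flag the two symptoms seen in this report."""
    threads = Counter()
    empty_config = 0
    terminated_by_signal = False
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # lines without a timestamp/thread prefix are skipped
        threads[match.group("thread")] += 1
        message = match.group("msg")
        if "EMPTYCONFIG" in message:
            empty_config += 1
        if "got signal 15" in message:
            terminated_by_signal = True
    return {
        "threads": dict(threads),
        "empty_config": empty_config,
        "terminated_by_signal": terminated_by_signal,
    }
```

It could be called as `summarize_mongod_log(open("var/log/mongodb/mongodb.log"))` against the log from each controller to compare the nodes.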

Comment 3 Jeremy 2017-08-07 18:14:38 UTC
mongod --repair doesn't seem to work.

[root@controller-1 mongodb]# mongod --repair --dbpath /var/lib/mongodb/
2017-08-07T11:08:15.472-0700 [initandlisten] MongoDB starting : pid=784584 port=27017 dbpath=/var/lib/mongodb/ 64-bit host=controller-1.localdomain
2017-08-07T11:08:15.472-0700 [initandlisten] 
2017-08-07T11:08:15.472-0700 [initandlisten] ** WARNING: You are running on a NUMA machine.
2017-08-07T11:08:15.472-0700 [initandlisten] **          We suggest launching mongod like this to avoid performance problems:
2017-08-07T11:08:15.472-0700 [initandlisten] **              numactl --interleave=all mongod [other options]
2017-08-07T11:08:15.472-0700 [initandlisten] 
2017-08-07T11:08:15.472-0700 [initandlisten] db version v2.6.11
2017-08-07T11:08:15.472-0700 [initandlisten] git version: nogitversion
2017-08-07T11:08:15.472-0700 [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2017-08-07T11:08:15.472-0700 [initandlisten] build info: Linux x86-021.build.eng.bos.redhat.com 2.6.32-573.12.1.el6.x86_64 #1 SMP Mon Nov 23 12:55:32 EST 2015 x86_64 BOOST_LIB_VERSION=1_53
2017-08-07T11:08:15.472-0700 [initandlisten] allocator: tcmalloc
2017-08-07T11:08:15.472-0700 [initandlisten] options: { repair: true, storage: { dbPath: "/var/lib/mongodb/" } }
2017-08-07T11:08:15.473-0700 [initandlisten] 
2017-08-07T11:08:15.473-0700 [initandlisten] ** WARNING: Readahead for /var/lib/mongodb/ is set to 4096KB
2017-08-07T11:08:15.473-0700 [initandlisten] **          We suggest setting it to 256KB (512 sectors) or less
2017-08-07T11:08:15.473-0700 [initandlisten] **          http://dochub.mongodb.org/core/readahead
**************
Error: journal files are present in journal directory, yet starting without journaling enabled.
It is recommended that you start with journaling enabled so that recovery may occur.
**************
2017-08-07T11:08:15.473-0700 [initandlisten] exception in initAndListen: 13597 can't start without --journal enabled when journal/ files are present, terminating
2017-08-07T11:08:15.473-0700 [initandlisten] dbexit: 
2017-08-07T11:08:15.473-0700 [initandlisten] shutdown: going to close listening sockets...
2017-08-07T11:08:15.473-0700 [initandlisten] shutdown: going to flush diaglog...
2017-08-07T11:08:15.473-0700 [initandlisten] shutdown: going to close sockets...
2017-08-07T11:08:15.473-0700 [initandlisten] shutdown: waiting for fs preallocator...
2017-08-07T11:08:15.473-0700 [initandlisten] shutdown: closing all files...
2017-08-07T11:08:15.473-0700 [initandlisten] closeAllFiles() finished
2017-08-07T11:08:15.473-0700 [initandlisten] shutdown: removing fs lock...
2017-08-07T11:08:15.473-0700 [initandlisten] dbexit: really exiting now
[root@controller-1 mongodb]# systemctl restart mongod
Job for mongod.service failed because the control process exited with error code. See "systemctl status mongod.service" and "journalctl -xe" for details.
[root@controller-1 mongodb]# systemctl status mongod
● mongod.service - High-performance, schema-free document-oriented database
   Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2017-08-07 11:08:31 PDT; 10s ago
  Process: 785672 ExecStart=/usr/bin/mongod $OPTIONS run (code=exited, status=100)
 Main PID: 874396 (code=exited, status=0/SUCCESS)

Aug 07 11:08:30 controller-1.localdomain systemd[1]: Starting High-performance, schema-free document-oriented database...
Aug 07 11:08:30 controller-1.localdomain mongod[785672]: about to fork child process, waiting until server is ready for connections.
Aug 07 11:08:30 controller-1.localdomain mongod[785672]: forked process: 785674
Aug 07 11:08:31 controller-1.localdomain mongod[785672]: ERROR: child process failed, exited with error number 100
Aug 07 11:08:31 controller-1.localdomain systemd[1]: mongod.service: control process exited, code=exited status=100
Aug 07 11:08:31 controller-1.localdomain systemd[1]: Failed to start High-performance, schema-free document-oriented database.
Aug 07 11:08:31 controller-1.localdomain systemd[1]: Unit mongod.service entered failed state.
Aug 07 11:08:31 controller-1.localdomain systemd[1]: mongod.service failed.
[root@controller-1 mongodb]#

Comment 4 Flavio Percoco 2017-08-08 15:51:30 UTC
Greetings,

could you please retry the repair command with `--journal`?

Looking at the `/etc` directories in the sosreports, I can't find the MongoDB configuration files. I would have expected to see `/etc/mongod.conf`.

Comment 5 Jeremy 2017-08-08 17:33:26 UTC
[root@controller-0 ~]# mongod --repair --journal --dbpath  /var/lib/mongodb/
BadValue Can't have journaling enabled when using --repair option.
try 'mongod --help' for more information
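
The two repair attempts fail for opposite reasons: without `--journal`, mongod refuses to start because journal files are present (error 13597), and with `--journal` it rejects the combination with `--repair`. A small Python sketch for checking which nodes are in that state before trying a repair, assuming the journal lives in a `journal` subdirectory of the dbpath as the error message suggests; the helper name `journal_files` is hypothetical:

```python
from pathlib import Path


def journal_files(dbpath):
    """Return the sorted names of regular files found in <dbpath>/journal."""
    journal_dir = Path(dbpath) / "journal"
    if not journal_dir.is_dir():
        return []
    return sorted(p.name for p in journal_dir.iterdir() if p.is_file())
```

For example, `journal_files("/var/lib/mongodb")` returns an empty list on a node with no leftover journal files.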

-----------------------controller-0--------------
[root@controller-0 ~]# 
[root@controller-0 ~]# cat  /etc/mongod.conf
# mongodb.conf - generated from Puppet
#where to log
logpath=/var/log/mongodb/mongodb.log
logappend=true
# Set this option to configure the mongod or mongos process to bind to and
# listen for connections from applications on this address.
# You may concatenate a list of comma separated values to bind mongod to multiple IP addresses.
bind_ip = 192.168.10.17
# fork and run in background
fork=true
dbpath=/var/lib/mongodb
# location of pidfile
pidfilepath=/var/run/mongodb/mongod.pid
# Turn on/off security.  Off is currently the default
noauth=true
# Configure ReplicaSet membership
replSet = tripleo
[root@controller-0 ~]# 
---------------controller-1------
[root@controller-1 ~]# cat  /etc/mongod.conf
# mongodb.conf - generated from Puppet
#where to log
logpath=/var/log/mongodb/mongodb.log
logappend=true
# Set this option to configure the mongod or mongos process to bind to and
# listen for connections from applications on this address.
# You may concatenate a list of comma separated values to bind mongod to multiple IP addresses.
bind_ip = 192.168.10.21
# fork and run in background
fork=true
dbpath=/var/lib/mongodb
# location of pidfile
pidfilepath=/var/run/mongodb/mongod.pid
# Turn on/off security.  Off is currently the default
noauth=true
# Configure ReplicaSet membership
replSet = tripleo
[root@controller-1 ~]# 
-----------------------------------controller-2----------
[root@controller-2 ~]# cat  /etc/mongod.conf
# mongodb.conf - generated from Puppet
#where to log
logpath=/var/log/mongodb/mongodb.log
logappend=true
# Set this option to configure the mongod or mongos process to bind to and
# listen for connections from applications on this address.
# You may concatenate a list of comma separated values to bind mongod to multiple IP addresses.
bind_ip = 192.168.10.13
# fork and run in background
fork=true
dbpath=/var/lib/mongodb
# location of pidfile
pidfilepath=/var/run/mongodb/mongod.pid
# Turn on/off security.  Off is currently the default
noauth=true
# Configure ReplicaSet membership
replSet = tripleo
[root@controller-2 ~]#
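
The three files above appear to differ only in `bind_ip`. A minimal Python sketch to confirm that mechanically, assuming the old `key=value` config format shown here; the function names `parse_mongod_conf` and `differing_keys` are hypothetical:

```python
def parse_mongod_conf(text):
    """Parse key=value lines into a dict, skipping comments and other lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # also skips pasted shell prompts, which contain no "="
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def differing_keys(configs):
    """Return the sorted keys whose values are not identical on every node.

    `configs` maps a node name to the dict returned by parse_mongod_conf.
    """
    all_keys = set()
    for conf in configs.values():
        all_keys.update(conf)
    return sorted(
        key
        for key in all_keys
        if len({conf.get(key) for conf in configs.values()}) > 1
    )
```

Feeding it the three pasted files, keyed by controller name, should list `bind_ip` as the only differing key if the reading above is right.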

Comment 9 Flavio Percoco 2017-08-11 08:20:26 UTC
We had a Bomgar session on August 10th at 14:00 UTC to debug the issue on the environment and find a solution for it.

The findings of the investigation were:

* The oplog collection file was corrupted.
* Some of the ceilometer.events files were corrupted.
* The data on controller-0 and controller-1 was inconsistent with the data on controller-2.
* Every node had journal files present, which suggests there were crashes on every node.
* controller-1 had journal entries for missing database files. This suggests there was a manual intervention on the data directory.

The corruption of the database files seems to have been caused by failed repair operations, failed journal recoveries and/or a manual intervention on the database. This can also happen after a storage failure or a network failure during the replication process. Based on the data, it seems that controller-2 was the PRIMARY node before the crash.

The customer said that the ceilometer data in the nodes was not important for them and that it would be ok for us to wipe it and re-configure the replica set. So I did.

The new replica set uses the same configuration as the old one, but the data was completely erased. A backup, created by the customer before the Bomgar session, can be found under `/tmp/mongobackup`.

The environment is back to a working state now.
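
For reference, a replica set is described by a configuration document that names the set and lists its members. The Python sketch below builds such a document from the values seen earlier in this report (`replSet = tripleo` and the three `bind_ip` addresses). It is an illustration only, not a record of the commands run during the session; the helper name `build_replset_config` is hypothetical, and port 27017 is an assumption taken from the repair log output.

```python
def build_replset_config(name, addresses, port=27017):
    """Build a replica set configuration document with one member per address."""
    return {
        "_id": name,
        "members": [
            {"_id": index, "host": f"{address}:{port}"}
            for index, address in enumerate(addresses)
        ],
    }


config = build_replset_config(
    "tripleo",
    ["192.168.10.17", "192.168.10.21", "192.168.10.13"],
)

# With pymongo, a document like this could be passed to the replSetInitiate
# command on one node, for example:
#   MongoClient("192.168.10.17", 27017).admin.command("replSetInitiate", config)
```

In a TripleO deployment the replica set is normally set up by the Puppet-driven configuration, so this is mainly useful for checking what the resulting configuration should look like.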

Comment 12 Red Hat Bugzilla 2023-09-14 04:02:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days