+++ This bug was initially created as a clone of Bug #1378472 +++

I am deploying a puddle from Sep 20th. Clean deployment. Overcloud deployment fails because MongoDB is trying to talk to itself and form/reassure a cluster when it's not yet ready to do so.

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary. (truncated, view all with --long)

overcloud.ComputeNodesPostDeployment.ComputeOvercloudServicesDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: d9489409-1bce-47e0-b11a-1a19e58234cf
  status: CREATE_FAILED

--- Additional comment from Jiri Stransky on 2016-09-22 09:56:30 EDT ---

We were hitting clustering issues with MongoDB upstream too and fixed them recently (linking bug and patch). The message we caught was slightly different:

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for replicaset tripleo.

But given that the issue is intermittent, it's possible that the underlying race condition might have manifested in slightly different ways, so perhaps it's the same bug. I'm not sure the fix will prevent the error above too, but given that we no longer hit MongoDB clustering issues upstream after that fix, I'm hopeful we won't hit them downstream either.

--- Additional comment from Marius Cornea on 2016-09-22 10:12:57 EDT ---

(In reply to Jiri Stransky from comment #1)
> We were hitting clustering issues with MongoDB upstream too and fixed them
> recently (linking bug and patch). The message we caught was slightly
> different:
>
> Error:
> /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/
> Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for
> replicaset tripleo.
> But given that the issue is intermittent, it's possible that the underlying
> race condition might have manifested in slightly different ways, so perhaps
> it's the same bug. I'm not sure the fix will prevent the error above too,
> but given that we no longer hit MongoDB clustering issues upstream after
> that fix, I'm hopeful we won't hit them downstream either.

From what I've seen on Lesik's setup, it failed with the following error on controller-0:

Debug: Executing '/bin/mongo admin --quiet --host 10.35.169.11:27017 --eval printjson(rs.add('10.35.169.10:27017'))'
Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary.

And this happened because 10.35.169.11 was a secondary replica, not the primary:

[root@overcloud-controller-0 heat-admin]# /bin/mongo admin --quiet --host 10.35.169.11:27017 <<<'rs.status()'
{
    "set" : "tripleo",
    "date" : ISODate("2016-09-22T11:16:23Z"),
    "myState" : 2,
    "syncingTo" : "10.35.169.16:27017",
    "members" : [
        {
            "_id" : 0,
            "name" : "10.35.169.11:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 398,
            "optime" : Timestamp(1474539604, 9),
            "optimeDate" : ISODate("2016-09-22T10:20:04Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "10.35.169.16:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 398,
            "optime" : Timestamp(1474539604, 9),
            "optimeDate" : ISODate("2016-09-22T10:20:04Z"),
            "lastHeartbeat" : ISODate("2016-09-22T11:16:21Z"),
            "lastHeartbeatRecv" : ISODate("2016-09-22T11:16:22Z"),
            "pingMs" : 0,
            "electionTime" : Timestamp(1474542593, 1),
            "electionDate" : ISODate("2016-09-22T11:09:53Z")
        }
    ],
    "ok" : 1
}

--- Additional comment from Jiri Stransky on 2016-09-22 10:42:10 EDT ---

Thanks for the additional info. I think the mongodb puppet module should have built-in logic to send this command to the primary node when adding a node into the cluster.
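For illustration, here is a rough sketch of the "send it to the primary" logic described above. This is not code from the puppet module; the host values and helper name are hypothetical, and the actual mongo invocations are left commented out since they need a reachable mongod:

```shell
#!/bin/sh
# Sketch only: host values and helper names are hypothetical examples.
SEED_HOST="10.35.169.11:27017"   # any reachable replica set member
NEW_MEMBER="10.35.169.10:27017"  # host we want to add to the replset

# Extract the "primary" field from printjson(rs.isMaster()) output.
extract_primary() {
    sed -n 's/.*"primary" : "\([^"]*\)".*/\1/p' | head -n 1
}

# In a real deployment (requires a reachable mongod), this would be:
# PRIMARY=$(/bin/mongo admin --quiet --host "$SEED_HOST" \
#               --eval 'printjson(rs.isMaster())' | extract_primary)
# /bin/mongo admin --quiet --host "$PRIMARY" \
#     --eval "printjson(rs.add('$NEW_MEMBER'))"
```

Asking any member for `rs.isMaster().primary` first, then issuing `rs.add()` against that host, would avoid the "replSetReconfig command must be sent to the current replica set primary" failure when the seed host happens to be a secondary.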
I'm still wondering if this might be caused by the same race condition that we tried to solve (and it looks like we solved it) by waiting for connectability. I'm hopeful but not certain :)

https://github.com/puppetlabs/puppetlabs-mongodb/blob/1cfb235894795f216ce3ae3fc02eb52d112e9197/lib/puppet/provider/mongodb_replset/mongo.rb#L107-L115
https://github.com/puppetlabs/puppetlabs-mongodb/blob/1cfb235894795f216ce3ae3fc02eb52d112e9197/lib/puppet/provider/mongodb_replset/mongo.rb#L233-L256

So I wonder if we're just trying to connect to the cluster while it's in some intermediate state, perhaps after a restart due to a config change or something like that. The interesting part about the upstream issue was that we intermittently couldn't find the master node during step 3, even though the MongoDB cluster had previously been successfully formed in step 2.

--- Additional comment from Marius Cornea on 2016-11-03 11:46:19 EDT ---

I'm sporadically seeing this issue when scaling out from 2 nodes running Mongo to 3 nodes, with different errors:

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for replicaset tripleo.

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary.

I checked the puppet-tripleo module and it contains the referenced patch.

--- Additional comment from Jiri Stransky on 2016-11-04 10:22:16 EDT ---

Would you have some logs of the failures? Possibly at least the os-collect-config log + heat resource-list, so that we can see which step failed and whether it was after a MongoDB restart (due to a config change, for example). Do we ever hit this on initial deployment nowadays, or only on updates? I suspect this issue will be very elusive, so any info that can help us narrow down when it happens is much appreciated.
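The "waiting for connectability" idea in the linked provider code can be illustrated with a small retry loop. This is a hedged sketch of the general pattern, not the module's actual implementation; the function name and parameters are made up, and in a real run the probe would wrap something like `/bin/mongo admin --quiet --host "$HOST" --eval 'rs.isMaster().ismaster'`:

```shell
#!/bin/sh
# Hypothetical sketch of a "wait until the replset has a primary" retry
# loop, in the spirit of the linked puppetlabs-mongodb patch. The probe
# command is passed in so the loop itself is self-contained.

# wait_for_primary PROBE_CMD MAX_ATTEMPTS SLEEP_SECONDS
# Retries PROBE_CMD until it succeeds (exit 0) or attempts run out.
wait_for_primary() {
    probe=$1; max=$2; delay=$3
    i=1
    while [ "$i" -le "$max" ]; do
        if "$probe"; then
            return 0   # a primary was found; safe to reconfigure
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1           # still no primary after $max attempts
}
```

Polling like this before issuing `rs.add()` would cover the intermediate states mentioned above (e.g. an election still in progress after a restart), instead of failing immediately with "Can't find master host".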
--- Additional comment from Jiri Stransky on 2016-11-04 10:24:14 EDT ---

Also, I think we could try collecting workaround info -- e.g. when we run the same `overcloud deploy` command a 2nd time after the failed scale-up, does it succeed?

--- Additional comment from Marius Cornea on 2016-11-05 07:56:05 EDT ---

The issue is showing up only in scale-out scenarios so far, so I'm cloning this BZ to a new one so we can keep track of it. I will provide the required info on the new one.
Stack update succeeds when re-running the deploy command after the failed attempt.
Marius, can we say that the "stack update" re-run is a valid workaround?
12:20 < slagle> mcornea: for the mongodb scale out issue, you can just rerun the deploy command and it will fix it, correct?
12:20 < mcornea> slagle: yes, it completes fine when rerun
Confirmed not a blocker, but we do need to provide the doc_text here. jistr, can you fill that out?
Note: MongoDB has been retired from the RDO dependencies repo in Queens [1].

[1] https://lists.rdoproject.org/pipermail/dev/2018-January/008518.html
Switching this to documentation because we've retired MongoDB and there doesn't appear to be a code fix for the issue.
This has been documented as a Known Issue in our Release Notes:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html/release_notes/chap-release_notes#idm139726210904576

Closing this BZ. If any further changes are required, please let us know.