Bug 1392155 - Scaling out the nodes that run the MongoDB service sporadically fails
Summary: Scaling out the nodes that run the MongoDB service sporadically fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 11.0 (Ocata)
Assignee: RHOS Documentation Team
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On: 1378472
Blocks:
 
Reported: 2016-11-05 11:57 UTC by Marius Cornea
Modified: 2018-03-20 01:22 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
A race condition can occur between the Puppet and MongoDB services. Consequently, scaling out nodes that run the MongoDB database fails and the overcloud stack does not update. As a workaround, running the same deployment command again makes the MongoDB nodes scale out successfully.
Clone Of: 1378472
Environment:
Last Closed: 2018-03-20 01:22:05 UTC
Target Upstream Version:
Embargoed:



Description Marius Cornea 2016-11-05 11:57:30 UTC
+++ This bug was initially created as a clone of Bug #1378472 +++

I am deploying a puddle from Sep 20th. Clean deployment.

Overcloud deployment fails because MongoDB is trying to talk to itself and form a cluster when it's not yet ready to do so.

    Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary.
    (truncated, view all with --long)
overcloud.ComputeNodesPostDeployment.ComputeOvercloudServicesDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: d9489409-1bce-47e0-b11a-1a19e58234cf
  status: CREATE_FAILED

--- Additional comment from Jiri Stransky on 2016-09-22 09:56:30 EDT ---

We were hitting clustering issues with MongoDB upstream too and fixed them recently (linking bug and patch). The message we caught was slightly different:

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for replicaset tripleo.

But given that the issue is intermittent, it's possible that the underlying race condition manifested in slightly different ways, so it may be the same bug. I'm not sure the fix will prevent the error above as well, but given that we no longer hit MongoDB clustering issues upstream after that fix, I'm hopeful we won't hit them downstream either.

--- Additional comment from Marius Cornea on 2016-09-22 10:12:57 EDT ---

(In reply to Jiri Stransky from comment #1)
> We were hitting clustering issues with MongoDB upstream too and fixed them
> recently (linking bug and patch). The message we caught was slightly
> different:
> 
> Error:
> /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/
> Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for
> replicaset tripleo.
> 
> But given that the issue is intermittent, it's possible that the likely race
> condition might have manifested in slightly different ways, so perhaps it's
> the same bug. I'm not sure the fix will prevent the error above too, but
> given that we no longer hit MongoDB clustering issues upstream after that
> fix, i'm hopeful we won't hit them downstream either.

From what I've seen on Lesik's setup it failed with the following error on controller-0:

Debug: Executing '/bin/mongo admin --quiet --host 10.35.169.11:27017 --eval printjson(rs.add('10.35.169.10:27017'))'                                           
Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo:replSetReconfig command must be sent to the current replica set primary.

And this happened because 10.35.169.11 was a secondary replica and not primary.

[root@overcloud-controller-0 heat-admin]# /bin/mongo admin --quiet --host 10.35.169.11:27017 <<<'rs.status()'
{
	"set" : "tripleo",
	"date" : ISODate("2016-09-22T11:16:23Z"),
	"myState" : 2,
	"syncingTo" : "10.35.169.16:27017",
	"members" : [
		{
			"_id" : 0,
			"name" : "10.35.169.11:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 398,
			"optime" : Timestamp(1474539604, 9),
			"optimeDate" : ISODate("2016-09-22T10:20:04Z"),
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "10.35.169.16:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 398,
			"optime" : Timestamp(1474539604, 9),
			"optimeDate" : ISODate("2016-09-22T10:20:04Z"),
			"lastHeartbeat" : ISODate("2016-09-22T11:16:21Z"),
			"lastHeartbeatRecv" : ISODate("2016-09-22T11:16:22Z"),
			"pingMs" : 0,
			"electionTime" : Timestamp(1474542593, 1),
			"electionDate" : ISODate("2016-09-22T11:09:53Z")
		}
	],
	"ok" : 1
}
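The rs.status() output above shows exactly why the rs.add() call failed: it was sent to the member in state 2 (SECONDARY) instead of the PRIMARY at 10.35.169.16:27017. A minimal Python sketch of the primary-lookup a provider would need before issuing rs.add() (member structure taken from the output above; the function name is illustrative, not from puppetlabs-mongodb):

```python
# Sketch: locate the PRIMARY member in an rs.status() document so that
# replSetReconfig/rs.add() is sent to the right node. The member fields
# mirror the rs.status() output shown above.

def find_primary(members):
    """Return the host:port of the PRIMARY member, or None if no primary."""
    for member in members:
        if member.get("stateStr") == "PRIMARY":
            return member["name"]
    return None

status = {
    "set": "tripleo",
    "members": [
        {"_id": 0, "name": "10.35.169.11:27017", "stateStr": "SECONDARY"},
        {"_id": 1, "name": "10.35.169.16:27017", "stateStr": "PRIMARY"},
    ],
}

print(find_primary(status["members"]))  # 10.35.169.16:27017
```

With that lookup, the provider would have targeted 10.35.169.16 rather than the secondary at 10.35.169.11.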

--- Additional comment from Jiri Stransky on 2016-09-22 10:42:10 EDT ---

Thanks for the additional info. I think the mongodb puppet module should have built-in logic to send this to the primary node when adding a node into the cluster.

I'm still wondering if this might be caused by the same race condition that we try to solve (and looks like we solved) by waiting for connectability. I'm hopeful but not certain :)


https://github.com/puppetlabs/puppetlabs-mongodb/blob/1cfb235894795f216ce3ae3fc02eb52d112e9197/lib/puppet/provider/mongodb_replset/mongo.rb#L107-L115

https://github.com/puppetlabs/puppetlabs-mongodb/blob/1cfb235894795f216ce3ae3fc02eb52d112e9197/lib/puppet/provider/mongodb_replset/mongo.rb#L233-L256


So I wonder if we're just trying to connect to the cluster while it's in some intermediate state, perhaps after a restart due to a config change or something like that. The interesting part of the upstream issue was that we intermittently couldn't find the master node during step 3, even though the MongoDB cluster had been formed successfully in step 2.
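The "waiting for connectability" fix mentioned earlier amounts to polling until the replica set reports a reachable primary, rather than acting while the cluster is in an intermediate state. A hedged Python sketch of that retry idea (check_primary is a hypothetical stand-in for the real mongo query; nothing here is taken from the actual puppetlabs-mongodb code):

```python
# Sketch: poll for a primary with a bounded number of attempts instead of
# failing on the first "Can't find master host" response.
import time

def wait_for_primary(check_primary, attempts=5, delay=0.01):
    """Retry check_primary() until it returns a host, else raise."""
    for _ in range(attempts):
        host = check_primary()
        if host:
            return host
        time.sleep(delay)
    raise RuntimeError("Can't find master host for replicaset tripleo.")

# Simulated cluster that only settles on the third poll:
responses = iter([None, None, "10.35.169.16:27017"])
print(wait_for_primary(lambda: next(responses)))  # 10.35.169.16:27017
```

This mirrors the observed behavior of the bug: a second attempt after the cluster settles (here, a second deploy run) succeeds where the first one raced the election.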

--- Additional comment from Marius Cornea on 2016-11-03 11:46:19 EDT ---

I'm sporadically seeing this issue when scaling out from 2 nodes running Mongo to 3 nodes, with different errors:

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for replicaset tripleo.

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary.


I checked the puppet-tripleo module and it contains the referenced patch.

--- Additional comment from Jiri Stransky on 2016-11-04 10:22:16 EDT ---

Would you have some logs of the failures? Possibly at least the os-collect-config + heat resource-list, so that we can see which step failed and whether it was after a mongodb restart (due to a config change for example) etc.

Do we ever hit this on initial deployment nowadays, or only on updates?

I suspect this issue will be very elusive, so any info that can help us narrow down when it happens is much appreciated.

--- Additional comment from Jiri Stransky on 2016-11-04 10:24:14 EDT ---

Also, I think we could try collecting any workaround info -- e.g. when we run the same `overcloud deploy` command a second time after the failed scale-up, does it succeed?

--- Additional comment from Marius Cornea on 2016-11-05 07:56:05 EDT ---

So far the issue shows up only in scale-out scenarios, so I'm cloning this BZ to a new one so we can keep track of it. I will provide the requested info on the new one.

Comment 2 Marius Cornea 2016-11-05 12:29:45 UTC
Stack update succeeds when re-running the deploy command after the failed attempt.

Comment 3 Jaromir Coufal 2016-11-07 16:01:02 UTC
Marius, can we say that the "stack update" re-run is a valid workaround?

Comment 4 James Slagle 2016-11-07 17:22:24 UTC
12:20 <    slagle> mcornea: for the mongodb scale out issue, you can just rerun the deploy command and it will fix it, correct?
12:20 <   mcornea> slagle: yes, it completes fine when rerun

Comment 5 James Slagle 2016-11-07 17:23:03 UTC
Confirmed not a blocker, but we do need to provide the doc_text here. jistr, can you fill that out?

Comment 9 Bogdan Dobrelya 2018-02-12 11:57:01 UTC
Note: MongoDB has been retired from the RDO dependencies repo in Queens [1]
    
[1] https://lists.rdoproject.org/pipermail/dev/2018-January/008518.html

Comment 10 Alex Schultz 2018-03-02 18:21:50 UTC
Switching this to documentation because we've retired MongoDB and there doesn't appear to be a code fix for the issue.

Comment 11 Dan Macpherson 2018-03-20 01:22:05 UTC
This has been documented as a Known Issue in our Release Notes:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html/release_notes/chap-release_notes#idm139726210904576

Closing this BZ. If any further changes are required, please let us know.

