Bug 1378472 - Overcloud deployment fails because MongoDB is trying to talk to itself and form/reassure a cluster when it's not yet ready to do so.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Jiri Stransky
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks: 1392155
 
Reported: 2016-09-22 13:35 UTC by Leonid Natapov
Modified: 2016-12-14 16:04 UTC
CC List: 13 users

Fixed In Version: puppet-tripleo-5.3.0-6.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1392155
Environment:
Last Closed: 2016-12-14 16:04:04 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:2948 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 enhancement update 2016-12-14 19:55:27 UTC
OpenStack gerrit 371596 None None None 2016-09-22 13:56:29 UTC
Launchpad 1624420 None None None 2016-09-22 13:57:58 UTC

Description Leonid Natapov 2016-09-22 13:35:20 UTC
I am deploying a puddle from Sep 20th. Clean deployment.

Overcloud deployment fails because MongoDB is trying to talk to itself and form/reassure a cluster when it's not yet ready to do so.

    Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary.
    (truncated, view all with --long)
overcloud.ComputeNodesPostDeployment.ComputeOvercloudServicesDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: d9489409-1bce-47e0-b11a-1a19e58234cf
  status: CREATE_FAILED

Comment 1 Jiri Stransky 2016-09-22 13:56:30 UTC
We were hitting clustering issues with MongoDB upstream too and fixed them recently (linking bug and patch). The message we caught was slightly different:

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for replicaset tripleo.

But given that the issue is intermittent, it's possible that the underlying race condition manifested in slightly different ways, so perhaps it's the same bug. I'm not sure the fix will prevent the error above too, but given that we no longer hit MongoDB clustering issues upstream after that fix, I'm hopeful we won't hit them downstream either.
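
For context, the fix referenced here works by waiting for MongoDB to accept connections before any replica set operations are attempted. A minimal sketch of that kind of wait loop (the host, port and retry budget are illustrative placeholders, not taken from the actual patch):

    # Retry a ping until mongod accepts connections, then proceed
    # with the replica set configuration.
    for i in $(seq 1 60); do
        if /bin/mongo admin --quiet --host 10.35.169.11:27017 \
               --eval 'db.runCommand({ping: 1}).ok' >/dev/null 2>&1; then
            break
        fi
        sleep 5
    done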

Comment 2 Marius Cornea 2016-09-22 14:12:57 UTC
(In reply to Jiri Stransky from comment #1)

From what I've seen on Lesik's setup it failed with the following error on controller-0:

Debug: Executing '/bin/mongo admin --quiet --host 10.35.169.11:27017 --eval printjson(rs.add('10.35.169.10:27017'))'
Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary.

And this happened because 10.35.169.11 was a secondary replica and not primary.

[root@overcloud-controller-0 heat-admin]# /bin/mongo admin --quiet --host 10.35.169.11:27017 <<<'rs.status()'
{
	"set" : "tripleo",
	"date" : ISODate("2016-09-22T11:16:23Z"),
	"myState" : 2,
	"syncingTo" : "10.35.169.16:27017",
	"members" : [
		{
			"_id" : 0,
			"name" : "10.35.169.11:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 398,
			"optime" : Timestamp(1474539604, 9),
			"optimeDate" : ISODate("2016-09-22T10:20:04Z"),
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "10.35.169.16:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 398,
			"optime" : Timestamp(1474539604, 9),
			"optimeDate" : ISODate("2016-09-22T10:20:04Z"),
			"lastHeartbeat" : ISODate("2016-09-22T11:16:21Z"),
			"lastHeartbeatRecv" : ISODate("2016-09-22T11:16:22Z"),
			"pingMs" : 0,
			"electionTime" : Timestamp(1474542593, 1),
			"electionDate" : ISODate("2016-09-22T11:09:53Z")
		}
	],
	"ok" : 1
}
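
For reference, which member is currently primary can be queried from any member before issuing rs.add() -- a diagnostic sketch reusing the addresses above, not something the module itself does:

    # Ask a member which node it currently sees as primary:
    /bin/mongo admin --quiet --host 10.35.169.11:27017 \
        --eval 'db.isMaster().primary'
    # For the rs.status() output above this would print 10.35.169.16:27017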

Comment 3 Jiri Stransky 2016-09-22 14:42:10 UTC
Thanks for the additional info. I think the mongodb puppet module should have built-in logic to send the replSetReconfig to the primary node when adding a node to the cluster.

I'm still wondering if this might be caused by the same race condition that we tried to solve (and apparently solved) by waiting for connectability. I'm hopeful but not certain :)


https://github.com/puppetlabs/puppetlabs-mongodb/blob/1cfb235894795f216ce3ae3fc02eb52d112e9197/lib/puppet/provider/mongodb_replset/mongo.rb#L107-L115

https://github.com/puppetlabs/puppetlabs-mongodb/blob/1cfb235894795f216ce3ae3fc02eb52d112e9197/lib/puppet/provider/mongodb_replset/mongo.rb#L233-L256


So I wonder if we're just trying to connect to the cluster while it's in some intermediate state, perhaps after a restart due to a config change or something like that. The interesting part about the upstream issue was that we intermittently couldn't find the master node during step 3, even though the mongodb cluster had been successfully formed in step 2.
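
If that's the case, the intermediate state should be visible in the member states at the moment of failure. A quick check along these lines (illustrative, reusing the host from comment #2):

    # Print each member's replica set state; values like STARTUP,
    # STARTUP2 or RECOVERING instead of PRIMARY/SECONDARY would
    # suggest the cluster was caught mid-transition.
    /bin/mongo admin --quiet --host 10.35.169.11:27017 \
        --eval 'rs.status().members.forEach(function(m) { print(m.name + " " + m.stateStr); })'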

Comment 4 Marius Cornea 2016-11-03 15:46:19 UTC
I'm sporadically seeing this issue when scaling out from 2 nodes running Mongo to 3 nodes, with different errors:

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: Can't find master host for replicaset tripleo.

Error: /Stage[main]/Tripleo::Profile::Base::Database::Mongodb/Mongodb_replset[tripleo]: Could not evaluate: rs.add() failed to add host to replicaset tripleo: replSetReconfig command must be sent to the current replica set primary.


I checked the puppet-tripleo module and it contains the referenced patch.

Comment 5 Jiri Stransky 2016-11-04 14:22:16 UTC
Would you have some logs of the failures? At least the os-collect-config log and the heat resource-list output, so that we can see which step failed and whether it came after a mongodb restart (due to a config change, for example).

Do we ever hit this on initial deployment nowadays, or only on updates?

I suspect this issue will be very elusive, so any info that can help us narrow down when it happens is much appreciated.
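
For reference, that information could be gathered roughly as follows (standard TripleO commands; the stack name and service unit shown are the usual defaults, adjust as needed):

    # On the undercloud: locate the failed step/resource.
    heat resource-list --nested-depth 5 overcloud | grep -i FAILED

    # On the affected overcloud node: the puppet runs are logged by
    # os-collect-config.
    journalctl -u os-collect-config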

Comment 6 Jiri Stransky 2016-11-04 14:24:14 UTC
Also I think we could try collecting any workaround info -- e.g. when we run the same `overcloud deploy` command a 2nd time after the failed scale-up, does it succeed?

Comment 7 Marius Cornea 2016-11-05 11:56:05 UTC
So far the issue shows up only in scale-out scenarios, so I'm cloning this BZ to a new one so we can keep track of it. I'll provide the requested info on the new one.

Comment 11 Marius Cornea 2016-11-22 09:35:21 UTC
I wasn't able to reproduce this issue, so I'm moving it to verified.

Comment 14 errata-xmlrpc 2016-12-14 16:04:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

