Bug 1264048 - MongoDB replica set sometimes cannot be created in an OpenShift cluster with multiple nodes
Summary: MongoDB replica set sometimes cannot be created in an OpenShift cluster with multiple nodes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OKD
Classification: Red Hat
Component: Image
Version: 3.x
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
Assignee: Ben Parees
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-09-17 11:56 UTC by Wang Haoran
Modified: 2016-12-01 22:46 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-01 22:46:25 UTC
Target Upstream Version:
Embargoed:



Description Wang Haoran 2015-09-17 11:56:25 UTC
Description of problem:

When the OpenShift cluster has more than one schedulable node, a MongoDB replica set sometimes cannot be created with the mongodb-24-centos7 and mongodb-24-rhel7 images.
Version-Release number of selected component (if applicable):
mongodb-24-centos7:8250de002ec4
mongodb-24-rhel7: 354cfc2e45d9
How reproducible:

Sometimes (high percentage)
Steps to Reproduce:
1. Create a project.
2. Download the template file:
  https://raw.githubusercontent.com/openshift/mongodb/master/2.4/examples/replica/mongodb-clustered.json 
3. Update the image referenced in mongodb-clustered.json.
4. oc process -f mongodb-clustered.json | oc create -f -
   (the full sequence is sketched below)
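
For reference, the whole sequence can be run roughly like this (the project name is just an example, and the image edit in step 3 is done by hand):

  # 1. create a project (name is an example)
  oc new-project mongodb-replica-test

  # 2. download the clustered template
  curl -O https://raw.githubusercontent.com/openshift/mongodb/master/2.4/examples/replica/mongodb-clustered.json

  # 3. edit the "image" field in mongodb-clustered.json to point at the
  #    mongodb-24-centos7 / mongodb-24-rhel7 image under test

  # 4. process the template and create the objects
  oc process -f mongodb-clustered.json | oc create -f -

  # watch the pods come up
  oc get pods -w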
Actual results:
The cluster is not formed correctly; rs.status() reports:
rs0:SECONDARY> rs.status()
{
	"set" : "rs0",
	"date" : ISODate("2015-09-17T11:35:03Z"),
	"myState" : 2,
	"syncingTo" : "10.1.2.32:27017",
	"members" : [
		{
			"_id" : 0,
			"name" : "10.1.2.33:27017",
			"health" : 0,
			"state" : 8,
			"stateStr" : "(not reachable/healthy)",
			"uptime" : 0,
			"optime" : Timestamp(1442489547, 1),
			"optimeDate" : ISODate("2015-09-17T11:32:27Z"),
			"lastHeartbeat" : ISODate("2015-09-17T11:34:58Z"),
			"lastHeartbeatRecv" : ISODate("2015-09-17T11:32:54Z"),
			"pingMs" : 0,
			"syncingTo" : "10.1.2.32:27017"
		},
		{
			"_id" : 1,
			"name" : "10.1.2.31:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 194,
			"optime" : Timestamp(1442489547, 1),
			"optimeDate" : ISODate("2015-09-17T11:32:27Z"),
			"self" : true
		},
		{
			"_id" : 2,
			"name" : "10.1.2.32:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 171,
			"optime" : Timestamp(1442489547, 1),
			"optimeDate" : ISODate("2015-09-17T11:32:27Z"),
			"lastHeartbeat" : ISODate("2015-09-17T11:35:02Z"),
			"lastHeartbeatRecv" : ISODate("2015-09-17T11:35:02Z"),
			"pingMs" : 0,
			"syncingTo" : "10.1.2.33:27017"
		}
	],
	"ok" : 1
}


Expected results:
The replica set should be created successfully, with all members reachable and healthy.

Additional info:
It works with a single-node OpenShift cluster.
http://ur1.ca/nt5hm

Comment 1 Rodolfo Carvalho 2015-09-21 08:26:25 UTC
Wang Haoran, what's the status of the pods at the point you inspect the replica set status?

It seems that the replica set is created: we have a PRIMARY and a SECONDARY, you can connect to them, etc., but one of the members is unreachable. Thus, I'd say the problem is not "couldn't create replica set" but rather "replica set cannot reach one pod after initialization".

This seems more like a network issue to me.

Could you please check the pod status and also verify that you can access the pods by IP from one node to the other?
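
For example (the pod name is a placeholder and the IP is the one from the rs.status() output above):

  # pod status and node placement; "oc describe pod" also shows the pod IP
  oc get pods -o wide
  oc describe pod <mongodb-pod>

  # from one node (or a pod on it), ping a pod IP that lives on the other node
  ping -c 3 10.1.2.33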

Comment 2 Wang Haoran 2015-09-21 09:16:09 UTC
1. All the pods are running.
2. For the member with "stateStr" : "(not reachable/healthy)", the IP recorded in the replica set is wrong: inspecting the pods shows that the corresponding pod has a different IP, and no pod has the IP 10.1.2.33 at all. It is unclear where that IP came from (a sketch of the cross-check is below).
3. We can reach the pods by IP from one node to the other using ping.
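
A sketch of that cross-check (the pod name and admin credentials are assumptions, not values from this report): compare the member addresses stored in the replica set config with the IPs of the actual pods.

  # member addresses as recorded in the replica set config,
  # printed from inside one of the mongodb pods
  oc exec <mongodb-pod> -- bash -c \
    'mongo admin -u admin -p "$MONGODB_ADMIN_PASSWORD" --quiet \
       --eval "rs.conf().members.forEach(function(m){ print(m.host) })"'

  # IPs of the actual pods, to compare against the list above
  oc describe pods | grep -E '^(Name|IP):'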

Comment 3 Rodolfo Carvalho 2015-09-21 09:31:35 UTC
Wang Haoran, could you try this command?


dig mongodb A +search +short


Run it from a pod on node1 and then from a pod on node2 (one way to do that is sketched below).
It's what we use to find the IPs (https://github.com/openshift/mongodb/blob/master/2.4/contrib/common.sh#L86)
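
For example (pod names are placeholders; pick one mongodb pod running on each node):

  oc exec <mongodb-pod-on-node1> -- dig mongodb A +search +short
  oc exec <mongodb-pod-on-node2> -- dig mongodb A +search +short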


Michal, when you implemented MongoDB replication, do you remember if you tested this in a multi-node setup?

Comment 4 Wang Haoran 2015-09-21 09:41:02 UTC
FYI
http://ur1.ca/ntyjd

Comment 5 Rodolfo Carvalho 2015-09-21 10:21:10 UTC
Wang Haoran, I talked with Michal, and the conclusion we came to is that this is probably slow/stale DNS...

When the replica set initialization runs, the DNS records are not yet up to date, so we end up putting the wrong IPs into the replica set config.

As you demonstrated, running the same dig command later gives the right IPs.


Fixing this would require changing the way we set up the replica set so that it continuously monitors for new IPs and removes unreachable members.
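
Just to illustrate the idea (this is not the image's actual code; the service name, port, credentials, and interval are assumptions, and rs.add()/rs.remove() have to be run against the current PRIMARY), such a reconciliation loop could look roughly like:

  #!/bin/bash
  # illustrative sketch: periodically reconcile replica set members with DNS
  while true; do
    # endpoints currently published in DNS for the "mongodb" service
    dns_ips=$(dig mongodb A +search +short)

    # hosts currently present in the replica set config
    rs_hosts=$(mongo admin -u admin -p "$MONGODB_ADMIN_PASSWORD" --quiet \
        --eval 'rs.conf().members.forEach(function(m){ print(m.host) })')

    # add endpoints DNS knows about that are missing from the config
    for ip in $dns_ips; do
      echo "$rs_hosts" | grep -q "^$ip:27017$" || \
        mongo admin -u admin -p "$MONGODB_ADMIN_PASSWORD" --quiet \
            --eval "rs.add('$ip:27017')"
    done

    # remove members whose address no longer resolves
    for host in $rs_hosts; do
      echo "$dns_ips" | grep -q "^${host%:*}$" || \
        mongo admin -u admin -p "$MONGODB_ADMIN_PASSWORD" --quiet \
            --eval "rs.remove('$host')"
    done

    sleep 10
  done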

I've added this BZ to an existing Trello card:
https://trello.com/c/YoDX7nsm

With your agreement we can track it from there.

Comment 6 Wang Haoran 2015-09-22 01:39:37 UTC
That's ok, let's track it from the card.

Comment 7 Rodolfo Carvalho 2016-02-10 10:55:49 UTC
This PR might help https://github.com/openshift/mongodb/pull/98

Comment 8 Rodolfo Carvalho 2016-06-23 20:27:06 UTC
Wang Haoran, could you please try to reproduce this so we can see where we stand?
There have been lots of changes to the image since this was reported.

Thanks!

Comment 9 Wang Haoran 2016-06-24 03:09:26 UTC
Hi, using the released mongodb-24-rhel7 image, after running oc new-app mongodb-cluster.json:

oc get pod:
[root@host-8-172-89 ~]# oc get pod
NAME                  READY     STATUS             RESTARTS   AGE
mongodb-1-3e21x       1/1       Running            0          5m
mongodb-1-deploy      1/1       Running            0          10m
mongodb-1-hook-post   0/1       CrashLoopBackOff   6          10m
mongodb-1-pv39a       1/1       Running            0          10m
mongodb-1-qcj5d       1/1       Running            0          10m


oc logs mongodb-1-hook-post

https://paste.fedoraproject.org/383955/46673772/
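
To gather more detail on the crash-looping post hook, something like this can also help (standard oc commands, nothing specific to this report):

  # logs from the previous (crashed) container run, plus pod events
  oc logs -p mongodb-1-hook-post
  oc describe pod mongodb-1-hook-post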

Comment 10 Ben Parees 2016-12-01 22:46:25 UTC
The old-style MongoDB cluster/replica support is no longer considered strategic (we've moved to using PetSets), so I'm closing this as WONTFIX.

