Description of problem:
When the OpenShift cluster has more than 2 schedulable nodes, creating a MongoDB replica set with the images mongodb-24-centos7 and mongodb-24-rhel7 sometimes fails.

Version-Release number of selected component (if applicable):
mongodb-24-centos7: 8250de002ec4
mongodb-24-rhel7: 354cfc2e45d9

How reproducible:
Sometimes (high percentage)

Steps to Reproduce:
1. Create a project.
2. Download the file: https://raw.githubusercontent.com/openshift/mongodb/master/2.4/examples/replica/mongodb-clustered.json
3. Update the image in the file mongodb-clustered.json.
4. oc process -f mongodb-clustered.json | oc create -f -

Actual results:
The cluster is not created correctly; rs.status() shows:

rs0:SECONDARY> rs.status()
{
    "set" : "rs0",
    "date" : ISODate("2015-09-17T11:35:03Z"),
    "myState" : 2,
    "syncingTo" : "10.1.2.32:27017",
    "members" : [
        {
            "_id" : 0,
            "name" : "10.1.2.33:27017",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(1442489547, 1),
            "optimeDate" : ISODate("2015-09-17T11:32:27Z"),
            "lastHeartbeat" : ISODate("2015-09-17T11:34:58Z"),
            "lastHeartbeatRecv" : ISODate("2015-09-17T11:32:54Z"),
            "pingMs" : 0,
            "syncingTo" : "10.1.2.32:27017"
        },
        {
            "_id" : 1,
            "name" : "10.1.2.31:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 194,
            "optime" : Timestamp(1442489547, 1),
            "optimeDate" : ISODate("2015-09-17T11:32:27Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "10.1.2.32:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 171,
            "optime" : Timestamp(1442489547, 1),
            "optimeDate" : ISODate("2015-09-17T11:32:27Z"),
            "lastHeartbeat" : ISODate("2015-09-17T11:35:02Z"),
            "lastHeartbeatRecv" : ISODate("2015-09-17T11:35:02Z"),
            "pingMs" : 0,
            "syncingTo" : "10.1.2.33:27017"
        }
    ],
    "ok" : 1
}

Expected results:
The cluster should be created.

Additional info:
It works with a single-node OpenShift cluster.
http://ur1.ca/nt5hm
Wang Haoran, what's the status of the pods at the point you inspect the replica set status? Seems that the replica set is created, we have a PRIMARY and a SECONDARY, you can connect to them, etc, but one of the pods is unreachable. Thus, I'd say the problem is not "couldn't create replica set" but "replica set cannot access a pod after initialization". This seems more like a network issue to me. Could you please check the pod status and also verify that you can access the pods by IP from one node to the other?
1. All the pods are running.
2. For the member with "stateStr" : "(not reachable/healthy)", the IP in the replica set is incorrect. When we inspect the pod, we find that it has a different IP, which means there is no pod with IP 10.1.2.33. We don't know why that IP disappeared.
3. We can reach the pods by IP from one node to the other using the ping command.
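The mismatch described in point 2 can be checked mechanically. The following is a minimal Python sketch, not part of the image: it takes the members array from rs.status() and a set of pod IPs (collected by hand, for example from `oc get pods -o wide`) and reports members whose address no longer belongs to any pod. The function name `find_orphaned_members` and the pod IP 10.1.2.40 are hypothetical; the thread does not say which IP the pod actually received.

```python
def find_orphaned_members(members, pod_ips):
    """Return replica set member names whose IP matches no current pod."""
    orphaned = []
    for member in members:
        # Member names look like "10.1.2.33:27017"; keep only the host part.
        host = member["name"].rsplit(":", 1)[0]
        if host not in pod_ips:
            orphaned.append(member["name"])
    return orphaned


# Trimmed-down members taken from the rs.status() output in the report.
members = [
    {"_id": 0, "name": "10.1.2.33:27017", "health": 0},
    {"_id": 1, "name": "10.1.2.31:27017", "health": 1},
    {"_id": 2, "name": "10.1.2.32:27017", "health": 1},
]
# 10.1.2.40 is a made-up stand-in for the IP the third pod really has.
pod_ips = {"10.1.2.31", "10.1.2.32", "10.1.2.40"}

print(find_orphaned_members(members, pod_ips))  # ['10.1.2.33:27017']
```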
Wang Haoran, could you try this command?

dig mongodb A +search +short

Run this from a pod on node1, and then from a pod on node2. It's what we use to find the IPs (https://github.com/openshift/mongodb/blob/master/2.4/contrib/common.sh#L86).

Michal, when you implemented MongoDB replication, do you remember if you tested this in a multi-node setup?
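For scripting the same check, here is a rough Python counterpart of the dig lookup. It is a sketch under assumptions: it relies on the system resolver (which normally applies the search domains from /etc/resolv.conf, much like +search), and both the helper name `resolve_ipv4` and the injectable `lookup` parameter are mine, added so the function can be exercised without a live cluster.

```python
import socket


def resolve_ipv4(name, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses behind a DNS name."""
    try:
        infos = lookup(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # The name does not resolve (yet); treat it as "no endpoints".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (ip, port), so sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Inside a pod, `resolve_ipv4("mongodb")` would be the call corresponding to the dig command above. Comparing its result between a pod on node1 and a pod on node2 shows whether both see the same set of endpoints.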
FYI http://ur1.ca/ntyjd
Wang Haoran, I talked with Michal, and the conclusion we came to is that this might be slow DNS. When the initialization of the replica set runs, the DNS server still holds outdated records, so we write the wrong IPs into the replica set config. As you demonstrated, running the same dig command later gives the right IPs. Fixing this would require changing how we set up the replica set so that it constantly monitors for new IPs and removes unreachable ones. I've added this BZ to an existing Trello card: https://trello.com/c/YoDX7nsm With your agreement we can track it from there.
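To make the proposed fix concrete, below is a minimal Python sketch of one reconciliation step. It is not the image's implementation, only an illustration: given the current replica set config members and the IPs that DNS reports now, it drops members whose IP is gone and adds members for new IPs. The name `plan_members`, the default port, and the IP 10.1.2.40 are assumptions. A real monitor would run such a step periodically and apply the result from the primary with rs.reconfig().

```python
def plan_members(current, dns_ips, port=27017):
    """Compute the member list a replica set config should converge to.

    current: list of {"_id": int, "host": "ip:port"}, as in rs.conf().members
    dns_ips: iterable of IPs currently published for the service
    """
    wanted = {f"{ip}:{port}" for ip in dns_ips}
    # Keep existing members that DNS still lists, preserving their _id.
    kept = [m for m in current if m["host"] in wanted]
    known = {m["host"] for m in kept}
    # Allocate new _ids above every existing one, including removed members.
    next_id = max((m["_id"] for m in current), default=-1) + 1
    added = []
    for host in sorted(wanted - known):
        added.append({"_id": next_id, "host": host})
        next_id += 1
    return kept + added


# Hosts from the report; 10.1.2.40 is a hypothetical replacement pod IP.
current = [
    {"_id": 0, "host": "10.1.2.33:27017"},
    {"_id": 1, "host": "10.1.2.31:27017"},
    {"_id": 2, "host": "10.1.2.32:27017"},
]
print(plan_members(current, ["10.1.2.31", "10.1.2.32", "10.1.2.40"]))
```

Here the stale member 10.1.2.33 is dropped and 10.1.2.40 is added with `_id` 3. The sketch deliberately ignores concerns such as losing a majority while removing members, which a production version would have to handle.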
That's ok, let's track it from the card.
This PR might help https://github.com/openshift/mongodb/pull/98
Wang Haoran, could you please try to reproduce this so we can see where things stand? There have been a lot of changes to the image since this was reported. Thanks!
Hi, using the release mongodb-24-rhel7 image, after oc new-app mongodb-cluster.json, oc get pod shows:

[root@host-8-172-89 ~]# oc get pod
NAME                  READY     STATUS             RESTARTS   AGE
mongodb-1-3e21x       1/1       Running            0          5m
mongodb-1-deploy      1/1       Running            0          10m
mongodb-1-hook-post   0/1       CrashLoopBackOff   6          10m
mongodb-1-pv39a       1/1       Running            0          10m
mongodb-1-qcj5d       1/1       Running            0          10m

oc logs mongodb-1-hook-post:
https://paste.fedoraproject.org/383955/46673772/
The old-style MongoDB cluster/replica support is no longer considered strategic (we've moved to using PetSets), so I'm closing this as won't fix.