Bug 906885

Summary:	mco ping does not work after initial boot, would work if user restarts mcollective
Product:	OpenShift Container Platform	Reporter:	Peter Ruan <pruan>
Component:	Node	Assignee:	Jason DeTiberus <jdetiber>
Status:	CLOSED DUPLICATE	QA Contact:	libra bugs <libra-bugs>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	1.1.0	CC:	bfallonf, bleanhar, jdetiber, jpazdziora, libra-onpremise-devel, lmeyer, mmasters, myroslav, xjia
Target Milestone:	---	Keywords:	Reopened, Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-02-04 14:39:31 UTC	Type:	Bug
Bug Depends On:	978492
Bug Blocks:

Description Peter Ruan 2013-02-01 19:10:45 UTC

Description of problem:

mco ping does not work after the initial boot; it works once the user restarts mcollective.


[root@broker ~]# mco ping


---- ping statistics ----
No responses received
[root@broker ~]# /etc/init.d/activemq status
ActiveMQ Broker is running (2097).
[root@broker ~]# /etc/init.d/mcollective restart
Shutting down mcollective:                                 [  OK  ]
Starting mcollective:                                      [  OK  ]
[root@broker ~]# mco ping
broker.example.com                       time=70.11 ms


---- ping statistics ----
1 replies max: 70.11 min: 70.11 avg: 70.11 


Version-Release number of selected component (if applicable):


How reproducible:
always.

Steps to Reproduce:
1. Install a desktop version of RHEL 6.
2. Run the installation from https://github.com/openshift/enterprise
3. Reboot the system.
4. Run 'mco ping'.
  
Actual results:
no response

Expected results:
reply from broker node

Additional info:
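The manual check and workaround shown in the transcript above can be sketched as a small shell helper. This is only an illustration, not part of the product: the function names mco_ping_has_replies and ensure_mcollective_responds are hypothetical, and the reply pattern is taken from the single reply line in the transcript ("time=70.11 ms"), so it may need adjusting for other mco versions.

```shell
# Reads `mco ping` output on stdin and succeeds if at least one
# reply line of the form "<host> time=<n> ms" is present.
mco_ping_has_replies() {
    grep -Eq 'time=[0-9.]+ ms'
}

# Hypothetical workaround wrapper: ping once, and if nothing answers,
# restart mcollective and ping again (the manual steps from the transcript).
ensure_mcollective_responds() {
    if mco ping | mco_ping_has_replies; then
        return 0
    fi
    /etc/init.d/mcollective restart
    mco ping | mco_ping_has_replies
}
```

Running ensure_mcollective_responds on the affected host after boot would return 0 once a reply is seen, and non-zero if the restart did not help.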

Comment 7 Brenton Leanhardt 2013-05-30 13:09:27 UTC

I haven't been able to reproduce this, though I suspect others have seen it happen, based on error reports.  I spent some time looking through the stomp connector code; our openshift.sh installation script uses the default reconnect_delay of 5 seconds.  The default is also infinite retries.

Below are my stomp connection settings:

connector = stomp
plugin.stomp.host = activemq.example.com
plugin.stomp.port = 61613
plugin.stomp.user = mcollective
plugin.stomp.password = marionette

That said, Online has switched to using the activemq connector so we will do the same.  An issue such as this could easily be caused by a bug in the stomp connector.  We'll need to retest this scenario once we officially switch connectors.
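As a rough sketch of what the connector switch might look like for the stomp settings listed above, the filter below rewrites them into the pooled form used by the activemq connector. The key names (plugin.activemq.pool.size, plugin.activemq.pool.1.host and so on) are my understanding of the MCollective ActiveMQ connector's documented settings and should be checked against the installed version; the function name stomp_to_activemq is hypothetical, and this is not the change that was actually made to openshift.sh.

```shell
# Reads stomp connector settings on stdin and prints the assumed
# activemq-connector equivalents for a single-broker pool.
stomp_to_activemq() {
    sed -e 's/^connector[[:space:]]*=[[:space:]]*stomp[[:space:]]*$/connector = activemq/' \
        -e 's/^plugin\.stomp\.host\([[:space:]]*=\)/plugin.activemq.pool.1.host\1/' \
        -e 's/^plugin\.stomp\.port\([[:space:]]*=\)/plugin.activemq.pool.1.port\1/' \
        -e 's/^plugin\.stomp\.user\([[:space:]]*=\)/plugin.activemq.pool.1.user\1/' \
        -e 's/^plugin\.stomp\.password\([[:space:]]*=\)/plugin.activemq.pool.1.password\1/'
    # The pooled connector also needs to know how many brokers are listed.
    echo 'plugin.activemq.pool.size = 1'
}
```

Feeding it the five lines above would yield connector = activemq plus the four pool.1 entries and the pool size line; all other lines pass through unchanged.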

Also, I don't think the registerinterval setting is used for what we think: http://docs.puppetlabs.com/mcollective/reference/plugins/registration.html.  I'm also going to remove that setting.

Comment 8 Brenton Leanhardt 2013-08-15 15:08:44 UTC

If this happens again please let us know.  For now we're going to close it.

Comment 9 Luke Meyer 2013-09-20 00:28:20 UTC

I think I can at least give a reproducible scenario.

1. On broker: service activemq stop
2. Reboot node.
3. After the node returns and mcollective is up, service activemq start on broker
4. mco ping - the node does not respond
5. On node: service mcollective restart
6. mco ping - the node responds

It does seem to require rebooting the node - if you just restart mcollective when activemq is down, it picks up as you would expect when activemq comes back. I have no idea what the difference is.
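For anyone who wants to retry the scenario, the steps above can be sketched as a script. Everything beyond the service and mco commands themselves is an assumption: the host names, root ssh access from the machine running the script, the 30-second pause after the reboot, and the helper names on_host, wait_for_mcollective and repro are all hypothetical. By default it only prints what it would run.

```shell
BROKER=${BROKER:-broker.example.com}   # assumed broker host name
NODE=${NODE:-node.example.com}         # assumed node host name
DRY_RUN=${DRY_RUN:-1}                  # 1 = print commands only

# Run (or, in dry-run mode, just print) a command on a host.
on_host() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "[$1] $2"
    else
        ssh "root@$1" "$2"
    fi
}

# Wait until mcollective reports running on the node again.
wait_for_mcollective() {
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    sleep 30   # crude guess: give the node time to actually go down
    until ssh -o ConnectTimeout=5 "root@$NODE" 'service mcollective status' >/dev/null 2>&1; do
        sleep 5
    done
}

repro() {
    on_host "$BROKER" 'service activemq stop'
    on_host "$NODE" 'reboot' || true    # ssh may drop as the node goes down
    wait_for_mcollective
    on_host "$BROKER" 'service activemq start'
    on_host "$BROKER" 'mco ping'        # reported: the node does not respond
    on_host "$NODE" 'service mcollective restart'
    on_host "$BROKER" 'mco ping'        # reported: the node responds
}
```

Calling repro with the default DRY_RUN=1 lists the six commands in order; setting DRY_RUN=0 runs them over ssh.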

I tried adding to the server.cfg:
registerinterval = 30

This did not help.

Comment 10 Luke Meyer 2013-10-17 12:42:39 UTC

Well, crap; I can't seem to reliably reproduce this now.

Trying to isolate what exactly the problem is here; here are some experimental results:

1. "chkconfig mcollective off" and reboot the node with activemq down; then start mcollective, then start activemq. Result: the node connected.
2. Edit /etc/sysconfig/selinux and change SELINUX=enforcing to SELINUX=permissive; then reboot with activemq down and start activemq after boot. Result: the node connected.

I think there may be an element of timing here as well, i.e. mcollective gives up forever after X minutes of initial failure to connect. Or it could conceivably be an "only the first time mcollective ever attempts to connect and fails" problem. Not sure, but it's definitely still a problem in at least some circumstances.

Comment 11 Brenton Leanhardt 2014-02-04 14:39:31 UTC


*** This bug has been marked as a duplicate of bug 1028382 ***