906885 – mco ping does not work after initial boot, would work if user restarts mcollective

Keywords:
Status:	CLOSED DUPLICATE of bug 1028382
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	1.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Jason DeTiberus
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:	978492
Blocks:

Reported:	2013-02-01 19:10 UTC by Peter Ruan
Modified:	2014-02-04 14:39 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-02-04 14:39:31 UTC
Target Upstream Version:
Embargoed:

Description Peter Ruan 2013-02-01 19:10:45 UTC

Description of problem:

After the initial boot, mco ping receives no responses. It works once mcollective is restarted:


[root@broker ~]# mco ping


---- ping statistics ----
No responses received
[root@broker ~]# /etc/init.d/activemq status
ActiveMQ Broker is running (2097).
[root@broker ~]# /etc/init.d/mcollective restart
Shutting down mcollective:                                 [  OK  ]
Starting mcollective:                                      [  OK  ]
[root@broker ~]# mco ping
broker.example.com                       time=70.11 ms


---- ping statistics ----
1 replies max: 70.11 min: 70.11 avg: 70.11 


Version-Release number of selected component (if applicable):


How reproducible:
always.

Steps to Reproduce:
1. Install a desktop version of RHEL 6.
2. Run the installation from https://github.com/openshift/enterprise
3. Reboot the system.
4. Run 'mco ping'.
  
Actual results:
no response

Expected results:
reply from broker node

Additional info:
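
A minimal sketch of the manual workaround from the transcript above, assuming the broker and node run on the same host as in this report. The function names ping_has_replies and check_and_restart are hypothetical, and the check relies only on the time= reply lines visible in the mco ping output above. It does not address the underlying connection problem.

```shell
#!/bin/sh
# Hypothetical helper: reads `mco ping` output on stdin and succeeds
# only if at least one reply line (one containing "time=<n> ms") is present.
ping_has_replies() {
    grep -q 'time=[0-9.][0-9.]* ms'
}

# Hypothetical wrapper: restart mcollective when nothing answers.
# Assumes it runs on the host whose mcollective daemon should respond.
check_and_restart() {
    if mco ping 2>/dev/null | ping_has_replies; then
        echo "mcollective is answering"
    else
        echo "no replies; restarting mcollective"
        /etc/init.d/mcollective restart
    fi
}
```

Calling check_and_restart some time after boot would automate the restart step, at the cost of masking the failure rather than explaining it.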

Comment 7 Brenton Leanhardt 2013-05-30 13:09:27 UTC

I haven't been able to reproduce this, though I suspect others have seen it happen based on error reports. I spent some time looking through the stomp connector code; our openshift.sh installation script uses the default reconnect_delay of 5 seconds, and the default number of retries is infinite.

Below are my stomp connection settings:

connector = stomp
plugin.stomp.host = activemq.example.com
plugin.stomp.port = 61613
plugin.stomp.user = mcollective
plugin.stomp.password = marionette

That said, Online has switched to using the activemq connector so we will do the same.  An issue such as this could easily be caused by a bug in the stomp connector.  We'll need to retest this scenario once we officially switch connectors.

Also, I don't think registerinterval is used for what we think it is; see http://docs.puppetlabs.com/mcollective/reference/plugins/registration.html for details. I'm also going to remove that setting.
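
For reference, the behaviour expected from the settings described in this comment (a fixed 5-second delay between attempts and unlimited retries) can be sketched as below. This is not the stomp connector's code; retry_with_delay is a hypothetical helper that only illustrates why a daemon configured this way should eventually connect once ActiveMQ comes up.

```shell
#!/bin/sh
# Hypothetical helper, not part of mcollective.
# Usage: retry_with_delay DELAY MAX_ATTEMPTS CMD [ARGS...]
# MAX_ATTEMPTS=0 means retry forever, mirroring the "infinite retries" default.
retry_with_delay() {
    delay=$1
    max=$2
    shift 2
    attempt=1
    # Keep running the command until it succeeds.
    until "$@"; do
        # Give up only when a positive attempt limit has been reached.
        if [ "$max" -gt 0 ] && [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```

For example, assuming an nc build that supports -z, retry_with_delay 5 0 nc -z activemq.example.com 61613 would block until the stomp port accepts connections. A daemon that stops answering after a failed first connection is not following this pattern, which is consistent with suspecting a connector bug.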

Comment 8 Brenton Leanhardt 2013-08-15 15:08:44 UTC

If this happens again please let us know.  For now we're going to close it.

Comment 9 Luke Meyer 2013-09-20 00:28:20 UTC

I think I can at least give a reproducible scenario.

1. On the broker: service activemq stop
2. Reboot the node.
3. After the node returns and mcollective is up, run service activemq start on the broker.
4. mco ping - the node does not respond
5. On the node: service mcollective restart
6. mco ping - the node responds

It does seem to require rebooting the node - if you just restart mcollective when activemq is down, it picks up as you would expect when activemq comes back. I have no idea what the difference is.

I tried adding to the server.cfg:
registerinterval = 30

This did not help.

Comment 10 Luke Meyer 2013-10-17 12:42:39 UTC

Well, crap; I can't seem to reliably reproduce this now.

Trying to isolate what exactly the problem is here; here are some experimental results:

1. Run "chkconfig mcollective off" and reboot the node with activemq down; then start mcollective, then start activemq. Result: the node connected.
2. Edit /etc/sysconfig/selinux and change SELINUX from enforcing to permissive; then reboot with activemq down and start activemq after boot. Result: the node connected.

I think there may be an element of timing here as well, i.e. mcollective gives up forever after X minutes of initial failure to connect. Or it could conceivably be an "only the first time mcollective ever attempts to connect and fails" problem. Not sure, but it's definitely still a problem in at least some circumstances.

Comment 11 Brenton Leanhardt 2014-02-04 14:39:31 UTC


*** This bug has been marked as a duplicate of bug 1028382 ***
