Bug 906885

Summary: mco ping does not work after initial boot, would work if user restarts mcollective
Product: OpenShift Container Platform Reporter: Peter Ruan <pruan>
Component: Node Assignee: Jason DeTiberus <jdetiber>
Status: CLOSED DUPLICATE QA Contact: libra bugs <libra-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 1.1.0 CC: bfallonf, bleanhar, jdetiber, jpazdziora, libra-onpremise-devel, lmeyer, mmasters, myroslav, xjia
Target Milestone: --- Keywords: Reopened, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-02-04 14:39:31 UTC Type: Bug
Bug Depends On: 978492    
Bug Blocks:    

Description Peter Ruan 2013-02-01 19:10:45 UTC
Description of problem:

mco ping does not work after the initial boot; it works once the user restarts mcollective.


[root@broker ~]# mco ping


---- ping statistics ----
No responses received
[root@broker ~]# /etc/init.d/activemq status
ActiveMQ Broker is running (2097).
[root@broker ~]# /etc/init.d/mcollective restart
Shutting down mcollective:                                 [  OK  ]
Starting mcollective:                                      [  OK  ]
[root@broker ~]# mco ping
broker.example.com                       time=70.11 ms


---- ping statistics ----
1 replies max: 70.11 min: 70.11 avg: 70.11 


Version-Release number of selected component (if applicable):


How reproducible:
always.

Steps to Reproduce:
1. Install a desktop version of RHEL 6
2. Run the installation from https://github.com/openshift/enterprise
3. Reboot the system
4. Run 'mco ping'
  
Actual results:
no response

Expected results:
reply from broker node

Additional info:

Comment 7 Brenton Leanhardt 2013-05-30 13:09:27 UTC
I haven't been able to reproduce this, though I suspect others have seen it happen based on error reports.  I spent some time looking through the stomp connector code, and we are using the default reconnect_delay of 5 seconds in our openshift.sh installation script.  The default number of retries is also infinite.

Below are my stomp connection settings:

connector = stomp
plugin.stomp.host = activemq.example.com
plugin.stomp.port = 61613
plugin.stomp.user = mcollective
plugin.stomp.password = marionette

That said, Online has switched to using the activemq connector so we will do the same.  An issue such as this could easily be caused by a bug in the stomp connector.  We'll need to retest this scenario once we officially switch connectors.

Also, I don't think the registerinterval setting is used for what we think: http://docs.puppetlabs.com/mcollective/reference/plugins/registration.html.  I'm also going to remove that setting.
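
For reference, the behaviour I'd expect from a fixed reconnect_delay with infinite retries looks roughly like the sketch below. This is not the stomp connector's actual code; connect_with_retry and its parameters are hypothetical names used only to illustrate the expectation that a node keeps retrying until activemq is reachable.

```python
import time


def connect_with_retry(connect, reconnect_delay=5.0, max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds and return the number of attempts made.

    Hypothetical illustration, not the stomp connector's implementation.
    max_attempts=None means "retry forever", which is the default described above.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            connect()
            return attempts
        except ConnectionError:
            # Give up only when a finite retry limit was set and has been reached.
            if max_attempts is not None and attempts >= max_attempts:
                raise
            # Wait a fixed delay before the next attempt.
            sleep(reconnect_delay)
```

If the connector really behaved this way, a node that boots while activemq is down should connect on its own a few seconds after activemq comes up, which is not what the report describes.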

Comment 8 Brenton Leanhardt 2013-08-15 15:08:44 UTC
If this happens again please let us know.  For now we're going to close it.

Comment 9 Luke Meyer 2013-09-20 00:28:20 UTC
I think I can at least give a reproducible scenario.

1. On broker: service activemq stop
2. Reboot the node.
3. After the node returns and mcollective is up, on broker: service activemq start
4. mco ping - the node does not respond
5. On node: service mcollective restart
6. mco ping - the node responds
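
To check the two mco ping steps above from a script rather than by eye, something like the following could be used. It is only a sketch: count_ping_replies is a hypothetical helper, and the output formats it recognises are the ones shown in the description of this bug, so other mcollective versions may print something different.

```python
import re


def count_ping_replies(output):
    """Return how many nodes answered, given the text printed by `mco ping`.

    Assumes the output format shown in this bug's description.
    """
    if "No responses received" in output:
        return 0
    # Summary line, e.g. "1 replies max: 70.11 min: 70.11 avg: 70.11"
    match = re.search(r"^(\d+) replies\b", output, re.MULTILINE)
    if match:
        return int(match.group(1))
    # Fall back to counting per-node lines, e.g. "broker.example.com   time=70.11 ms"
    return len(re.findall(r"^\S+\s+time=[\d.]+ ms\s*$", output, re.MULTILINE))
```

A result of 0 after activemq is started and a non-zero result after restarting mcollective would match the failure described here.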

It does seem to require rebooting the node - if you just restart mcollective when activemq is down, it picks up as you would expect when activemq comes back. I have no idea what the difference is.

I tried adding to the server.cfg:
registerinterval = 30

This did not help.

Comment 10 Luke Meyer 2013-10-17 12:42:39 UTC
Well, crap; I can't seem to reliably reproduce this now.

Trying to isolate what exactly the problem is here; here are some experimental results:

1. "chkconfig mcollective off" and reboot the node with activemq down; then start mcollective, then start activemq. Result: the node connected.
2. Edit /etc/sysconfig/selinux and change enforcing to permissive; then reboot with activemq down and start activemq after boot. Result: the node connected.

I think there may be an element of timing here as well, i.e. mcollective gives up forever after X minutes of initial failure to connect. Or it could conceivably be an "only the first time mcollective ever attempts to connect and fails" problem. Not sure, but it's definitely still a problem in at least some circumstances.

Comment 11 Brenton Leanhardt 2014-02-04 14:39:31 UTC

*** This bug has been marked as a duplicate of bug 1028382 ***