Description of problem:
mco ping does not work after initial boot, would work if user restarts mcollective

[root@broker ~]# mco ping
---- ping statistics ----
No responses received
[root@broker ~]# /etc/init.d/activemq status
ActiveMQ Broker is running (2097).
[root@broker ~]# /etc/init.d/mcollective restart
Shutting down mcollective: [ OK ]
Starting mcollective: [ OK ]
[root@broker ~]# mco ping
broker.example.com time=70.11 ms
---- ping statistics ----
1 replies max: 70.11 min: 70.11 avg: 70.11

Version-Release number of selected component (if applicable):

How reproducible:
always.

Steps to Reproduce:
1. install a desktop version of RHEL6
2. run installation from https://github.com/openshift/enterprise
3. reboot system
4. do 'mco ping'

Actual results:
no response

Expected results:
reply from broker node

Additional info:
I haven't been able to reproduce this, though based on error reports I suspect others have seen it happen. I spent some time looking through the stomp connector code, and we are using the default reconnect_delay of 5 seconds in our openshift.sh installation script. The default is also infinite retries. Below are my stomp connection settings:

connector = stomp
plugin.stomp.host = activemq.example.com
plugin.stomp.port = 61613
plugin.stomp.user = mcollective
plugin.stomp.password = marionette

That said, Online has switched to using the activemq connector, so we will do the same. An issue such as this could easily be caused by a bug in the stomp connector. We'll need to retest this scenario once we officially switch connectors.

Also, I don't think registerinterval is used for what we think it is; see http://docs.puppetlabs.com/mcollective/reference/plugins/registration.html. I'm going to remove that setting as well.
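
For reference, here is a rough sketch of what the same connection could look like with the activemq connector. The setting names follow the MCollective ActiveMQ connector's pool syntax. The values are simply carried over from the stomp settings above and are an assumption, not the configuration Online or openshift.sh actually ships.

```
# Assumed equivalent of the stomp settings above, using a single-broker pool
connector = activemq
plugin.activemq.pool.size = 1
plugin.activemq.pool.1.host = activemq.example.com
plugin.activemq.pool.1.port = 61613
plugin.activemq.pool.1.user = mcollective
plugin.activemq.pool.1.password = marionette
```

mcollective has to be restarted on the node after the connector is changed for the new settings to take effect.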
If this happens again please let us know. For now we're going to close it.
I think I can at least give a reproducible scenario.

1. On broker: service activemq stop
2. Reboot node.
3. After the node returns and mcollective is up, service activemq start on broker
4. mco ping - the node does not respond
5. on node: service mcollective restart
6. mco ping - the node responds

It does seem to require rebooting the node. If you just restart mcollective while activemq is down, it picks up as you would expect when activemq comes back. I have no idea what the difference is.

I tried adding this to the server.cfg:

registerinterval = 30

This did not help.
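
To check steps 4 and 6 without reading the output by eye, here is a minimal shell sketch. It assumes reply lines in `mco ping` output look like the one in the original report (hostname, then `time=...`). The function name `ping_has_replies` and the host `node.example.com` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given `mco ping` output
# contains a reply line for the named host.
ping_has_replies() {
    output=$1
    host=$2
    # A reply line has the hostname in field 1 and "time=..." in field 2,
    # e.g. "broker.example.com   time=70.11 ms" as in the report above.
    printf '%s\n' "$output" |
        awk -v h="$host" '$1 == h && $2 ~ /^time=/ { found = 1 } END { exit !found }'
}

# Usage on the broker (left commented out so the block is safe to source):
# out=$(mco ping)
# if ! ping_has_replies "$out" node.example.com; then
#     echo "node.example.com did not answer; restart mcollective on the node"
# fi
```

Matching on the hostname rather than on "No responses received" also catches the case where other nodes answer but the rebooted one does not.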
Well, crap; I can't seem to reliably reproduce this now. I'm trying to isolate what exactly the problem is; here are some experimental results:

1. "chkconfig mcollective off" and reboot the node with activemq down; then start mcollective, then start activemq. Result: the node connected.
2. edit /etc/sysconfig/selinux and set enforcing to permissive; then reboot with activemq down and start activemq after boot. Result: the node connected.

I think there may be an element of timing here as well, i.e. mcollective gives up forever after X minutes of initial failure to connect. Or it could conceivably be an "only the first time mcollective ever attempts to connect and fails" problem. Not sure, but it's definitely still a problem in at least some circumstances.
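
One way to probe the timing theory would be to record how long the node takes to answer after activemq comes back, or whether it never does. The following is only a sketch: `wait_until` is a hypothetical helper, the 600 second limit and 10 second interval are arbitrary, and the reported time counts only the sleeps between attempts, so it is approximate.

```shell
#!/bin/sh
# Hypothetical helper: run a check command every INTERVAL seconds until it
# succeeds or roughly TIMEOUT seconds have passed, then report the outcome.
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "succeeded after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        # Only the sleep time is counted, not the time the check itself took.
        elapsed=$((elapsed + interval))
    done
    echo "gave up after ${timeout}s"
    return 1
}

# Usage on the broker, right after "service activemq start"
# (node.example.com is a placeholder hostname):
# wait_until 600 10 sh -c 'mco ping | grep -q "^node.example.com"'
```

Repeating this with activemq left down for different lengths of time before starting it would show whether there is a cutoff after which the node never reconnects.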
*** This bug has been marked as a duplicate of bug 1028382 ***