Bug 906885
Summary: | mco ping does not work after initial boot, would work if user restarts mcollective | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Peter Ruan <pruan> |
Component: | Node | Assignee: | Jason DeTiberus <jdetiber> |
Status: | CLOSED DUPLICATE | QA Contact: | libra bugs <libra-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 1.1.0 | CC: | bfallonf, bleanhar, jdetiber, jpazdziora, libra-onpremise-devel, lmeyer, mmasters, myroslav, xjia |
Target Milestone: | --- | Keywords: | Reopened, Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-02-04 14:39:31 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 978492 | ||
Bug Blocks: |
Description
Peter Ruan
2013-02-01 19:10:45 UTC
I haven't been able to reproduce this, though I suspect others have seen it happen based on error reports. I spent some time looking through the stomp connector code, and we are using the default reconnect_delay of 5 seconds in our openshift.sh installation script. The default is also infinite retries. Below are my stomp connection settings:

    connector = stomp
    plugin.stomp.host = activemq.example.com
    plugin.stomp.port = 61613
    plugin.stomp.user = mcollective
    plugin.stomp.password = marionette

That said, Online has switched to using the activemq connector, so we will do the same. An issue such as this could easily be caused by a bug in the stomp connector. We'll need to retest this scenario once we officially switch connectors.

Also, I don't think the registerinterval setting is used for what we think: http://docs.puppetlabs.com/mcollective/reference/plugins/registration.html. I'm also going to remove that setting.

If this happens again please let us know. For now we're going to close it.

I think I can at least give a reproducible scenario.

1. On broker: service activemq stop
2. Reboot node.
3. After the node returns and mcollective is up, service activemq start on broker
4. mco ping - the node does not respond
5. On node: service mcollective restart
6. mco ping - the node responds

It does seem to require rebooting the node. If you just restart mcollective while activemq is down, it picks up as you would expect when activemq comes back. I have no idea what the difference is.

I tried adding to the server.cfg:

    registerinterval = 30

This did not help.

Well, crap; I can't seem to reliably reproduce this now.

Trying to isolate what exactly the problem is here; here are some experimental results:

1. "chkconfig mcollective off" and reboot the node with activemq down; then start mcollective, then start activemq. Result: the node connected.
2. Edit /etc/sysconfig/selinux and set enforcing to permissive; then reboot with activemq down and start activemq after boot. Result: the node connected.

I think there may be an element of timing here as well, i.e. mcollective gives up forever after X minutes of initial failure to connect. Or it could conceivably be an "only the first time mcollective ever attempts to connect and fails" problem. Not sure, but it's definitely still a problem in at least some circumstances.

*** This bug has been marked as a duplicate of bug 1028382 ***
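
For anyone retesting, the two behaviours discussed in this bug (retry forever with a fixed reconnect delay, versus giving up after some number of failed attempts) can be told apart with a small simulation. The sketch below is only an illustration under stated assumptions: it is not the stomp or activemq connector code, the function name attempt_connection and the max_attempts parameter are hypothetical, and only the 5 second delay is taken from the comments above.

```python
import itertools


def attempt_connection(broker_up_at, reconnect_delay=5, max_attempts=None):
    """Simulate a client retrying a broker that becomes reachable at
    `broker_up_at` seconds after the client starts.

    Hypothetical model, not mcollective code. Time is simulated, so
    nothing actually sleeps. Returns the simulated second at which the
    connection succeeds, or None if the client gave up first.
    """
    # max_attempts=None models "infinite retries"; a number models a
    # client that stops trying after that many attempts.
    if max_attempts is None:
        attempts = itertools.count(1)
    else:
        attempts = range(1, max_attempts + 1)

    clock = 0
    for _ in attempts:
        if clock >= broker_up_at:
            return clock          # broker is reachable, connection succeeds
        clock += reconnect_delay  # wait a fixed delay before the next try
    return None                   # ran out of attempts and never connected


if __name__ == "__main__":
    # Broker comes back 12 s after the node boots.
    print(attempt_connection(12))                  # infinite retries
    print(attempt_connection(12, max_attempts=3))  # gives up before the broker is back
```

With infinite retries the simulated client always connects eventually, which matches what is seen when mcollective is restarted by hand while activemq is down. A node that never answers mco ping after a reboot looks more like the bounded case, which would fit the "gives up forever after X minutes" guess, though that remains unconfirmed here.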