Bug 1028382
Summary: broker cannot send requests to a node after the node is rebooted until the activemq service is restarted.

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Containers |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | low |
| Version | 2.0.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Reporter | Johnny Liu <jialiu> |
| Assignee | Luke Meyer <lmeyer> |
| QA Contact | libra bugs <libra-bugs> |
| CC | bleanhar, charles_sheridan, cknapp, jialiu, libra-onpremise-devel, lmeyer, mmasters, pep, pruan, wboessen |
| Clones | 1038806 (view as bug list) |
| Last Closed | 2014-07-07 12:56:16 UTC |
| Type | Bug |
| Bug Blocks | 1038806 |
Description (Johnny Liu, 2013-11-08 10:35:37 UTC)
I believe this is being addressed by Bug #1009887. Please retest with the latest RC.

Retested this bug against the 2.0/2013-11-22.1 puddle; it still failed. This issue is actually different from Bug #1009887. In Bug #1009887, activemq (perhaps hosted on the broker) is rebooted; in this bug, I am rebooting the node, NOT the activemq host. In addition, Bug #1009887 should only apply to ose-1.2; ose-2.0 does NOT have that problem: even when I did not add "registerinterval = 30" for mcollective and restarted the activemq service, the mcollective service could reconnect to activemq automatically.

Thank you for clarifying.

I'm seeing this problem too. I suspect it has something to do with the changes to topics/queues that we put in the config. Given that restarting activemq is fairly disruptive and this is probably hard to troubleshoot, I think we need to figure this out ASAP.

Just following up with what I've figured out so far. I think I understand at least what is happening, if not yet what to do about it.

When the mcollective server on a node connects to activemq, it subscribes itself to topics and queues so that the broker can send it requests. These work by the broker SENDing a message to a topic or queue, which activemq then MESSAGEs on to the appropriate subscribers. Nodes respond with the same mechanism in reverse, SENDing to a queue that the broker is subscribed to. I used the "Debugging" section at http://activemq.apache.org/stomp.html to configure logging on my message host so I could watch this in action; the following logs are from that.

Normally, when the node or broker disconnects from activemq, it first unsubscribes. However, when a node is simply rebooted, this doesn't appear to happen; evidently the mcollective server doesn't get a chance to disconnect properly. When it returns, it subscribes again, but as far as activemq knows, the first subscriber is still listening and expecting messages.
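The subscribe/SEND/MESSAGE exchange described here is plain STOMP. As a minimal sketch (my own helper function, not mcollective code; the destinations are modelled on the ones in this report), the frames look like this when built by hand:

```python
# Build raw STOMP 1.0-style frames: COMMAND, header lines, blank line,
# body, NUL terminator. This mirrors what mcollective's stomp connector
# puts on the wire in spirit, though not in exact headers.

def stomp_frame(command, headers, body=""):
    header_block = "".join(f"{k}:{v}\n" for k, v in headers.items())
    return f"{command}\n{header_block}\n{body}\x00"

# A node's mcollective server subscribing to the discovery topic.
subscribe = stomp_frame("SUBSCRIBE", {
    "destination": "/topic/mcollective.discovery.agent",
    "ack": "auto",
})

# The broker SENDing a discovery request, asking for replies on its own queue.
send = stomp_frame("SEND", {
    "destination": "/topic/mcollective.discovery.agent",
    "reply-to": "/queue/mcollective.reply.broker.example.com_6044",
}, body="<discovery payload>")

print(subscribe.split("\n", 1)[0])  # SUBSCRIBE
print(send.split("\n", 1)[0])       # SEND
```

A node that is rebooted without ever sending UNSUBSCRIBE or DISCONNECT leaves its SUBSCRIBE registration behind on the ActiveMQ side, which is exactly the stale-subscriber situation described above.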
From activemq's perspective, it's just another subscriber on a different port on the same host, and the old one just doesn't have much to say. You can see this when observing a typical discovery in preparation for creating an app:

```
2013-12-03 09:20:38,005 [187:33985@61613] TRACE StompIO - Received:
SEND
content-type:text/plain; charset=UTF-8
destination:/topic/mcollective.discovery.agent
content-length:474
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044

2013-12-03 09:20:38,006 [.16.4.180:16781] TRACE StompIO - Sending:
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:11:-1:1:1
destination:/topic/mcollective.discovery.agent
timestamp:1386087638005
expires:0
content-length:474
priority:4
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044

2013-12-03 09:20:38,010 [.16.4.180:18245] TRACE StompIO - Sending:
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:11:-1:1:1
destination:/topic/mcollective.discovery.agent
timestamp:1386087638005
expires:0
content-length:474
priority:4
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044
```

The first frame is the broker (IP .187) SENDing a message to a topic, which is basically a broadcast to all subscribers. The next two frames are activemq passing that MESSAGE on to both subscribers:

- 10.16.4.180:16781, the port the first subscriber connected from
- 10.16.4.180:18245, the port mcollective subscribed from after the reboot
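The topic fan-out seen in those frames can be modelled with a toy in-memory broker (hypothetical names; a sketch of the semantics only, not ActiveMQ code). A topic delivers every message to every current subscription, so the stale pre-reboot subscription and the live post-reboot one both receive a copy:

```python
from collections import defaultdict

class ToyTopicBroker:
    """Topic semantics only: every subscription receives every message."""

    def __init__(self):
        self.topics = defaultdict(list)  # destination -> list of inboxes

    def subscribe(self, destination):
        inbox = []
        self.topics[destination].append(inbox)
        return inbox

    def send(self, destination, message):
        # Fan out to all subscriptions, whether or not anyone is reading them.
        for inbox in self.topics[destination]:
            inbox.append(message)

broker = ToyTopicBroker()
# Two subscriptions from the same node: pre-reboot (stale) and post-reboot (live).
stale = broker.subscribe("/topic/mcollective.discovery.agent")
live = broker.subscribe("/topic/mcollective.discovery.agent")

broker.send("/topic/mcollective.discovery.agent", "discovery request")
print(len(stale), len(live))  # 1 1; both get a copy, but only the live one answers
```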
Only one subscriber is actually listening, of course, so there is only one response:

```
2013-12-03 09:20:38,060 [180:18245@61613] TRACE StompIO - Received:
SEND
content-type:text/plain; charset=UTF-8
destination:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044
content-length:204

2013-12-03 09:20:38,066 [.16.4.187:33985] TRACE StompIO - Sending:
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:10:-1:1:2
destination:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044
timestamp:1386087638060
expires:0
content-length:204
priority:4
```

For broadcasts like this there is really no problem, since the node still gets the message and responds. This is why "oo-mco ping", "oo-mco facts", and the like continue working under these conditions. The problem comes with messages directed to specific nodes, which are routinely used once a node has been selected (e.g. for gear creation in this case). MCollective/ActiveMQ handles these differently:

```
2013-12-03 09:20:42,974 [187:33985@61613] TRACE StompIO - Received:
SEND
content-type:text/plain; charset=UTF-8
mc_identity:node1.hosts.ose20b.example.com
destination:/queue/mcollective.nodes
content-length:571
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044

2013-12-03 09:20:42,976 [.16.4.180:16781] TRACE StompIO - Sending:
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:11:-1:1:2
mc_identity:node1.hosts.ose20b.example.com
destination:/queue/mcollective.nodes
timestamp:1386087642974
expires:0
content-length:571
priority:4
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044
```

You can tell this is a directed message because it includes the "mc_identity" field scoping it to a particular node, and the destination is a queue rather than a topic. Rather than being broadcast, each queue entry can only be consumed by a single subscriber, selected by that field.
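The directed-message case can be modelled with a toy in-memory queue (hypothetical names; a sketch of the observed behaviour, not ActiveMQ's actual dispatch logic). Each entry is handed to exactly one subscriber with a matching identity, and when the stale pre-reboot subscription is matched first, the request never reaches the running mcollective:

```python
class ToyDirectedQueue:
    """Queue semantics: exactly one matching subscriber consumes each message."""

    def __init__(self):
        self.subscribers = []  # (identity, inbox, alive), in subscription order

    def subscribe(self, identity, alive=True):
        inbox = []
        self.subscribers.append((identity, inbox, alive))
        return inbox

    def send_directed(self, mc_identity, message):
        # First matching subscription wins, whether or not anyone is reading it.
        for identity, inbox, alive in self.subscribers:
            if identity == mc_identity:
                inbox.append(message)
                return alive  # did the message reach a live consumer?
        return False

q = ToyDirectedQueue()
q.subscribe("node1.hosts.ose20b.example.com", alive=False)  # stale, pre-reboot
q.subscribe("node1.hosts.ose20b.example.com", alive=True)   # live, post-reboot
reached = q.send_directed("node1.hosts.ose20b.example.com", "create gear")
print(reached)  # False: the stale subscription consumed it, so no response comes back
```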
However, note that the hostname is used as the node's server identity, so there are now *two* subscribers with the specified identity. As it happens, only the first one, the one that is no longer actually listening (port 16781), is selected to receive the message (probably because it subscribed first, or because its port is lower). So the node never receives these messages, and there is no response. The broker gives up after 10 seconds and retries from scratch; if there are other nodes, it will likely pick one of them eventually, otherwise the retries continue until the broker gives up entirely. Similar effects will be observed with almost any interaction with existing apps, since those are directed messages to specific nodes.

The question is what to do about this. Arguably activemq should detect that the subscriber is not really listening (at least after the first non-response) and unsubscribe it. Or perhaps it can be configured to automatically remove a subscription when another shows up with the same identity. There may even be an upstream bugfix for this; I'm not sure yet what the fix will be.

http://projects.puppetlabs.com/issues/23365 indicates that there is a solution in MCollective 2.3.x. I created https://trello.com/c/5fuxuI7t to update MCollective, but I honestly don't know when that's likely to happen. For now, the only workaround I know of is to restart ActiveMQ after any node is rebooted.

*** Bug 906885 has been marked as a duplicate of this bug. ***

Is there a workaround other than manually restarting mcollective on the node host after each reboot?

(In reply to Chester Knapp from comment #14)
> Is there a workaround other than manually restarting mcollective on the node
> host after each reboot?

I'm using this workaround to automatically restart mcollective on the node machine after each boot:

```
[root@nodeX ~]# echo "/etc/init.d/mcollective restart" >> /etc/rc.d/rc.local
```

No longer seeing this issue with the above workaround in place.
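The broker-side symptom described above (a directed request, a roughly 10-second timeout, then a retry from scratch) can be sketched as follows. This is a guess at the shape of the retry loop, not the broker's actual code; only the timeout figure comes from this report:

```python
def directed_rpc(send_to_node, candidate_nodes, max_rounds=3):
    """Try candidate nodes until one answers; give up after max_rounds passes.

    send_to_node(identity) stands in for a directed mcollective request and
    returns True only if a live consumer answered within the timeout.
    """
    for _ in range(max_rounds):
        for identity in candidate_nodes:
            if send_to_node(identity):
                return identity
        # the real broker waits out the ~10s timeout before retrying
    return None  # all retries exhausted: the operation fails

# A node whose only subscription is stale never answers, so the broker
# eventually settles on another candidate (if one exists).
answers = {"node1.example.com": False, "node2.example.com": True}
picked = directed_rpc(lambda n: answers[n], list(answers))
print(picked)  # node2.example.com
```

With a single node in the stale-subscription state, every round fails and the overall operation (e.g. gear creation) errors out, matching what users observe.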
Chester, I'm not sure you're reproducing the problem exactly as Luke described in Comment #7, though. The only solution we have today is to restart ActiveMQ when this problem happens. We are testing a new version of ActiveMQ and will test this scenario to see if the issue can be solved there as well. We may have to release an admin script to work around the problem, because (in my mind) being forced to restart ActiveMQ is a really bad option.

This bug targets OSE 2.1, which currently uses MCollective 2.4.1; that version should have the fix for this bug, per comment 8. I cannot reproduce the bug on an OSE 2.1.2 installation. Also, with 2.1 we are no longer using direct addressing, which was the only case that manifested the problem. So either way, it should be gone.

Verified this bug with ruby193-mcollective-2.4.1-5.el6op.noarch: PASS. After rebooting the node, app creation succeeded.

No particular work addressed this bug; the structural change between OSE 2.0 and 2.1 is presumably what removed it.