Red Hat Bugzilla – Bug 1028382
Broker cannot send requests to a node after the node is rebooted until the activemq service is restarted.
Last modified: 2017-03-08 12:38 EST
Description of problem:
After rebooting the node, oo-mco ping can still find the node and oo-accept-broker passes, but creating an app fails. From the logs, it looks like the broker cannot find an available node unless the activemq service is restarted.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up an environment with two machines: one broker and one node, with activemq running on the broker.
2. Create an app successfully.
3. Reboot the node.
4. Run oo-accept-broker and oo-accept-node; both report everything is working well, and "oo-mco ping" lists the node entry.
5. Create one more app.
Actual results:
Creating a new app fails:
$ rhc app create phpapp php
Using php-5.3 (PHP 5.3) for 'php'
Gear Size: default
Creating application 'phpapp' ...
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server
Seen from the node's mcollective log, it appears the node never receives the request from the broker.
The node's mcollective log has been attached.
The broker log shows the following error:
2013-11-08 02:06:09.405 [ERROR] The server vm-184-59-4-10.ose.phx2.redhat.com that your application is running on failed to respond in time. This may be due to a system restart. (pid:6750)
Expected results:
The app should be created successfully after the node is rebooted, especially since oo-accept-broker reports everything is working well.
After the above issue happened, I restarted the activemq service and the mcollective service, and then the app could be created.
I believe this is being addressed by Bug #1009887. Please retest with the latest RC.
Retested this bug against the 2.0/2013-11-22.1 puddle; it still fails.
Actually, this issue is different from Bug #1009887.
In Bug #1009887, the activemq service (possibly hosted on the broker) is restarted.
In this bug, I am rebooting the node, NOT the activemq host.
In addition, Bug #1009887 should only apply to ose-1.2; ose-2.0 does NOT have that problem, because even when I did not add "registerinterval = 30" for mcollective and restarted the activemq service, the mcollective service reconnected to activemq automatically.
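For context, that setting lives in the node's mcollective server configuration; a minimal sketch, assuming the default path:

# /etc/mcollective/server.cfg
registerinterval = 30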
Thank you for clarifying. I'm seeing this problem too. I suspect it has something to do with the changes to topics/queues that we put in the config. Given that restarting activemq is a bit disruptive and this is probably hard to troubleshoot, I think we need to figure this out ASAP.
Just following up with what I've figured out so far. I think I understand at least what is happening, if not yet what to do about it.
When the mcollective server on a node connects to activemq, it subscribes itself to topics and queues so that the broker can send it requests. These work by the broker SENDing a message to a topic or queue, and activemq then MESSAGEing that on to the appropriate subscribers. Nodes respond with the same mechanism in reverse, SENDing to a queue that the broker is subscribed to.
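To make that concrete, here is a simplified sketch of the frames involved (the destination names follow the MCollective activemq connector's usual conventions and are illustrative, not taken from this deployment's logs):

# node's mcollective server, at startup:
SUBSCRIBE
destination:/topic/mcollective.discovery.agent

# broker, broadcasting a discovery request:
SEND
destination:/topic/mcollective.discovery.agent

<request body>

# activemq relays the SEND to every subscriber as a MESSAGE frame;
# the node replies by SENDing to a reply queue the broker subscribed to.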
I used the "Debugging" section at http://activemq.apache.org/stomp.html to configure logging on my message host so I could watch this in action. The following logs are from that.
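For reference, a minimal sketch of that configuration, assuming default config paths; the connector attribute and logger name are as I recall them from the linked page, so verify against your ActiveMQ version:

# in activemq.xml, add trace=true to the STOMP transport connector:
<transportConnector name="stomp" uri="stomp://0.0.0.0:61613?trace=true"/>

# in log4j.properties, log the traced frames:
log4j.logger.org.apache.activemq.transport.stomp=TRACE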
Normally, when the node or broker disconnects from activemq, it first unsubscribes. However, when a node is simply rebooted, this doesn't appear to happen - evidently the mcollective server doesn't get a chance to disconnect properly. When it returns, it subscribes again; but as far as activemq knows, the first subscriber is still listening and expecting messages. From activemq's perspective, it's just another subscriber on a different port on the same host, and just doesn't have much to say. You can see this when observing a typical discovery in preparation for creating an app:
2013-12-03 09:20:38,005 [187:33985@61613] TRACE StompIO - Received:
2013-12-03 09:20:38,006 [.16.4.180:16781] TRACE StompIO - Sending:
2013-12-03 09:20:38,010 [.16.4.180:18245] TRACE StompIO - Sending:
The first frame is the broker (IP .187) SENDing a message to a topic (which is basically a broadcast to all subscribers). The next two frames are activemq passing that MESSAGE on to both subscribers:
- 10.16.4.180:16781 which is the port where the first subscriber connected from
- 10.16.4.180:18245 which is the port mcollective subscribed from after a reboot.
Only one subscriber is actually listening, of course, so there is only one response:
2013-12-03 09:20:38,060 [180:18245@61613] TRACE StompIO - Received:
2013-12-03 09:20:38,066 [.16.4.187:33985] TRACE StompIO - Sending:
For broadcasts like this, there's really no problem, since the node still gets the message and responds. This is why "oo-mco ping" and "oo-mco facts" and such continue working under these conditions. The problem comes with messages directed to specific nodes, which are routinely used once a node has been selected (e.g. for gear creation in this case). Mcollective/ActiveMQ handles these differently.
2013-12-03 09:20:42,974 [187:33985@61613] TRACE StompIO - Received:
2013-12-03 09:20:42,976 [.16.4.180:16781] TRACE StompIO - Sending:
You can tell this is a directed message because it includes the "mc_identity" field scoping it to a particular node, and the destination is a queue rather than a topic. Rather than being broadcast, each queue entry can be consumed by only a single subscriber, selected by that field. However, note that the hostname is used as the node server_identity, so there are now *two* subscribers matching the identity specified. As it happens, only the first one - the one that's no longer actually listening (port 16781) - is selected to receive the message (probably because it connected first, or because its port is lower). So the node never receives these messages, and there is no response. The broker gives up after 10 seconds and retries from scratch; if there are other nodes, it will likely pick one of them eventually, otherwise it keeps retrying until the operation finally fails.
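To illustrate, a sketch of the directed path (again using the MCollective activemq connector's typical destination and selector names, which are assumptions here, with a hypothetical node hostname):

# both the stale and the fresh mcollective connection hold a
# subscription like this:
SUBSCRIBE
destination:/queue/mcollective.nodes
selector:mc_identity='node1.example.com'

# the broker's directed request:
SEND
destination:/queue/mcollective.nodes

<request scoped to mc_identity=node1.example.com>

# activemq hands each queue entry to exactly ONE matching consumer;
# if it picks the stale connection, the node never sees the request.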
Similar effects will be observed with almost any interaction with existing apps, since they will be directed messages to specific nodes.
The question is what to do about this. Arguably activemq should detect that the subscriber is no longer listening (at least after the first non-response) and unsubscribe it. Or perhaps it can be configured to automatically remove a subscription when another shows up with the same identity. There may even be an upstream bugfix for this; I'm not sure yet what the fix will be.
http://projects.puppetlabs.com/issues/23365 indicates that there is a solution in mcollective 2.3.x.
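If the 2.3.x change referred to is STOMP 1.1 heartbeat support (an assumption on my part; I have not confirmed it against that issue), it would let activemq notice the dead pre-reboot connection and drop its stale subscription. Enabling it would look roughly like this in the node's server.cfg:

# /etc/mcollective/server.cfg -- requires mcollective >= 2.3 and an
# activemq version with STOMP 1.1 support (assumption):
plugin.activemq.heartbeat_interval = 30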
I created https://trello.com/c/5fuxuI7t to update MCollective, but honestly don't know when that's likely to happen.
For now, the only workaround I know of is to restart ActiveMQ after any node is rebooted.
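For the record, that workaround amounts to (service names as on a default OSE install):

# on the activemq host, after any node reboot:
service activemq restart
# if a node still doesn't respond, also restart mcollective there:
service mcollective restart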
*** Bug 906885 has been marked as a duplicate of this bug. ***
Is there a workaround other than manually restarting mcollective on the node host after each reboot?
(In reply to Chester Knapp from comment #14)
> Is there a workaround other than manually restarting mcollective on the node
> host after each reboot?
I'm using this workaround to automatically restart mcollective on the node machine after each boot:
[root@nodeX ~]# echo "/etc/init.d/mcollective restart" >> /etc/rc.d/rc.local
No longer seeing this issue with the above workaround in place.
Chester, I'm not sure you're reproducing the problem exactly as Luke mentioned in Comment #7 though. The only solution we have today is to restart ActiveMQ when this problem happens.
We are testing a new version of ActiveMQ and we'll test this scenario to see if the issue can be solved there as well. We may have to release an admin script to work around the problem because (in my mind) being forced to restart ActiveMQ is a really bad option.
This bug targets OSE 2.1, which currently uses MCollective 2.4.1, which should have the fix for this bug, per comment 8. I cannot reproduce the bug on an OSE 2.1.2 installation.
Also with 2.1 we are no longer using direct addressing, which was the only case that manifested the problem. So either way, it should be gone...
Verified this bug with ruby193-mcollective-2.4.1-5.el6op.noarch; PASS.
After rebooting the node, app creation succeeds.
No particular work addressed this bug; the structural change between OSE 2.0 and 2.1 is presumably what removed it.