Bug 1028382

Summary: Broker cannot send requests to a node after the node is rebooted until the activemq service is restarted.
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Containers
Assignee: Luke Meyer <lmeyer>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: low
Docs Contact:
Version: 2.0.0
CC: bleanhar, charles_sheridan, cknapp, jialiu, libra-onpremise-devel, lmeyer, mmasters, pep, pruan, wboessen
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1038806
Environment:
Last Closed: 2014-07-07 12:56:16 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1038806    

Description Johnny Liu 2013-11-08 10:35:37 UTC
Description of problem:
After rebooting the node, "oo-mco ping" can still find the node and oo-accept-broker passes, but creating an app fails. Judging from the logs, it looks like the broker cannot find an available node unless the activemq service is restarted.

Version-Release number of selected component (if applicable):
2.0/2013-11-05.1

How reproducible:
Always

Steps to Reproduce:
1. Set up an environment with two machines: one broker and one node. activemq runs on the broker.
2. Create an app successfully.
3. Reboot the node.
4. Run oo-accept-broker and oo-accept-node; both report that everything is working, and "oo-mco ping" lists the node entry.
5. Create one more app.

Actual results:
Creating a new app fails.
$ rhc app create phpapp php
Using php-5.3 (PHP 5.3) for 'php'

Application Options
-------------------
  Domain:     jialiu
  Cartridges: php-5.3
  Gear Size:  default
  Scaling:    no

Creating application 'phpapp' ... 
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server
'https://broker.ose.phx2.redhat.com/broker/rest/domain/jialiu/applications'.

Judging from the node's mcollective log, it seems the node never receives the request from the broker.
The node's mcollective log is attached.

The broker log shows the following error:
2013-11-08 02:06:09.405 [ERROR] The server vm-184-59-4-10.ose.phx2.redhat.com that your application is running on failed to respond in time.  This may be due to a system restart. (pid:6750)


Expected results:
The app should be created successfully after the node is rebooted, especially since oo-accept-broker reports that everything is working.

Additional info:
After the above issue happened, I restarted the activemq and mcollective services, and then apps could be created again.

Comment 4 Brenton Leanhardt 2013-11-23 14:23:44 UTC
I believe this is being addressed by Bug #1009887.  Please retest with the latest RC.

Comment 5 Johnny Liu 2013-11-25 12:08:34 UTC
Retested this bug against the 2.0/2013-11-22.1 puddle; it still fails.

Actually this issue is different from Bug #1009887.
In Bug #1009887, activemq (perhaps hosted on the broker) is rebooted.
In this bug, I am rebooting the node, NOT the activemq host.

In addition, Bug #1009887 should only apply to ose-1.2; ose-2.0 does NOT have that problem. Even without adding "registerinterval = 30" for mcollective, the mcollective service reconnects to activemq automatically after the activemq service is restarted.

Comment 6 Luke Meyer 2013-11-27 22:02:06 UTC
Thank you for clarifying. I'm seeing this problem too. I suspect it has something to do with the changes to topics/queues that we put in config. Given restarting activemq is a bit disruptive and this is probably hard to troubleshoot, I think we need to figure this out ASAP.

Comment 7 Luke Meyer 2013-12-03 19:50:09 UTC
Just following up with what I've figured out so far. I think I understand at least what is happening, if not yet what to do about it.

When the mcollective server on a node connects to activemq, it subscribes itself to topics and queues so that the broker can send it requests. These work by the broker SENDing a message to a topic or queue, and activemq then MESSAGEing that on to the appropriate subscribers. Nodes respond with the same mechanism in reverse, SENDing to a queue that the broker is subscribed to.
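
As an illustration of that request/reply mechanism, here is a minimal Python sketch using the stomp.py library (the real participants are the Ruby mcollective daemon and the broker's Ruby client, not this script; the host, credentials and reply-queue name below are placeholders):

import stomp

REPLY_QUEUE = "/queue/mcollective.reply.broker.example.com_6044"  # placeholder reply queue

class PrintReply(stomp.ConnectionListener):
    # stomp.py 8.x passes a single frame object; older releases use (headers, body)
    def on_message(self, frame):
        print("reply received:", frame.body)

# Broker side: subscribe to its private reply queue, then SEND a request to the
# discovery topic with a reply-to header. activemq relays the SEND as a MESSAGE
# to every subscriber of the topic (i.e. every node's mcollective server).
broker = stomp.Connection([("msg.example.com", 61613)])
broker.set_listener("", PrintReply())
broker.connect("mcollective", "secret", wait=True)
broker.subscribe(destination=REPLY_QUEUE, id=1, ack="auto")
broker.send(destination="/topic/mcollective.discovery.agent",
            body="discovery request",
            headers={"reply-to": REPLY_QUEUE})
# A node that received the MESSAGE answers the same way in reverse:
#   node.send(destination=REPLY_QUEUE, body="discovery reply")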

I used the "Debugging" section at http://activemq.apache.org/stomp.html to configure logging on my message host so I could watch this in action. The following logs are from that.
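
For reference, the configuration that produces these StompIO TRACE lines is roughly the following (file paths and the exact connector URI are from memory of a stock activemq install, so treat them as assumptions rather than OSE defaults):

# /etc/activemq/activemq.xml - append trace=true to the existing STOMP transport connector
<transportConnector name="stomp" uri="stomp://0.0.0.0:61613?trace=true"/>

# /etc/activemq/log4j.properties - log each STOMP frame the broker sends/receives
log4j.logger.org.apache.activemq.transport.stomp=TRACE

After restarting activemq, every SEND and MESSAGE frame shows up in the activemq log, as in the excerpts below.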

Normally, when the node or broker disconnects from activemq, it first unsubscribes. However, when a node is simply rebooted, this doesn't appear to happen - evidently the mcollective server doesn't get a chance to disconnect properly. When it returns, it subscribes again; but as far as activemq knows, the first subscriber is still listening and expecting messages. From activemq's perspective, it's just another subscriber on a different port on the same host, and just doesn't have much to say. You can see this when observing a typical discovery in preparation for creating an app:


2013-12-03 09:20:38,005 [187:33985@61613] TRACE StompIO                        - Received: 
SEND
content-type:text/plain; charset=UTF-8
destination:/topic/mcollective.discovery.agent
content-length:474
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044

2013-12-03 09:20:38,006 [.16.4.180:16781] TRACE StompIO                        - Sending: 
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:11:-1:1:1
destination:/topic/mcollective.discovery.agent
timestamp:1386087638005
expires:0
content-length:474
priority:4
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044

2013-12-03 09:20:38,010 [.16.4.180:18245] TRACE StompIO                        - Sending: 
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:11:-1:1:1
destination:/topic/mcollective.discovery.agent
timestamp:1386087638005
expires:0
content-length:474
priority:4
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044


The first frame is the broker (IP .187) SENDing a message to a topic (which is basically a broadcast to all subscribers). The next two frames are activemq passing that MESSAGE to both subscribers 
- 10.16.4.180:16781 which is the port where the first subscriber connected from
- 10.16.4.180:18245 which is the port mcollective subscribed from after a reboot.
Only one subscriber is actually listening, of course, so there is only one response:


2013-12-03 09:20:38,060 [180:18245@61613] TRACE StompIO                        - Received: 
SEND
content-type:text/plain; charset=UTF-8
destination:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044
content-length:204

2013-12-03 09:20:38,066 [.16.4.187:33985] TRACE StompIO                        - Sending: 
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:10:-1:1:2
destination:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044
timestamp:1386087638060
expires:0
content-length:204
priority:4


For broadcasts like this, there's really no problem, since the node still gets the message and responds. This is why "oo-mco ping" and "oo-mco facts" and such continue working under these conditions. The problem comes with messages directed to specific nodes, which are routinely used once a node has been selected (e.g. for gear creation in this case). Mcollective/ActiveMQ handles these differently.


2013-12-03 09:20:42,974 [187:33985@61613] TRACE StompIO                        - Received: 
SEND
content-type:text/plain; charset=UTF-8
mc_identity:node1.hosts.ose20b.example.com
destination:/queue/mcollective.nodes
content-length:571
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044

2013-12-03 09:20:42,976 [.16.4.180:16781] TRACE StompIO                        - Sending: 
MESSAGE
content-type:text/plain; charset=UTF-8
message-id:ID:msg.hosts.ose20b.example.com-40814-1386087517634-3:11:-1:1:2
mc_identity:node1.hosts.ose20b.example.com
destination:/queue/mcollective.nodes
timestamp:1386087642974
expires:0
content-length:571
priority:4
reply-to:/queue/mcollective.reply.broker.hosts.ose20b.example.com_6044


You can tell this is a directed message because it includes the "mc_identity" field scoping it to a particular node, and the destination is a queue rather than a topic. Rather than being broadcast, each queue entry can only be consumed by a single subscriber, specified by that field. However, note that the hostname is used as the node's server identity, and so there are now *two* subscribers with the specified identity. As it happens, only the first one - the one that is no longer actually listening (port 16781) - is selected to receive the message (probably because it connected first, or because its port is lower). So the node never receives these messages, and there is no response. The broker times out after 10 seconds and retries from scratch; if there are other nodes, it will likely pick one of them eventually, otherwise it keeps retrying until it finally gives up.
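
To see that queue behavior in isolation (independent of MCollective), here is a rough stomp.py sketch with two subscriptions on the same queue and a single directed SEND; host and credentials are placeholders. Each queue message is dispatched to exactly one subscriber, so only one of the two prints it - in the failing case that "winner" is the stale pre-reboot subscription, which never reads it:

import time
import stomp

class Tag(stomp.ConnectionListener):
    # stomp.py 8.x signature; older releases use on_message(self, headers, body)
    def __init__(self, name):
        self.name = name
    def on_message(self, frame):
        print(self.name, "received:", frame.body)

def subscriber(name):
    c = stomp.Connection([("msg.example.com", 61613)])
    c.set_listener("", Tag(name))
    c.connect("mcollective", "secret", wait=True)
    c.subscribe(destination="/queue/mcollective.nodes", id=1, ack="auto")
    return c

old = subscriber("stale (pre-reboot)")   # stands in for the subscription left behind by the reboot
new = subscriber("fresh (post-reboot)")  # the node's new subscription after it comes back up

broker = stomp.Connection([("msg.example.com", 61613)])
broker.connect("mcollective", "secret", wait=True)
# A directed request: note the mc_identity header, exactly as in the frames above.
broker.send(destination="/queue/mcollective.nodes",
            body="directed request",
            headers={"mc_identity": "node1.hosts.ose20b.example.com"})

time.sleep(2)   # only ONE of the two subscribers prints the message
for c in (old, new, broker):
    c.disconnect()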

Similar effects will be observed with almost any interaction with existing apps, since they will be directed messages to specific nodes.

The question is what to do about this. Arguably activemq should detect that the subscriber is not really listening (at least after the first non-response) and unsubscribe it. Or perhaps it can be configured to automatically remove subscriptions when another shows up with the same identity. There may even be an upstream bugfix for this. I'm not sure yet what it will be.

Comment 8 Luke Meyer 2013-12-05 20:52:44 UTC
http://projects.puppetlabs.com/issues/23365 indicates that there is a solution in mcollective 2.3.x.
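
My assumption is that the 2.3.x fix referred to there is the STOMP 1.1 heartbeat support in the activemq connector (the linked issue is the authority, not this sketch). If so, enabling it on broker and nodes would look roughly like this, letting activemq notice that the pre-reboot connection is dead and drop its subscriptions:

# client.cfg / server.cfg (OSE keeps these under /opt/rh/ruby193/root/etc/mcollective/ - from memory)
# Requires an mcollective and rubygem-stomp new enough to support STOMP 1.1 heartbeats.
plugin.activemq.heartbeat_interval = 30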

Comment 9 Luke Meyer 2013-12-05 21:41:05 UTC
I created https://trello.com/c/5fuxuI7t to update MCollective, but honestly don't know when that's likely to happen.

For now, the only workaround I know of is to restart ActiveMQ after any node is rebooted.

Comment 13 Brenton Leanhardt 2014-02-04 14:39:31 UTC
*** Bug 906885 has been marked as a duplicate of this bug. ***

Comment 14 Chester Knapp 2014-02-05 21:56:42 UTC
Is there a workaround other than manually restarting mcollective on the node host after each reboot?

Comment 15 Chester Knapp 2014-02-06 14:28:04 UTC
(In reply to Chester Knapp from comment #14)
> Is there a workaround other than manually restarting mcollective on the node
> host after each reboot?

I'm using this workaround to automatically restart mcollective on the node machine after each boot:

[root@nodeX ~]# echo "/etc/init.d/mcollective restart" >> /etc/rc.d/rc.local

No longer seeing this issue with the above workaround in place.

Comment 16 Brenton Leanhardt 2014-02-06 14:44:32 UTC
Chester, I'm not sure you're reproducing the problem exactly as Luke mentioned in Comment #7 though.  The only solution we have today is to restart ActiveMQ when this problem happens.

We are testing a new version of ActiveMQ and we'll test this scenario to see if perhaps the issue can be solved there as well. We may have to release an admin script to work around the problem because (in my mind) being forced to restart ActiveMQ is a really bad option.

Comment 17 Miciah Dashiel Butler Masters 2014-07-03 18:08:41 UTC
This bug targets OSE 2.1, which currently uses MCollective 2.4.1, which should have the fix for this bug, per comment 8.  I cannot reproduce the bug on an OSE 2.1.2 installation.

Comment 18 Luke Meyer 2014-07-03 18:35:24 UTC
Also with 2.1 we are no longer using direct addressing, which was the only case that manifested the problem. So either way, it should be gone...

Comment 19 Johnny Liu 2014-07-04 08:41:02 UTC
Verified this bug with ruby193-mcollective-2.4.1-5.el6op.noarch, and it passes.

After rebooting the node, app creation succeeds.

Comment 20 Luke Meyer 2014-07-07 12:56:16 UTC
No particular work addressed this bug; the structural change between OSE 2.0 and 2.1 is presumably what removed it.