Description of problem:
If an ActiveMQ server fails during operation, the Broker will not attempt to connect to another ActiveMQ cluster member.

Version-Release number of selected component (if applicable):
2.2.6

How reproducible:
Always

Steps to Reproduce:
1. Set up MCollective to use ActiveMQ.
2. Stop the ActiveMQ and wrapper processes on one of the members with "kill -STOP" (SIGSTOP leaves the sockets open but unresponsive).
3. Run "oo-mco ping".

Actual results:
The command hangs and never completes:

# oo-mco ping
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Facts::Yaml_facts from mcollective/facts/yaml_facts.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin facts_plugin with class MCollective::Facts::Yaml_facts single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Connector::Activemq from mcollective/connector/activemq.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin connector_plugin with class MCollective::Connector::Activemq single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Security::Psk from mcollective/security/psk.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin security_plugin with class MCollective::Security::Psk single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Registration::Agentlist from mcollective/registration/agentlist.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin registration_plugin with class MCollective::Registration::Agentlist single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:47:in `<<' Registering plugin global_stats with class MCollective::RunnerStats single_instance: true
info 2016/04/11 10:21:38: config.rb:151:in `loadconfig' The Marionette Collective version 2.4.1 started by /opt/rh/ruby193/root/usr/sbin/mco using config file /opt/rh/ruby193/root/etc/mcollective/client.cfg
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading MCollective::Application::Ping from mcollective/application/ping.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin ping_application with class MCollective::Application::Ping single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin ping_application with class MCollective::Application::Ping
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin connector_plugin with class MCollective::Connector::Activemq
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin security_plugin with class MCollective::Security::Psk
debug 2016/04/11 10:21:38: pluginmanager.rb:83:in `[]' Returning cached plugin global_stats with class MCollective::RunnerStats
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq01.demo.lan:61613 to the connection pool
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq02.demo.lan:61613 to the connection pool
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq03.demo.lan:61613 to the connection pool
info 2016/04/11 10:21:38: activemq.rb:113:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective.lan:61613

Expected results:
The command finishes successfully.

Additional info:
It appears that ruby193-rubygem-stomp-1.2.14-1, which ships with OpenShift 2.2.6, has no timeout on the initial CONNECTED frame read. As a result, the Broker waits forever for a CONNECTED frame from the non-responsive ActiveMQ server that went down. The connection parameter :connread_timeout, added in stomp 1.2.16, lets users set a timeout during CONNECT for the read of the CONNECTED/ERROR frame.
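A minimal sketch of how that parameter would be used at the stomp gem level (assuming the gem's hashed-parameter constructor in stomp >= 1.2.16; the host names follow the pool shown in the log above):

require 'stomp'

# :connread_timeout (added in stomp 1.2.16) bounds the wait for the
# CONNECTED/ERROR frame after CONNECT is sent; stomp 1.2.14 has no
# such bound, which is why the client above blocks forever.
params = {
  :hosts => [
    {:host => 'amq01.demo.lan', :port => 61613},
    {:host => 'amq02.demo.lan', :port => 61613},
    {:host => 'amq03.demo.lan', :port => 61613},
  ],
  :connread_timeout => 30,  # seconds to wait for CONNECTED before giving up
}
conn = Stomp::Connection.new(params)

With the timeout in place, a hung broker aborts the connect attempt instead of wedging the client, allowing failover to the next pool member.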
Handing off to Tim for triage.
Temporarily moving this to MODIFIED to add it to the 2.2.11 errata. Investigation is still ongoing.
Going to close this bug as WONTFIX. This will be documented as a Known Issue in the RHOSE 2.2 asynchronous release notes. In our testing, we have found that this issue is extremely rare and cannot be avoided with changes to OpenShift code. Currently, the recommendation is to ensure that ActiveMQ servers have adequate available memory, with plenty of swap space in case of a spike in usage; OOM conditions on the ActiveMQ server are the most likely cause of this issue. Additionally, it can help to configure stomp to randomly select a server from the pool rather than always starting with the first in the list. In /opt/rh/ruby193/root/etc/mcollective/server.cfg on nodes and in /opt/rh/ruby193/root/etc/mcollective/client.cfg on brokers, set:

plugin.stomp.pool.randomize = true
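For context, a minimal sketch of what that setting maps to at the stomp gem level (an illustration assuming the gem's documented :randomize hash parameter; host names follow the log above):

require 'stomp'

# :randomize => true shuffles the host list, so clients do not all
# start with (and potentially hang on) the first pool member.
params = {
  :hosts => [
    {:host => 'amq01.demo.lan', :port => 61613},
    {:host => 'amq02.demo.lan', :port => 61613},
    {:host => 'amq03.demo.lan', :port => 61613},
  ],
  :randomize => true,
}
conn = Stomp::Connection.new(params)

This spreads initial connections across the cluster, so a single unresponsive member affects only a fraction of clients rather than all of them.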