Bug 1331398

Summary: Broker does not handle mcollective server failover correctly
Product: OpenShift Container Platform
Reporter: Evgheni Dereveanchin <ederevea>
Component: Unknown
Assignee: Rory Thrasher <rthrashe>
Status: CLOSED WONTFIX
QA Contact: Johnny Liu <jialiu>
Severity: urgent
Docs Contact:
Priority: high
Version: 2.2.0
CC: aos-bugs, bleanhar, erich, fcami, jkaur, jokerman, mmccomas, rthrashe, saime, trogers
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1401118 (view as bug list)
Environment:
Last Closed: 2018-08-13 15:40:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1401118

Description Evgheni Dereveanchin 2016-04-28 13:01:38 UTC
Description of problem:
If an ActiveMQ server fails during operation, the Broker does not attempt to connect to another ActiveMQ cluster member.

Version-Release number of selected component (if applicable):
2.2.6

How reproducible:
Always

Steps to Reproduce:
1. Set up mcollective to use an ActiveMQ cluster
2. Suspend ActiveMQ and its wrapper process with "kill -STOP" on one of the cluster members
3. Run "oo-mco ping"

Actual results:
The command hangs at the first connection attempt and never completes

# oo-mco ping
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Facts::Yaml_facts from mcollective/facts/yaml_facts.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin facts_plugin with class MCollective::Facts::Yaml_facts single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Connector::Activemq from mcollective/connector/activemq.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin connector_plugin with class MCollective::Connector::Activemq single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Security::Psk from mcollective/security/psk.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin security_plugin with class MCollective::Security::Psk single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Registration::Agentlist from mcollective/registration/agentlist.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin registration_plugin with class MCollective::Registration::Agentlist single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:47:in `<<' Registering plugin global_stats with class MCollective::RunnerStats single_instance: true
info 2016/04/11 10:21:38: config.rb:151:in `loadconfig' The Marionette Collective version 2.4.1 started by /opt/rh/ruby193/root/usr/sbin/mco using config file /opt/rh/ruby193/root/etc/mcollective/client.cfg
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading MCollective::Application::Ping from mcollective/application/ping.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin ping_application with class MCollective::Application::Ping single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin ping_application with class MCollective::Application::Ping
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin connector_plugin with class MCollective::Connector::Activemq
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin security_plugin with class MCollective::Security::Psk
debug 2016/04/11 10:21:38: pluginmanager.rb:83:in `[]' Returning cached plugin global_stats with class MCollective::RunnerStats
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq01.demo.lan:61613 to the connection pool
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq02.demo.lan:61613 to the connection pool
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq03.demo.lan:61613 to the connection pool
info 2016/04/11 10:21:38: activemq.rb:113:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective.lan:61613

Expected results:
The command fails over to another ActiveMQ cluster member and finishes successfully

Additional info:

It appears that ruby193-rubygem-stomp-1.2.14-1, which ships with OpenShift 2.2.6, has no timeout on the initial read of the CONNECTED frame.
As a result, the Broker waits forever for a CONNECTED frame from the unresponsive AMQ server that went down.

The connection parameter :connread_timeout was added in version 1.2.16 of the stomp gem. It lets users set a timeout on the read of the CONNECTED/ERROR frame during CONNECT.
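
The behaviour described above can be illustrated with a minimal Ruby sketch that uses only the standard library. It is not the stomp gem's implementation. The helper name connect_with_failover, the 0.5 second timeout and the two local test servers are assumptions made for illustration. The "frozen" server listens but never accepts or answers, which approximates a broker stopped with "kill -STOP": the TCP connection succeeds, but no CONNECTED frame ever arrives. Bounding the read lets the client move on to the next pool member.

```ruby
require "socket"

# Hypothetical helper (not part of the stomp gem): try each [host, port]
# in order and give up on a broker that does not answer the STOMP CONNECT
# frame within read_timeout seconds.
def connect_with_failover(hosts, read_timeout)
  hosts.each do |host, port|
    begin
      sock = TCPSocket.new(host, port)
      sock.write("CONNECT\naccept-version:1.0\n\n\0")
      # IO.select returns nil if nothing becomes readable before the timeout.
      if IO.select([sock], nil, nil, read_timeout)
        frame = sock.readpartial(1024)
        return [host, port, sock] if frame.start_with?("CONNECTED")
      end
      sock.close
    rescue SystemCallError, EOFError
      sock.close if sock && !sock.closed?
    end
  end
  nil # no pool member answered in time
end

# Simulated unresponsive broker: listens, but never accepts or replies.
frozen = TCPServer.new("127.0.0.1", 0)

# Simulated healthy broker: answers the first client with a CONNECTED frame.
healthy = TCPServer.new("127.0.0.1", 0)
Thread.new do
  client = healthy.accept
  client.readpartial(1024)
  client.write("CONNECTED\nversion:1.0\n\n\0")
end

pool = [["127.0.0.1", frozen.addr[1]], ["127.0.0.1", healthy.addr[1]]]
result = connect_with_failover(pool, 0.5)
puts(result ? "connected to port #{result[1]}" : "no broker answered")
```

Without the IO.select guard, the read on the first socket would block indefinitely, which matches the hang seen after "TCP Connection attempt 0" in the log above.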

Comment 1 Brenton Leanhardt 2016-04-28 13:15:34 UTC
Handing off to Tim for triage.

Comment 10 Rory Thrasher 2016-09-09 21:30:49 UTC
Temporarily moving this to MODIFIED to add it to the 2.2.11 errata. Investigation is still ongoing.

Comment 19 Rory Thrasher 2016-12-15 23:06:07 UTC
Going to close this bug as a WONTFIX.  This will be documented as a Known Issue in the RHOSE 2.2 asynchronous release notes.

In our testing, we have found that this issue is extremely rare and cannot be avoided with changes to OpenShift code. Currently, the recommendation is to ensure that ActiveMQ servers have adequate available memory, with plenty of swap space in case of a spike in usage. OOM conditions on the ActiveMQ server are the most likely cause of this issue.

Additionally, it could help to configure stomp to randomly select a server from the pool, rather than always using the first in the list:

  In /opt/rh/ruby193/root/etc/mcollective/server.cfg on nodes and in
  /opt/rh/ruby193/root/etc/mcollective/client.cfg on brokers:
    plugin.stomp.pool.randomize = true
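
To illustrate why randomizing could help, here is a small Ruby sketch of the idea. It does not show how MCollective or the stomp gem implement the option. The function names and the simulated client count are assumptions, and the host names are taken from the debug log in the description. With a fixed order, every client tries the same pool member first, so one unresponsive server affects all of them. With a shuffled order, first attempts are spread across the pool.

```ruby
# Hypothetical pool, using the host names from the debug log in the description.
pool = ["amq01.demo.lan", "amq02.demo.lan", "amq03.demo.lan"]

# Return the order in which one client would try the pool members.
def attempt_order(pool, randomize, rng = Random.new)
  randomize ? pool.shuffle(random: rng) : pool.dup
end

# Count how often each pool member is tried first across many simulated clients.
def first_choice_counts(pool, randomize, clients, rng = Random.new)
  counts = Hash.new(0)
  clients.times { counts[attempt_order(pool, randomize, rng).first] += 1 }
  counts
end

puts "fixed order: #{first_choice_counts(pool, false, 300).inspect}"
puts "randomized:  #{first_choice_counts(pool, true, 300).inspect}"
```

Randomizing only reduces how many clients hit the unresponsive server first. A client that does pick it can still hang if the read of the CONNECTED frame has no timeout.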