Bug 1331398 - Broker does not handle mcollective server failover correctly
Summary: Broker does not handle mcollective server failover correctly
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Unknown
Version: 2.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Rory Thrasher
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1401118
 
Reported: 2016-04-28 13:01 UTC by Evgheni Dereveanchin
Modified: 2020-12-11 12:10 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1401118
Environment:
Last Closed: 2018-08-13 15:40:18 UTC
Target Upstream Version:
Embargoed:



Description Evgheni Dereveanchin 2016-04-28 13:01:38 UTC
Description of problem:
If an ActiveMQ server fails during operation, the Broker does not attempt to connect to another ActiveMQ cluster member.

Version-Release number of selected component (if applicable):
2.2.6

How reproducible:
Always

Steps to Reproduce:
1. Set up MCollective to use ActiveMQ.
2. Freeze the ActiveMQ and wrapper processes with "kill -STOP" on one of the cluster members (a sketch follows the list).
3. Run "oo-mco ping".
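A minimal Ruby sketch of step 2, assuming a pgrep pattern of "activemq" matches both the wrapper and the broker JVM (hypothetical; adjust to the actual process names). SIGSTOP merely freezes the processes, so the kernel keeps accepting TCP connections on the listening socket while the server never answers:

  # freeze_activemq.rb -- illustration only, not a supported tool
  pids = `pgrep -f activemq`.split.map(&:to_i)
  abort 'no activemq processes found' if pids.empty?
  pids.each { |pid| Process.kill('STOP', pid) }  # kill -CONT resumes them
  puts "froze pids: #{pids.join(', ')}"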

Actual results:
The command never completes; it hangs waiting on the initial connection:

# oo-mco ping
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Facts::Yaml_facts from mcollective/facts/yaml_facts.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin facts_plugin with class MCollective::Facts::Yaml_facts single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Connector::Activemq from mcollective/connector/activemq.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin connector_plugin with class MCollective::Connector::Activemq single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Security::Psk from mcollective/security/psk.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin security_plugin with class MCollective::Security::Psk single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading Mcollective::Registration::Agentlist from mcollective/registration/agentlist.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin registration_plugin with class MCollective::Registration::Agentlist single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:47:in `<<' Registering plugin global_stats with class MCollective::RunnerStats single_instance: true
info 2016/04/11 10:21:38: config.rb:151:in `loadconfig' The Marionette Collective version 2.4.1 started by /opt/rh/ruby193/root/usr/sbin/mco using config file /opt/rh/ruby193/root/etc/mcollective/client.cfg
debug 2016/04/11 10:21:38: pluginmanager.rb:167:in `loadclass' Loading MCollective::Application::Ping from mcollective/application/ping.rb
debug 2016/04/11 10:21:38: pluginmanager.rb:44:in `<<' Registering plugin ping_application with class MCollective::Application::Ping single_instance: true
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin ping_application with class MCollective::Application::Ping
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin connector_plugin with class MCollective::Connector::Activemq
debug 2016/04/11 10:21:38: pluginmanager.rb:80:in `[]' Returning new plugin security_plugin with class MCollective::Security::Psk
debug 2016/04/11 10:21:38: pluginmanager.rb:83:in `[]' Returning cached plugin global_stats with class MCollective::RunnerStats
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq01.demo.lan:61613 to the connection pool
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq02.demo.lan:61613 to the connection pool
debug 2016/04/11 10:21:38: activemq.rb:222:in `block in connect' Adding amq03.demo.lan:61613 to the connection pool
info 2016/04/11 10:21:38: activemq.rb:113:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective.lan:61613

Expected results:
The command finishes successfully, with the Broker failing over to another ActiveMQ cluster member.

Additional info:

It appears that in ruby193-rubygem-stomp-1.2.14-1, which ships with OpenShift 2.2.6, there is no timeout on the initial read of the CONNECTED frame.
So the Broker waits forever for a CONNECTED frame from the unresponsive ActiveMQ server that went down.
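To make the failure mode concrete, here is a rough stand-alone sketch (hypothetical host, hand-rolled STOMP frame; not the Broker's actual code). Against a SIGSTOPped server the kernel still completes the TCP handshake, so the connect succeeds and the subsequent blocking read never returns unless it is explicitly bounded:

  require 'socket'
  require 'timeout'

  sock = TCPSocket.new('amq01.demo.lan', 61613)    # handshake succeeds
  sock.write("CONNECT\naccept-version:1.0\n\n\0")  # ask for a CONNECTED frame

  begin
    # A bare sock.gets("\0") would block forever here; Timeout.timeout
    # emulates the bound that :connread_timeout later provides in stomp.
    frame = Timeout.timeout(5) { sock.gets("\0") }
    puts frame
  rescue Timeout::Error
    warn 'no CONNECTED/ERROR frame within 5s -- try the next pool member'
  ensure
    sock.close
  end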

The connection parameter :connread_timeout was added in stomp 1.2.16; it lets users set a timeout during CONNECT for the read of the CONNECTED/ERROR frame.
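For illustration, a minimal sketch of passing the newer parameter to the stomp gem (assumes gem version 1.2.16 or later; pool host names reused from the log above, timeout values arbitrary):

  require 'stomp'

  conn = Stomp::Connection.new(
    :hosts => [
      { :host => 'amq01.demo.lan', :port => 61613 },
      { :host => 'amq02.demo.lan', :port => 61613 },
      { :host => 'amq03.demo.lan', :port => 61613 },
    ],
    :reliable         => true,  # keep retrying across the pool
    :connect_timeout  => 5,     # bound the TCP connect itself
    :connread_timeout => 5      # bound the wait for CONNECTED/ERROR (1.2.16+)
  )
  conn.disconnect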

Comment 1 Brenton Leanhardt 2016-04-28 13:15:34 UTC
Handing off to Tim for triage.

Comment 10 Rory Thrasher 2016-09-09 21:30:49 UTC
Temporarily moving this to MODIFIED to add it to the 2.2.11 errata. Investigation is still ongoing.

Comment 19 Rory Thrasher 2016-12-15 23:06:07 UTC
Closing this bug as WONTFIX. This will be documented as a Known Issue in the RHOSE 2.2 asynchronous release notes.

In our testing, we have found that this issue is extremely rare and cannot be avoided with changes to OpenShift code. Currently, the recommendation is to ensure that ActiveMQ servers have adequate available memory, with plenty of swap space in case of a usage spike. OOM conditions on the ActiveMQ server are the most likely cause of this issue.

Additionally, it could help to configure stomp to randomly select a server from the pool, rather than always using the first one in the list:

  In /opt/rh/ruby193/root/etc/mcollective/server.cfg on nodes and in
  /opt/rh/ruby193/root/etc/mcollective/client.cfg on brokers:
    plugin.stomp.pool.randomize = true
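For completeness, the stomp gem exposes the equivalent behavior through a :randomize connection parameter; a short sketch reusing the pool from the log above (illustration only, not the Broker's configuration mechanism):

  require 'stomp'

  conn = Stomp::Connection.new(
    :hosts => [
      { :host => 'amq01.demo.lan', :port => 61613 },
      { :host => 'amq02.demo.lan', :port => 61613 },
      { :host => 'amq03.demo.lan', :port => 61613 },
    ],
    :randomize => true,  # shuffle the pool instead of always starting at amq01
    :reliable  => true
  )
  conn.disconnect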

