Bug 1401118 - Broker does not handle mcollective server failover correctly
Summary: Broker does not handle mcollective server failover correctly
Keywords:
Status: CLOSED EOL
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 2.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ashley Hardin
QA Contact: Vikram Goyal
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On: 1331398
Blocks:
 
Reported: 2016-12-02 21:50 UTC by Rory Thrasher
Modified: 2020-12-14 07:54 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: In rare cases, a broker successfully makes an ActiveMQ connection, but the ActiveMQ server immediately runs out of memory or otherwise becomes unable to do anything with that connection.
Consequence: The broker holds an open connection to an ActiveMQ server that is not functioning, and it does not properly fail over to an alternate ActiveMQ server, even if one is available.
Workaround (if any): The best workaround is to determine and fix the cause of the ActiveMQ failure (most likely an out-of-memory error). Setting wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false in the file `/etc/activemq/wrapper.conf` has fixed the memory issue.
Result: Fixing the problem behind the ActiveMQ failure should allow ActiveMQ to function normally.
Clone Of: 1331398
Environment:
Last Closed: 2017-05-09 12:33:45 UTC
Target Upstream Version:



Comment 2 Johnny Liu 2016-12-13 05:46:34 UTC
It seems that wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=true is already set in /etc/activemq/wrapper.conf.

@Rory, what does QE need to do here?
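
For reference, the Doc Text workaround amounts to flipping that same flag. A minimal sketch of the edit, assuming the property index .8 matches the existing entry in your wrapper.conf:

  # /etc/activemq/wrapper.conf
  # Current setting (as noted above):
  wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=true
  # Workaround from the Doc Text:
  wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false

ActiveMQ must then be restarted (e.g. `service activemq restart`) for the change to take effect.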

Comment 3 Rory Thrasher 2016-12-13 19:06:56 UTC
This will only be a documentation change to note this case as a known issue.  I'll make sure a link to the proposed change is posted here when available.  

There are no code changes to verify.

Comment 4 Timothy Williams 2016-12-15 16:25:45 UTC
I'm removing this bug from the release and moving it to CLOSED. This will be documented as a Known Issue in the RHOSE 2.2 asynchronous release notes.

In our testing, we have found that this issue is extremely rare and cannot be avoided with changes to openshift code. Currently, the recommendation is to ensure that activemq servers have adequate available memory with plenty of swap space in case of a spike in usage. OOM conditions on the activemq server are the most likely cause of this issue.
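
A quick way to sanity-check both on each activemq host (illustrative commands; wrapper.java.maxmemory is the Tanuki wrapper's JVM heap ceiling, and the value shipped in your wrapper.conf may differ per deployment):

  # Check available memory and swap, in MB:
  free -m
  # Inspect the JVM heap ceiling allowed to the ActiveMQ process:
  grep wrapper.java.maxmemory /etc/activemq/wrapper.conf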

Additionally, it could help to configure stomp to randomly select a server from the pool, rather than always using the first in the list:

  In /opt/rh/root/etc/mcollective/server.cfg on nodes and in
  /opt/rh/root/etc/mcollective/client.cfg on brokers:
    plugin.stomp.pool.randomize = true
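
  For context, a sketch of how that option typically sits alongside an
  existing two-server stomp pool; the hostnames here are hypothetical, and
  the exact pool keys should match whatever pool is already defined in your
  server.cfg/client.cfg:

    plugin.stomp.pool.size = 2
    plugin.stomp.pool.host1 = activemq1.example.com
    plugin.stomp.pool.port1 = 61613
    plugin.stomp.pool.host2 = activemq2.example.com
    plugin.stomp.pool.port2 = 61613
    plugin.stomp.pool.randomize = true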

Comment 6 Rory Thrasher 2016-12-16 17:24:06 UTC
Andddd reopening. The original bug was closed for this reason. We'll keep this docs bug open for tracking the docs changes. It won't be on the 2.2.11 errata, as the docs change can be made whenever.

