Bug 1401118

Summary: Broker does not handle mcollective server failover correctly
Product: OpenShift Container Platform
Reporter: Rory Thrasher <rthrashe>
Component: Documentation
Assignee: Ashley Hardin <ahardin>
Status: CLOSED EOL
QA Contact: Vikram Goyal <vigoyal>
Severity: urgent
Docs Contact: Vikram Goyal <vigoyal>
Priority: high
Version: 2.2.0
CC: aos-bugs, ederevea, erich, jialiu, jkaur, jokerman, mmccomas, rthrashe, saime, tiwillia
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: In rare cases, a broker successfully opens a connection to an ActiveMQ server, but the ActiveMQ server immediately runs out of memory or otherwise becomes unable to do anything with that connection.
Consequence: The broker is left with an open connection to an ActiveMQ server that is not functioning, and it does not properly fail over to an alternate ActiveMQ server, even if one is available.
Workaround (if any): The best way to work around this issue is to determine and fix the cause of the ActiveMQ failure (probably an out-of-memory error). Setting wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false in the file `/etc/activemq/wrapper.conf` was able to fix the memory issue.
Result: Fixing the problem behind the ActiveMQ failure should allow ActiveMQ to function normally.
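For example, using the property format shown in comment 2, the workaround line in `/etc/activemq/wrapper.conf` would read:

  wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false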
Story Points: ---
Clone Of: 1331398
Environment:
Last Closed: 2017-05-09 12:33:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1331398    
Bug Blocks:    

Comment 2 Johnny Liu 2016-12-13 05:46:34 UTC
Seems like wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=true is already set in /etc/activemq/wrapper.conf.
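
One way to confirm the current value (an illustrative check; output shown matches the setting described above):

  $ grep UseDedicatedTaskRunner /etc/activemq/wrapper.conf
  wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=true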

@Rory, what does QE need to do here?

Comment 3 Rory Thrasher 2016-12-13 19:06:56 UTC
This will only be a documentation change to note this case as a known issue.  I'll make sure a link to the proposed change is posted here when available.  

There are no code changes to verify.

Comment 4 Timothy Williams 2016-12-15 16:25:45 UTC
I'm removing this bug from the release and moving it to CLOSED. This will be documented as a Known Issue in the RHOSE 2.2 asynchronous release notes.

In our testing, we have found that this issue is extremely rare and cannot be avoided with changes to OpenShift code. Currently, the recommendation is to ensure that ActiveMQ servers have adequate available memory, with plenty of swap space in case of a spike in usage. OOM conditions on the ActiveMQ server are the most likely cause of this issue.
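
For example (an illustrative check; wrapper.java.maxmemory is the Java Service Wrapper's heap setting, and the value shown is only a placeholder):

  # On the ActiveMQ server, check available memory and swap:
  $ free -m
  # The broker JVM's heap ceiling is set in /etc/activemq/wrapper.conf:
  wrapper.java.maxmemory=512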

Additionally, it could help to configure stomp to randomly select a server from the pool, rather than always using the first in the list:

  In /opt/rh/ruby193/root/etc/mcollective/server.cfg on nodes and in
  /opt/rh/ruby193/root/etc/mcollective/client.cfg on brokers:
    plugin.stomp.pool.randomize = true
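
For context, a minimal sketch of the surrounding pool configuration in those files (hostnames, ports, and pool size are placeholders; key names assume MCollective's stomp connector):

    connector = stomp
    plugin.stomp.pool.size = 2
    plugin.stomp.pool.host1 = activemq1.example.com
    plugin.stomp.pool.port1 = 61613
    plugin.stomp.pool.host2 = activemq2.example.com
    plugin.stomp.pool.port2 = 61613
    # Choose a random pool member at connect time instead of always the first:
    plugin.stomp.pool.randomize = true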

Comment 6 Rory Thrasher 2016-12-16 17:24:06 UTC
Andddd reopening. The original bug was closed for this reason. We'll keep this docs bug open for tracking the docs changes. It won't be on the 2.2.11 errata, as the docs change can be made whenever.