Bug 1401118

Summary: Broker does not handle mcollective server failover correctly
Product: OpenShift Container Platform
Reporter: Rory Thrasher <rthrashe>
Component: Documentation
Assignee: Ashley Hardin <ahardin>
Status: CLOSED EOL
QA Contact: Vikram Goyal <vigoyal>
Severity: urgent
Docs Contact: Vikram Goyal <vigoyal>
Priority: high
Version: 2.2.0
CC: aos-bugs, ederevea, erich, jialiu, jkaur, jokerman, mmccomas, rthrashe, saime, tiwillia
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: In rare cases, a broker successfully opens a connection to an ActiveMQ server, but the ActiveMQ server immediately runs out of memory or otherwise becomes unable to do anything with that connection.
Consequence: The broker is left with an open connection to an ActiveMQ server that is not functioning, and it does not properly fail over to an alternate ActiveMQ server, even if one is available.
Workaround (if any): The best way to work around this issue is to determine and fix the cause of the ActiveMQ failure (probably an out-of-memory error). Setting wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false in the file `/etc/activemq/wrapper.conf` was able to fix the memory issue.
Result: Fixing the problem behind the ActiveMQ failure should allow ActiveMQ to function normally.
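For example, using the property format shown in comment 2, the workaround line in `/etc/activemq/wrapper.conf` would read:

  wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false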
Story Points: ---
Clone Of: 1331398
Environment:
Last Closed: 2017-05-09 12:33:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1331398    
Bug Blocks:    

Comment 2 Johnny Liu 2016-12-13 05:46:34 UTC
Seems like wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=true is already set in /etc/activemq/wrapper.conf.
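
One way to confirm the current value (an illustrative check; output shown matches the setting described above):

  $ grep UseDedicatedTaskRunner /etc/activemq/wrapper.conf
  wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=true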

@Rory, what does QE need to do here?

Comment 3 Rory Thrasher 2016-12-13 19:06:56 UTC
This will only be a documentation change to note this case as a known issue.  I'll make sure a link to the proposed change is posted here when available.  

There are no code changes to verify.

Comment 4 Timothy Williams 2016-12-15 16:25:45 UTC
I'm removing this bug from the release and moving it to CLOSED. This will be documented as a Known Issue in the RHOSE 2.2 asynchronous release notes.

In our testing, we have found that this issue is extremely rare and cannot be avoided with changes to OpenShift code. Currently, the recommendation is to ensure that ActiveMQ servers have adequate available memory, with plenty of swap space in case of a spike in usage. OOM conditions on the ActiveMQ server are the most likely cause of this issue.
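
For example (an illustrative check; wrapper.java.maxmemory is the Java Service Wrapper's heap setting, and the value shown is only a placeholder):

  # On the ActiveMQ server, check available memory and swap:
  $ free -m
  # The broker JVM's heap ceiling is set in /etc/activemq/wrapper.conf:
  wrapper.java.maxmemory=512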

Additionally, it could help to configure stomp to randomly select a server from the pool, rather than always using the first in the list:

  In /opt/rh/ruby193/root/etc/mcollective/server.cfg on nodes and in
  /opt/rh/ruby193/root/etc/mcollective/client.cfg on brokers:
    plugin.stomp.pool.randomize = true
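
For context, a minimal sketch of the surrounding pool configuration in those files (hostnames, ports, and pool size are placeholders; key names assume MCollective's stomp connector):

    connector = stomp
    plugin.stomp.pool.size = 2
    plugin.stomp.pool.host1 = activemq1.example.com
    plugin.stomp.pool.port1 = 61613
    plugin.stomp.pool.host2 = activemq2.example.com
    plugin.stomp.pool.port2 = 61613
    # Choose a random pool member at connect time instead of always the first:
    plugin.stomp.pool.randomize = true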

Comment 6 Rory Thrasher 2016-12-16 17:24:06 UTC
Andddd reopening. The original bug was closed for this reason. We'll keep this docs bug open for tracking the docs changes. It won't be on the 2.2.11 errata, as the docs change can be made whenever.