| Summary: | Broker does not handle mcollective server failover correctly | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Rory Thrasher <rthrashe> |
| Component: | Documentation | Assignee: | Ashley Hardin <ahardin> |
| Status: | CLOSED EOL | QA Contact: | Vikram Goyal <vigoyal> |
| Severity: | urgent | Docs Contact: | Vikram Goyal <vigoyal> |
| Priority: | high | ||
| Version: | 2.2.0 | CC: | aos-bugs, ederevea, erich, jialiu, jkaur, jokerman, mmccomas, rthrashe, saime, tiwillia |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: |
Cause: In rare cases, a broker successfully establishes a connection to an ActiveMQ server, but the ActiveMQ server immediately runs out of memory or is otherwise unable to do anything with that connection.
Consequence: The broker holds an open connection to an ActiveMQ server that is not functioning, and it does not properly fail over to an alternate ActiveMQ server even when one is available.
Workaround (if any): The best workaround is to determine and fix the cause of the ActiveMQ failure (most likely an out-of-memory error). Setting `wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false` in `/etc/activemq/wrapper.conf` was able to fix the memory issue (see the illustrative excerpt after the table below).
Result: Fixing the problem behind the ActiveMQ failure should allow ActiveMQ to function normally.
|
| Story Points: | --- |
| Clone Of: | 1331398 | Environment: | |
| Last Closed: | 2017-05-09 12:33:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 1331398 | ||
| Bug Blocks: | |||
|
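To illustrate the workaround described in the Doc Text above, a minimal sketch of the relevant line in `/etc/activemq/wrapper.conf` might look like the following. Only the `wrapper.java.additional.8` line comes from this bug; the comments are explanatory and the property index may differ on a given installation.

```
# /etc/activemq/wrapper.conf (illustrative excerpt)
# Pass an extra JVM property to ActiveMQ so it uses a shared thread pool
# instead of a dedicated task runner thread per task, which lowers
# thread count and memory usage under load.
wrapper.java.additional.8=-Dorg.apache.activemq.UseDedicatedTaskRunner=false
```

ActiveMQ would need to be restarted for the change to take effect (for example, `service activemq restart` on a RHEL 6 host; the exact command depends on the init system).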
Comment 2
Johnny Liu
2016-12-13 05:46:34 UTC
This will only be a documentation change to note this case as a known issue. I'll make sure a link to the proposed change is posted here when available. There are no code changes to verify. I'm removing this bug from the release and moving it to CLOSED. This will be documented as a Known Issue in the RHOSE 2.2 asynchronous release notes.
In our testing, we have found that this issue is extremely rare and cannot be avoided with changes to openshift code. Currently, the recommendation is to ensure that activemq servers have adequate available memory with plenty of swap space in case of a spike in usage. OOM conditions on the activemq server are the most likely cause of this issue.
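As a rough illustration of that recommendation, the following commands can confirm memory and swap headroom on an ActiveMQ host and look for past OOM errors. The log path is an assumption and may differ depending on how ActiveMQ was installed; the commands themselves are standard RHEL utilities, not part of this bug.

```
free -m        # show available RAM in megabytes
swapon -s      # confirm swap space is configured
# Assumed log location; adjust for your installation:
grep -i "outofmemory" /var/log/activemq/activemq.log
```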
Additionally, it could help to configure stomp to randomly select a server from the pool, rather than always using the first in the list:
In `/opt/rh/root/etc/mcollective/server.cfg` on nodes and in `/opt/rh/root/etc/mcollective/client.cfg` on brokers, set:
plugin.stomp.pool.randomize = true
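For example, a sketch of what the pool section of one of those files might contain is shown below. The pool size and host/port entries are placeholders standing in for an existing deployment's values and are not taken from this bug; only the `plugin.stomp.pool.randomize` line is the setting recommended above.

```
# Illustrative MCollective stomp pool configuration (placeholder values)
plugin.stomp.pool.size = 2
plugin.stomp.pool.host1 = activemq1.example.com
plugin.stomp.pool.port1 = 61613
plugin.stomp.pool.host2 = activemq2.example.com
plugin.stomp.pool.port2 = 61613
# Pick a starting server at random instead of always trying host1 first:
plugin.stomp.pool.randomize = true
```

The MCollective service on nodes (and the broker's MCollective client) would need to be restarted for the change to take effect.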
And... reopening. The original bug was closed for this reason. We'll keep this docs bug open to track the docs changes. It won't be on the 2.2.11 errata, as the docs change can be made at any time.