Bug 1065048

Summary:	broker does not handle activemq outages gracefully
Product:	OpenShift Container Platform	Reporter:	Luke Meyer <lmeyer>
Component:	Node	Assignee:	Luke Meyer <lmeyer>
Status:	CLOSED ERRATA	QA Contact:	libra bugs <libra-bugs>
Severity:	medium	Docs Contact:
Priority:	low
Version:	2.0.0	CC:	adellape, bleanhar, charles_sheridan, gpei, libra-onpremise-devel, xiama
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	The MCollective client configuration settings used a default timeout value, which caused the broker to wait for a prolonged period of time when attempting to connect to ActiveMQ. When ActiveMQ was unreachable, the broker waited and eventually failed as if the requests had timed out without displaying helpful error messages. This bug fix updates the client configuration to set a reasonable default timeout value of 6.3 seconds, and broker requests now time out faster and helpful error messages are displayed when ActiveMQ is unreachable. This bug fix configuration change is made only in the installation utility and scripts; for existing installations, an administrator must make the suggested changes manually.	Story Points:	---
Clone Of:	1065047
Clones:	1133958		Environment:
Last Closed:	2014-09-11 20:07:03 UTC	Type:	Bug
Bug Depends On:	1065047
Bug Blocks:	1133958

Description Luke Meyer 2014-02-13 18:37:07 UTC

+++ This bug was initially created as a clone of Bug #1065047 +++

Description of problem:
When activemq is unavailable (which can happen due to any number of failures: DNS record missing, network broken, port blocked, activemq stopped or crashed...) it appears that the broker sets no timeout in its attempt to reach activemq via MCollective. Thus the user experience is that their API call stalls until httpd times out the request, and they get no useful error message. There isn't even anything in the broker logs to indicate what is going on.

Steps to Reproduce:
1. Create an application "foo"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
  # rhc app-restart foo
  # rhc cartridge add mysql -a foo
  # rhc app show --gears -a foo
  # rhc app create foo2 ruby-1.9

Actual results:
After several minutes of waiting:
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server
'https://broker.example.com/broker/rest/domain/demo/application/foo/events'. (and similar)

Expected results:
The request should time out after a few seconds, once the mco client realizes it cannot even connect to activemq, and the broker should return a 503 or similar HTTP error code with a non-misleading error message, e.g. "The service is temporarily unavailable; sorry, please try later." Preferably there would also be clear error messages in the rails log or httpd error_log.

Additional info:
This is similar to the problem that occurs with mco directed requests to a node that is not answering; but that deserves a separate bug as the "activemq down" problem ought to be a lot easier to detect and manage.
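
To illustrate why "activemq down" should be cheap to detect, here is a minimal sketch of a bounded TCP connect probe. It is not the broker's or MCollective's code. The function name `activemq_reachable` and the 3-second timeout are hypothetical; port 61613 is the STOMP port that appears in the mcollective log in this report.

```python
import socket


def activemq_reachable(host, port=61613, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only. DNS failures, refused
    connections, and connect timeouts all surface as OSError subclasses,
    so each of them is reported as "unreachable" rather than hanging.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A caller could map a `False` result to an HTTP 503 with a clear message instead of letting the request stall until httpd gives up.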

Comment 3 Gaoyun Pei 2014-03-24 04:31:57 UTC

Tested this with the following packages:
[root@broker ~]# rpm -qa|grep mcollective
ruby193-mcollective-client-2.4.1-3.el6op.noarch
rubygem-openshift-origin-msg-broker-mcollective-1.22.2-1.git.167.c0332d5.el6op.noarch
ruby193-mcollective-common-2.4.1-3.el6op.noarch

[root@node1 ~]# rpm -qa|grep mcollective
ruby193-mcollective-common-2.4.1-3.el6op.noarch
openshift-origin-msg-node-mcollective-1.21.2-1.git.182.5e73e48.el6op.noarch
ruby193-mcollective-2.4.1-3.el6op.noarch

After stopping the activemq service and trying to restart the application, the client still reports the same error:
[root@broker conf.d]# rhc app restart app1
Password: ******
An error occurred while communicating with the server. This problem may only be temporary. Check that you
have correctly specified your OpenShift server
'https://broker.ose-201403214.com.cn/broker/rest/application/532fa7ffcfb77f671400003c/events'.


Only mcollective reports this error, in ruby193-mcollective.log:
E, [2014-03-24T00:13:18.321231 #1084] ERROR -- : activemq.rb:133:in `on_miscerr' Unexpected error on connection stomp://mcollective.com.cn:61613: es_recv: connection.receive returning EOF as nil - resetting connection.


There is no clear error information about activemq in the httpd or broker logs; the httpd error_log contains only the following errors:
[Sun Mar 23 23:52:14 2014] [error] [client 10.66.78.226] (70007)The timeout specified has expired: proxy: error reading status line from remote server 127.0.0.1
[Sun Mar 23 23:52:14 2014] [error] [client 10.66.78.226] proxy: Error reading from remote server returned by /broker/rest/application/532fa7ffcfb77f671400003c/events

Comment 4 Luke Meyer 2014-08-25 23:35:11 UTC

https://github.com/openshift/openshift-extras/pull/440

Changing the installer to set decent defaults for mcollective timeouts. Pre-2.1 code changes allowed the timeout error to be displayed.

I would consider an ose-upgrade automatic modification to mco configuration but at this time I think it may be best just to note the changes made:

broker: add to /opt/rh/ruby193/root/etc/mcollective/client.cfg
# Broker will retry ActiveMQ connection, then report error
plugin.activemq.initial_reconnect_delay = 0.1
plugin.activemq.max_reconnect_attempts = 6

node: add to /opt/rh/ruby193/root/etc/mcollective/server.cfg
# Node should retry connecting to ActiveMQ forever
plugin.activemq.max_reconnect_attempts = 0
plugin.activemq.initial_reconnect_delay = 0.1
plugin.activemq.max_reconnect_delay = 4.0
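
As a rough check on these values, the sketch below adds up the reconnect delays. It assumes exponential backoff with a multiplier of 2 and a 30-second delay cap where none is configured, and it assumes one delay is counted per attempt; none of these are stated in this report. The function name `reconnect_delays` is hypothetical. Under those assumptions the broker settings give the 6.3 seconds mentioned in the Doc Text.

```python
def reconnect_delays(initial_delay, attempts, multiplier=2.0, max_delay=30.0):
    """List the delay applied before each reconnect attempt.

    Assumes exponential backoff: each delay is `multiplier` times the
    previous one, capped at `max_delay`. The multiplier and cap defaults
    are assumptions, not values taken from this bug report.
    """
    delays = []
    delay = initial_delay
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= multiplier
    return delays


# Broker: initial_reconnect_delay = 0.1, max_reconnect_attempts = 6
broker_delays = reconnect_delays(0.1, 6)
broker_total = round(sum(broker_delays), 1)  # 0.1+0.2+0.4+0.8+1.6+3.2 = 6.3

# Node: retries forever; show the first 8 delays with max_reconnect_delay = 4.0
node_delays = reconnect_delays(0.1, 8, max_delay=4.0)
```

With the node settings the delay stops growing once it reaches 4.0 seconds, so a node keeps retrying at that interval until ActiveMQ returns.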

Comment 5 Ma xiaoqiang 2014-08-26 05:11:48 UTC

Check on puddle [2.1.z/2014-08-25.2]

1. Create an application "phpapp"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
  # rhc app-restart phpapp
  # rhc cartridge add mysql -a phpapp
  # rhc app show --gears -a phpapp
  # rhc app create rb19 ruby-1.9
The output:
Unable to complete the requested operation due to: Could not connect to ActiveMQ Server: Stomp::Error::MaxReconnectAttempts. Please try again and contact support if the issue persists.
Reference ID: 3cf9f9789921ec0639dd54c4f1a81bb5

The output now gives a useful error message.

Comment 9 errata-xmlrpc 2014-09-11 20:07:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1183.html