1065048 – broker does not handle activemq outages gracefully

Bug 1065048 - broker does not handle activemq outages gracefully

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	2.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Luke Meyer
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:	1065047
Blocks:	1133958

Reported:	2014-02-13 18:37 UTC by Luke Meyer
Modified:	2014-09-11 20:07 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	The MCollective client configuration settings used a default timeout value, which caused the broker to wait for a prolonged period when attempting to connect to ActiveMQ. When ActiveMQ was unreachable, the broker waited and eventually failed as if the requests had timed out, without displaying helpful error messages. This bug fix updates the client configuration to set a reasonable default timeout value of 6.3 seconds; broker requests now time out faster, and helpful error messages are displayed when ActiveMQ is unreachable. The configuration change is made only in the installation utility and scripts; for existing installations, an administrator must make the suggested changes manually.
Clone Of:	1065047
Clones:	1133958
Environment:
Last Closed:	2014-09-11 20:07:03 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2014:1183	0	normal	SHIPPED_LIVE	Red Hat OpenShift Enterprise 2.1.6 bug fix and enhancement update	2014-09-12 00:06:21 UTC

Description Luke Meyer 2014-02-13 18:37:07 UTC

+++ This bug was initially created as a clone of Bug #1065047 +++

Description of problem:
When activemq is unavailable (which can happen due to any number of failures: DNS record missing, network broken, port blocked, activemq stopped or crashed...) it appears that the broker sets no timeout in its attempt to reach activemq via MCollective. Thus the user experience is that their API call stalls until httpd times out the request, and they get no useful error message. There isn't even anything in the broker logs to indicate what is going on.

Steps to Reproduce:
1. Create an application "foo"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
  # rhc app-restart foo
  # rhc cartridge add mysql -a foo
  # rhc app show --gears -a foo
  # rhc app create foo2 ruby-1.9

Actual results:
After several minutes of waiting -
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server
'https://broker.example.com/broker/rest/domain/demo/application/foo/events'. (and similar)

Expected results:
The request times out after a few seconds, once the mco client realizes it cannot even connect to activemq, and the broker returns a 503 or similar HTTP error code with a non-misleading error message, e.g. "The service is temporarily unavailable; sorry, please try later." Preferably there would also be clear error messages in the rails log or httpd error_log.
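
The expected behaviour could be sketched as bounded retries followed by a clear 503. This is an illustration only, not the broker's actual code: `request_with_bounded_retries`, the `connect` callable and the retry parameters are hypothetical names chosen for this example, and the message is the one suggested in the expected results.

```python
import time

def request_with_bounded_retries(connect, attempts=6, initial_delay=0.1,
                                 multiplier=2.0, sleep=time.sleep):
    """Call connect() up to `attempts` times, then give up with a 503.

    Returns a (http_status, message) tuple.
    """
    delay = initial_delay
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            connect()
            return 200, "OK"
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                # Back off before the next attempt; no sleep after the last one.
                sleep(delay)
                delay *= multiplier
    return 503, ("The service is temporarily unavailable; sorry, please try later. "
                 "(%s)" % last_error)


def always_down():
    # Stand-in for a connection attempt while activemq is stopped.
    raise ConnectionRefusedError("activemq unreachable")


# Skip real sleeping so the demonstration returns immediately.
status, message = request_with_bounded_retries(always_down, attempts=3,
                                               sleep=lambda seconds: None)
print(status, message)
```

The point of the sketch is that the caller gets a definite status and a readable message after a bounded wait, instead of stalling until httpd gives up on the request.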

Additional info:
This is similar to the problem that occurs with mco directed requests to a node that is not answering; but that deserves a separate bug as the "activemq down" problem ought to be a lot easier to detect and manage.

Comment 3 Gaoyun Pei 2014-03-24 04:31:57 UTC

Tested this with the following packages:
[root@broker ~]# rpm -qa|grep mcollective
ruby193-mcollective-client-2.4.1-3.el6op.noarch
rubygem-openshift-origin-msg-broker-mcollective-1.22.2-1.git.167.c0332d5.el6op.noarch
ruby193-mcollective-common-2.4.1-3.el6op.noarch

[root@node1 ~]# rpm -qa|grep mcollective
ruby193-mcollective-common-2.4.1-3.el6op.noarch
openshift-origin-msg-node-mcollective-1.21.2-1.git.182.5e73e48.el6op.noarch
ruby193-mcollective-2.4.1-3.el6op.noarch

After stopping the activemq service, trying to restart the application still gives the same error on the client side:
[root@broker conf.d]# rhc app restart app1
Password: ******
An error occurred while communicating with the server. This problem may only be temporary. Check that you
have correctly specified your OpenShift server
'https://broker.ose-201403214.com.cn/broker/rest/application/532fa7ffcfb77f671400003c/events'.


Only mcollective reports this error, in ruby193-mcollective.log:
E, [2014-03-24T00:13:18.321231 #1084] ERROR -- : activemq.rb:133:in `on_miscerr' Unexpected error on connection stomp://mcollective.com.cn:61613: es_recv: connection.receive returning EOF as nil - resetting connection.


There is no clear error information about activemq in the httpd or broker logs; the httpd error_log contains only errors such as these:
[Sun Mar 23 23:52:14 2014] [error] [client 10.66.78.226] (70007)The timeout specified has expired: proxy: error reading status line from remote server 127.0.0.1
[Sun Mar 23 23:52:14 2014] [error] [client 10.66.78.226] proxy: Error reading from remote server returned by /broker/rest/application/532fa7ffcfb77f671400003c/events
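
Because the httpd log shows only a proxy timeout, a direct probe of the STOMP port can help confirm that activemq itself is unreachable. A minimal sketch, assuming the host and port from the mcollective log line above (substitute your own); `stomp_port_reachable` is a hypothetical helper, and it only tests that a TCP connection can be opened, not that ActiveMQ is healthy.

```python
import socket

def stomp_port_reachable(host, port=61613, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, calling `stomp_port_reachable("mcollective.com.cn")` from the broker host should return False while the activemq service is stopped.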

Comment 4 Luke Meyer 2014-08-25 23:35:11 UTC

https://github.com/openshift/openshift-extras/pull/440

Changing the installer to set decent defaults for mcollective timeouts. Pre-2.1 code changes allowed the timeout error to be displayed.

I would consider an ose-upgrade automatic modification to mco configuration but at this time I think it may be best just to note the changes made:

broker: add to /opt/rh/ruby193/root/etc/mcollective/client.cfg
# Broker will retry ActiveMQ connection, then report error
plugin.activemq.initial_reconnect_delay = 0.1
plugin.activemq.max_reconnect_attempts = 6

node: add to /opt/rh/ruby193/root/etc/mcollective/server.cfg
# Node should retry connecting to ActiveMQ forever
plugin.activemq.max_reconnect_attempts = 0
plugin.activemq.initial_reconnect_delay = 0.1
plugin.activemq.max_reconnect_delay = 4.0
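
As a rough check, the broker settings above are consistent with the 6.3-second timeout mentioned in the Doc Text field. The sketch below assumes that the connector's exponential back-off is enabled with a multiplier of 2 (believed to be the MCollective defaults, not stated in this bug) and that one delay is slept per failed attempt; `total_reconnect_wait` is a hypothetical helper, not part of MCollective.

```python
def total_reconnect_wait(initial_delay, attempts, multiplier=2.0, max_delay=None):
    """Sum the delays slept across `attempts` failed connection attempts."""
    total = 0.0
    delay = initial_delay
    for _ in range(attempts):
        if max_delay is not None:
            # Cap each delay, as max_reconnect_delay does for the node.
            delay = min(delay, max_delay)
        total += delay
        delay *= multiplier
    return total


# Broker: initial_reconnect_delay = 0.1, max_reconnect_attempts = 6
# Delays: 0.1 + 0.2 + 0.4 + 0.8 + 1.6 + 3.2
broker_wait = total_reconnect_wait(initial_delay=0.1, attempts=6)
print(round(broker_wait, 1))  # 6.3
```

On the node, retries never stop, so there is no total; with max_reconnect_delay = 4.0 the wait between attempts simply stops growing once it reaches 4 seconds.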

Comment 5 Ma xiaoqiang 2014-08-26 05:11:48 UTC

Checked on puddle [2.1.z/2014-08-25.2]

1. Create an application "phpapp"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
  # rhc app-restart phpapp
  # rhc cartridge add mysql -a phpapp
  # rhc app show --gears -a phpapp
  # rhc app create rb19 ruby-1.9
The output:
Unable to complete the requested operation due to: Could not connect to ActiveMQ Server: Stomp::Error::MaxReconnectAttempts. Please try again and contact support if the issue persists.
Reference ID: 3cf9f9789921ec0639dd54c4f1a81bb5

A useful error message is now given.

Comment 9 errata-xmlrpc 2014-09-11 20:07:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1183.html
