1065047 – broker does not handle activemq outages gracefully

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Pod
Sub Component:
Version:	1.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Lili Nader
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1065048

Reported:	2014-02-13 18:28 UTC by Luke Meyer
Modified:	2014-10-26 20:42 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1065048
Environment:
Last Closed:	2014-05-15 15:28:11 UTC
Target Upstream Version:
Embargoed:

Description Luke Meyer 2014-02-13 18:28:14 UTC

Description of problem:
When activemq is unavailable (which can happen due to any number of failures: DNS record missing, network broken, port blocked, activemq stopped or crashed...) it appears that the broker sets no timeout in its attempt to reach activemq via MCollective. Thus the user experience is that their API call stalls until httpd times out the request, and they get no useful error message. There isn't even anything in the broker logs to indicate what is going on.

Steps to Reproduce:
1. Create an application "foo"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
# rhc app-restart foo
# rhc cartridge add mysql -a foo
# rhc app show --gears -a foo
# rhc app create foo2 ruby-1.9

Actual results:
After several minutes of waiting -
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server
'https://broker.example.com/broker/rest/domain/demo/application/foo/events'. (and similar)

Expected results:
The mco client should time out after a few seconds once it realizes it cannot even connect to activemq, and the broker should return a 503 or similar HTTP error code with a non-misleading error message, e.g. "The service is temporarily unavailable; sorry, please try later." Preferably there would also be clear error messages in the rails log or httpd error_log.

Additional info:
This is similar to the problem that occurs with mco directed requests to a node that is not answering; but that deserves a separate bug as the "activemq down" problem ought to be a lot easier to detect and manage.

Comment 1 Lili Nader 2014-04-09 20:12:49 UTC

It seems like the client tries to connect indefinitely (I gave up waiting after 147 attempts):

I, [2014-04-09T15:02:22.297553 #3298]  INFO -- : activemq.rb:113:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective:6163
I, [2014-04-09T15:02:22.299048 #3298]  INFO -- : activemq.rb:128:in `on_connectfail' TCP Connection to stomp://mcollective:6163 failed on attempt 0
.
.
.
I, [2014-04-09T16:07:03.439041 #3298]  INFO -- : activemq.rb:113:in `on_connecting' TCP Connection attempt 140 to stomp://mcollective:6163
I, [2014-04-09T16:07:03.439698 #3298]  INFO -- : activemq.rb:128:in `on_connectfail' TCP Connection to stomp://mcollective:6163 failed on attempt 140 

There are ways to configure how many times it attempts to connect before returning an error, according to:

https://github.com/puppetlabs/marionette-collective/blob/master/plugins/mcollective/connector/activemq.rb#L233

I'm looking into how and where I can configure these settings.
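For illustration, a bounded reconnect loop can be sketched as below. This is only a sketch, not the stomp or MCollective implementation: `connect_with_limit`, `MaxReconnectAttemptsError`, and the `delay` parameter are hypothetical names, and it assumes the semantics described later in this bug, where a limit of 0 means "retry indefinitely".

```ruby
# Hypothetical stand-in for the error raised once the attempt limit is hit.
class MaxReconnectAttemptsError < StandardError; end

# Calls the connector block until it succeeds.
# A limit of 0 means "retry forever"; a positive limit raises after that
# many failed attempts. `delay` is the pause between attempts in seconds.
def connect_with_limit(max_attempts, delay: 0, &connector)
  attempt = 0
  loop do
    begin
      # The block receives the zero-based attempt number, as in the log above.
      return connector.call(attempt)
    rescue IOError, SystemCallError
      attempt += 1
      if max_attempts > 0 && attempt >= max_attempts
        raise MaxReconnectAttemptsError, "gave up after #{attempt} attempts"
      end
      sleep(delay) if delay > 0
    end
  end
end
```

With a positive limit the caller gets an error it can report, instead of blocking until httpd times out the request.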

Comment 2 Lili Nader 2014-04-16 01:28:45 UTC

By setting the value

plugin.activemq.max_reconnect_attempts = 0

in the mcollective server.cfg, you can limit the number of attempts.  However, once the reconnect attempts have been exhausted, the mcollective server does not try to reconnect even if activemq comes back up, and it has to be restarted.

Ideally, we would like mcollective to continue to retry but let the broker know within a certain time limit that it cannot connect, so the broker can relay the message back to the client.

Now looking into mcollective configuration to see what can be done.

Comment 3 Lili Nader 2014-04-16 06:07:34 UTC

So there are 2 config files for mcollective: server.cfg and client.cfg.

By setting plugin.activemq.max_reconnect_attempts to a non-zero value in client.cfg, we can get the desired outcome, i.e. the mcollective client will relay a message back to the broker after several failed attempts at connecting to activeMQ.

By leaving plugin.activemq.max_reconnect_attempts=0 in server.cfg, we ensure that the mcollective server will continue to attempt to connect to activeMQ until activeMQ is back up.

More info is available at:
http://docs.puppetlabs.com/mcollective/reference/plugins/connector_activemq.html
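
As a rough sketch, the split between the two files would look like the fragments below. Only the plugin.activemq.max_reconnect_attempts key and its two values are taken from this bug (10 is the client value set in the li change); any other STOMP parameters are not reproduced here.

```ini
# client.cfg (mcollective client): give up after a bounded number of attempts
plugin.activemq.max_reconnect_attempts = 10

# server.cfg (mcollective server): 0 keeps retrying until activeMQ is back
plugin.activemq.max_reconnect_attempts = 0
```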

li
----------
Added STOMP config params and set plugin.activemq.max_reconnect_attempts=10
https://github.com/openshift/li/pull/2609

origin-server
-------------
- Changed the exception raised by rpcclient to NodeUnavailableException to indicate that a retry is advisable, i.e. it results in HTTP status code 503.

- Added a nil check on rpc_client before calling disconnect, to prevent a runtime exception when the connection is unsuccessful, i.e. undefined method `disconnect' for nil:NilClass

https://github.com/openshift/origin-server/pull/5278
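
The two origin-server changes can be illustrated with the self-contained sketch below. It is not the code from the pull request: `FakeRpcClient`, `ConnectFailed`, and `with_rpc_client` are hypothetical stand-ins, and `NodeUnavailableException` is redefined locally only so the example runs on its own.

```ruby
# Hypothetical stand-ins; the real classes live in origin-server and MCollective.
class NodeUnavailableException < StandardError; end
class ConnectFailed < StandardError; end

# Fake client whose constructor fails when the message bus is unreachable.
class FakeRpcClient
  attr_reader :connected

  def initialize(reachable)
    raise ConnectFailed, "Could not connect to ActiveMQ Server" unless reachable
    @connected = true
  end

  def restart(app)
    "restarted #{app}"
  end

  def disconnect
    @connected = false
  end
end

def with_rpc_client(reachable)
  rpc_client = nil
  begin
    rpc_client = FakeRpcClient.new(reachable)
    yield rpc_client
  rescue ConnectFailed => e
    # Translate the connection failure into an error that signals "retry later".
    raise NodeUnavailableException,
          "Unable to complete the requested operation due to: #{e.message}"
  ensure
    # rpc_client is still nil if the constructor raised, so guard the call.
    rpc_client.disconnect unless rpc_client.nil?
  end
end
```

Without the guard in the ensure clause, a failed connection would surface as a NoMethodError on nil and hide the real cause.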

Comment 4 Lili Nader 2014-04-16 06:10:33 UTC

The broker now returns a new message when activeMQ is down.  It relays the message from the mcollective client "as is".

rhc app start -a app 
Unable to complete the requested operation due to: Could not connect to ActiveMQ Server: Stomp::Error::MaxReconnectAttempts. Please try again and contact support if the issue persists.

Comment 5 openshift-github-bot 2014-04-16 20:20:41 UTC

Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/4149f5af518294926b6f60f37be05478721d86a7
Bug 1065047 - limit connection attempts by mcollective client to 10 (rather than indefinite)

Comment 6 openshift-github-bot 2014-04-16 20:20:43 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/c66d9cff6a3d90e4ee6adcdc5b31eae7b680fa01
Bug 1065047 - changed exception raised to NodeUnavailableException to indicate retry advisable (503)

Comment 7 Jianwei Hou 2014-04-17 08:00:59 UTC

Verified on devenv_4866

When activemq is in an outage (stopped, in my case), more reasonable messages are displayed to the end user.

Unable to complete the requested operation due to: Could not connect to ActiveMQ Server: Stomp::Error::MaxReconnectAttempts. Please try again and contact support if the issue persists.
