Description of problem:
When activemq is unavailable (which can happen due to any number of failures: DNS record missing, network broken, port blocked, activemq stopped or crashed...) it appears that the broker sets no timeout in its attempt to reach activemq via MCollective. Thus the user experience is that their API call stalls until httpd times out the request, and they get no useful error message. There isn't even anything in the broker logs to indicate what is going on.

Steps to Reproduce:
1. Create an application "foo"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
   # rhc app-restart foo
   # rhc cartridge add mysql -a foo
   # rhc app show --gears -a foo
   # rhc app create foo2 ruby-1.9

Actual results:
After several minutes of waiting:
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server 'https://broker.example.com/broker/rest/domain/demo/application/foo/events'.
(and similar)

Expected results:
Timeout after a few seconds, when the mco client realizes it can't even connect to activemq, and the broker returns a 503 or similar HTTP error code with a non-misleading error message, e.g. "The service is temporarily unavailable; sorry, please try later." And preferably some nice error messages in the rails log or httpd error_log.

Additional info:
This is similar to the problem that occurs with mco directed requests to a node that is not answering; but that deserves a separate bug, as the "activemq down" problem ought to be a lot easier to detect and manage.
It seems like the client tries to connect indefinitely (I gave up waiting after 147 attempts):

I, [2014-04-09T15:02:22.297553 #3298] INFO -- : activemq.rb:113:in `on_connecting' TCP Connection attempt 0 to stomp://mcollective:6163
I, [2014-04-09T15:02:22.299048 #3298] INFO -- : activemq.rb:128:in `on_connectfail' TCP Connection to stomp://mcollective:6163 failed on attempt 0
. . .
I, [2014-04-09T16:07:03.439041 #3298] INFO -- : activemq.rb:113:in `on_connecting' TCP Connection attempt 140 to stomp://mcollective:6163
I, [2014-04-09T16:07:03.439698 #3298] INFO -- : activemq.rb:128:in `on_connectfail' TCP Connection to stomp://mcollective:6163 failed on attempt 140

There are ways to configure how many times it attempts to connect before returning an error, according to https://github.com/puppetlabs/marionette-collective/blob/master/plugins/mcollective/connector/activemq.rb#L233

I'm looking into how and where I can configure these settings.
By setting plugin.activemq.max_reconnect_attempts to a non-zero value in the mcollective server.cfg you can limit the number of attempts (a value of 0 means keep retrying indefinitely). However, once the reconnect attempts have been exhausted, the mcollective server does not try to reconnect even if activemq comes back up, and it has to be restarted. Ideally, we would like mcollective to continue to retry but let the broker know within a certain time limit that it cannot connect, so the broker can relay the message back to the client. Now looking into the mcollective configuration to see what can be done.
So there are 2 config files for mcollective: server.cfg and client.cfg.

By setting plugin.activemq.max_reconnect_attempts to a non-zero value in client.cfg we get the desired outcome, i.e. the mcollective client relays a message back to the broker after several failed attempts at connecting to activeMQ. By leaving plugin.activemq.max_reconnect_attempts=0 in server.cfg we ensure that the mcollective server continues to attempt to connect until activeMQ is back up.

More info is available at http://docs.puppetlabs.com/mcollective/reference/plugins/connector_activemq.html

li
----------
Added STOMP config params and set plugin.activemq.max_reconnect_attempts=10
https://github.com/openshift/li/pull/2609

origin-server
-------------
- Changed the exception raised by rpcclient to NodeUnavailableException to indicate that a retry is advisable, i.e. it results in HTTP status code 503.
- Added a nil check on rpc_client before calling disconnect, to prevent a runtime exception when the connection is unsuccessful, i.e. undefined method `disconnect' for nil:NilClass
https://github.com/openshift/origin-server/pull/5278
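
For illustration, the client/server split described in this comment might look like the fragment below. This is only a sketch: the two values (10 and 0) are the ones quoted in this comment, the file locations are deliberately left out because they depend on the installation, and any other STOMP parameters added by the li pull request are not shown.

```
# client.cfg, read by the mcollective client that the broker uses.
# Give up after 10 attempts so the broker can report the failure.
plugin.activemq.max_reconnect_attempts = 10

# server.cfg, read by the mcollective server on each node.
# 0 means keep retrying until activeMQ is reachable again.
plugin.activemq.max_reconnect_attempts = 0
```

With this split only the short-lived client side gives up; the long-running mcollective server keeps retrying and should not need a restart after an activeMQ outage.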
New message returned from the broker when activeMQ is down. This relays the message from the mcollective client "as is".

rhc app start -a app
Unable to complete the requested operation due to: Could not connect to ActiveMQ Server: Stomp::Error::MaxReconnectAttempts.
Please try again and contact support if the issue persists.
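
The broker-side behaviour behind this message can be sketched roughly as follows. This is not the origin-server code: rpc_exec_sketch, FakeMaxReconnectAttempts and the connect callable are hypothetical names introduced for illustration, and NodeUnavailableException is redefined here as a local stand-in so the snippet runs without MCollective or the stomp gem. It only shows the two ideas from the fix: translate the connection failure into an exception that signals "retry advisable", and check rpc_client for nil before calling disconnect.

```ruby
# Stand-in for the broker's exception that maps to HTTP 503 (assumption:
# the real class is defined in origin-server, not here).
class NodeUnavailableException < StandardError; end

# Hypothetical stand-in for Stomp::Error::MaxReconnectAttempts.
class FakeMaxReconnectAttempts < StandardError; end

# connect is any callable that returns a client object or raises when
# the reconnect attempts are exhausted.
def rpc_exec_sketch(connect)
  rpc_client = nil
  begin
    rpc_client = connect.call
    yield rpc_client
  rescue FakeMaxReconnectAttempts => e
    # Relay the client's message "as is" inside a retryable exception.
    raise NodeUnavailableException,
          "Could not connect to ActiveMQ Server: #{e.message}"
  ensure
    # rpc_client is still nil when the connection never succeeded, so
    # guard the call to avoid "undefined method `disconnect' for nil".
    rpc_client.disconnect unless rpc_client.nil?
  end
end
```

A caller would wrap its mcollective request in rpc_exec_sketch and let the web layer turn NodeUnavailableException into a 503 response; how that mapping is wired up in the real broker is outside this sketch.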
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/4149f5af518294926b6f60f37be05478721d86a7
Bug 1065047 - limit connection attempts by mcollective client to 10 (rather than indefinite)
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/c66d9cff6a3d90e4ee6adcdc5b31eae7b680fa01
Bug 1065047 - changed exception raised to NodeUnavailableException to indicate retry advisable (503)
Verified on devenv_4866.

When activemq is in an outage (stopped, in my case), a more reasonable message is displayed to the end user:

Unable to complete the requested operation due to: Could not connect to ActiveMQ Server: Stomp::Error::MaxReconnectAttempts.
Please try again and contact support if the issue persists.