+++ This bug was initially created as a clone of Bug #1065047 +++

Description of problem:
When activemq is unavailable (which can happen due to any number of failures: DNS record missing, network broken, port blocked, activemq stopped or crashed...), it appears that the broker sets no timeout in its attempt to reach activemq via MCollective. As a result, the user's API call stalls until httpd times out the request, and they get no useful error message. There isn't even anything in the broker logs to indicate what is going on.

Steps to Reproduce:
1. Create an application "foo"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
# rhc app-restart foo
# rhc cartridge add mysql -a foo
# rhc app show --gears -a foo
# rhc app create foo2 ruby-1.9

Actual results:
After several minutes of waiting:
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server 'https://broker.example.com/broker/rest/domain/demo/application/foo/events'.
(and similar)

Expected results:
A timeout after a few seconds, when the mco client realizes it cannot even connect to activemq. The broker should then return a 503 or similar HTTP error code along with a non-misleading error message, e.g. "The service is temporarily unavailable; sorry, please try later." Preferably there should also be clear error messages in the Rails log or the httpd error_log.

Additional info:
This is similar to the problem that occurs with mco directed requests to a node that is not answering, but that deserves a separate bug, as the "activemq down" problem ought to be a lot easier to detect and manage.
Tested this with the following packages:

[root@broker ~]# rpm -qa|grep mcollective
ruby193-mcollective-client-2.4.1-3.el6op.noarch
rubygem-openshift-origin-msg-broker-mcollective-1.22.2-1.git.167.c0332d5.el6op.noarch
ruby193-mcollective-common-2.4.1-3.el6op.noarch

[root@node1 ~]# rpm -qa|grep mcollective
ruby193-mcollective-common-2.4.1-3.el6op.noarch
openshift-origin-msg-node-mcollective-1.21.2-1.git.182.5e73e48.el6op.noarch
ruby193-mcollective-2.4.1-3.el6op.noarch

After stopping the activemq service and trying to restart the application, I still get the same error on the client side:

[root@broker conf.d]# rhc app restart app1
Password: ******
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server 'https://broker.ose-201403214.com.cn/broker/rest/application/532fa7ffcfb77f671400003c/events'.

Only MCollective reports this error, in ruby193-mcollective.log:

E, [2014-03-24T00:13:18.321231 #1084] ERROR -- : activemq.rb:133:in `on_miscerr' Unexpected error on connection stomp://mcollective.com.cn:61613: es_recv: connection.receive returning EOF as nil - resetting connection.

There is no clear error information about activemq in the httpd or broker logs; the only related entries are these in the httpd error_log:

[Sun Mar 23 23:52:14 2014] [error] [client 10.66.78.226] (70007)The timeout specified has expired: proxy: error reading status line from remote server 127.0.0.1
[Sun Mar 23 23:52:14 2014] [error] [client 10.66.78.226] proxy: Error reading from remote server returned by /broker/rest/application/532fa7ffcfb77f671400003c/events
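When triaging the "activemq down" case by hand, it can help to check whether the STOMP port from the log above (mcollective.com.cn:61613) accepts TCP connections at all, failing after a few seconds instead of hanging. The following is a minimal standalone Python sketch, not part of rhc, mco, or the broker; the function name stomp_port_reachable and the 3-second default timeout are hypothetical choices. A successful TCP connect only shows that the port is open; it does not prove ActiveMQ will accept the STOMP session (credentials, SSL, and so on).

```python
import socket


def stomp_port_reachable(host, port=61613, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical diagnostic helper: connection refused, timeouts, and
    connection-level errors all surface as OSError subclasses in Python 3,
    so any of them is reported as False. Note that the DNS lookup itself
    is not bounded by `timeout`.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, running stomp_port_reachable("mcollective.com.cn") from the broker with activemq stopped should return False within about the timeout (or immediately, if the connection is refused), which is the kind of fast failure the Expected results ask for.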
https://github.com/openshift/openshift-extras/pull/440

This changes the installer to set decent defaults for the mcollective timeouts. Code changes made before 2.1 allowed the timeout error to be displayed.

I would consider having ose-upgrade modify the mco configuration automatically, but for now I think it is best just to note the changes made:

broker: add to /opt/rh/ruby193/root/etc/mcollective/client.cfg

# Broker will retry ActiveMQ connection, then report error
plugin.activemq.initial_reconnect_delay = 0.1
plugin.activemq.max_reconnect_attempts = 6

node: add to /opt/rh/ruby193/root/etc/mcollective/server.cfg

# Node should retry connecting to ActiveMQ forever
plugin.activemq.max_reconnect_attempts = 0
plugin.activemq.initial_reconnect_delay = 0.1
plugin.activemq.max_reconnect_delay = 4.0
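As a rough sanity check of these values, the Python sketch below estimates the reconnect delays they imply. It assumes the Stomp client behind MCollective's activemq connector uses exponential back-off with a multiplier of 2 (believed to be the default for plugin.activemq.back_off_multiplier), and that the broker, which does not set max_reconnect_delay, gets a default of 30 seconds; it also ignores the time each connection attempt itself takes. The function reconnect_delays is hypothetical and is not part of MCollective or the stomp gem.

```python
def reconnect_delays(initial_delay, max_delay, max_attempts,
                     multiplier=2.0, limit=10):
    """Return the back-off delay (seconds) slept before each reconnect attempt.

    Assumed model (not taken from MCollective source): the delay starts at
    initial_delay, is multiplied by `multiplier` after every failed attempt,
    and never exceeds max_delay. max_attempts == 0 means "retry forever",
    so only the first `limit` delays are returned in that case.
    """
    count = max_attempts if max_attempts > 0 else limit
    delays = []
    delay = initial_delay
    for _ in range(count):
        delays.append(min(delay, max_delay))  # cap at max_delay
        delay *= multiplier                   # exponential back-off
    return delays


# Broker (client.cfg): 6 attempts; max_reconnect_delay assumed to be 30.0.
broker = reconnect_delays(0.1, 30.0, 6)
# Node (server.cfg): retry forever, delay capped at 4.0 seconds.
node = reconnect_delays(0.1, 4.0, 0, limit=8)

print("broker delays:", broker, "total ~%.1f s" % sum(broker))
print("node delays (first 8):", node)
```

Under those assumptions the broker spends roughly 6.3 seconds in back-off (0.1 + 0.2 + 0.4 + 0.8 + 1.6 + 3.2) before giving up with Stomp::Error::MaxReconnectAttempts, while the node's retry interval grows to 4.0 seconds and then stays there.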
Checked on puddle [2.1.z/2014-08-25.2]:
1. Create an application "phpapp"
2. Stop the activemq service
3. Try various commands involving the application, e.g.:
# rhc app-restart phpapp
# rhc cartridge add mysql -a phpapp
# rhc app show --gears -a phpapp
# rhc app create rb19 ruby-1.9

The output:
Unable to complete the requested operation due to: Could not connect to ActiveMQ Server: Stomp::Error::MaxReconnectAttempts. Please try again and contact support if the issue persists.
Reference ID: 3cf9f9789921ec0639dd54c4f1a81bb5

A useful error message is now given.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1183.html