Description of problem:
In a setup with an external ActiveMQ machine and another machine serving the rest of the functions, set up using openshift.sh, an attempt to add the broker via

  oo-admin-ctl-district -c add-node -n small_district -i broker.example.com

hangs.

Version-Release number of selected component (if applicable):
OpenShift Enterprise 2.1

How reproducible:
Deterministic

Steps to Reproduce:
1. Install Enterprise 2.1 via openshift.sh in such a way that the broker also serves as node, datastore, and DNS server, with a separate ActiveMQ machine.
2. Use CONF_NO_SCRAMBLE to make sure the passwords match (should this be needed?).
3. Run oo-admin-ctl-district -c create -n small_district -p small; that passes.
4. Run oo-admin-ctl-district -c add-node -n small_district -i broker.example.com

Actual results:
It blocks indefinitely. When killed, the process outputs:

{"_id"=>"5398a29b6892df0b80000001", "uuid"=>"5398a29b6892df0b80000001", "available_uids"=>"<6000 uids hidden>", "name"=>"small_district", "gear_size"=>"small", "available_capacity"=>6000, "max_uid"=>6999, "max_capacity"=>6000, "active_servers_size"=>0, "updated_at"=>2014-06-11 18:40:27 UTC, "created_at"=>2014-06-11 18:40:27 UTC}
ERROR OUTPUT:
Error for node 'broker.example.com': Could not connect to ActiveMQ Server: SIGTERM

Expected results:
Something like (taken from an all-in-one installation):

{"_id"=>"53985f926892df5c4b000001", "active_servers_size"=>1, "available_capacity"=>6000, "available_uids"=>"<6000 uids hidden>", "created_at"=>2014-06-11 13:54:26 UTC, "gear_size"=>"small", "max_capacity"=>6000, "max_uid"=>6999, "name"=>"small_district", "servers"=> [{"_id"=>"53985f9a6892df78f1000001", "active"=>true, "name"=>"broker.example.com", "unresponsive"=>false}], "updated_at"=>2014-06-11 13:54:26 UTC, "uuid"=>"53985f926892df5c4b000001"}

Additional info:
(In reply to Jan Pazdziora from comment #0)
> 2. Use CONF_NO_SCRAMBLE to make sure the passwords match (should this be
> needed)?

You need to either do that or configure the mcollective password directly (CONF_MCOLLECTIVE_PASSWORD), on ALL hosts; it is configured on the broker, node, and ActiveMQ hosts alike.

> 3. Run oo-admin-ctl-district -c create -n small_district -p small, that
> passes.

... which doesn't touch mcollective.

> 4. Run oo-admin-ctl-district -c add-node -n small_district -i ...

... which *does* use mcollective. Check with oo-mco ping whether mcollective is working. Probably it is not; the question is whether the cause is DNS/networking related or something else. I don't think it would hang if it were just a password mismatch, though the mcollective timeout defaults to 3 minutes and could probably be lowered for this use case.
The cause of the problem was indeed a missing DNS record for the ActiveMQ server. So this bugzilla can be closed as NOTABUG, unless you feel that openshift.sh (?) should detect that the ActiveMQ server is unreachable (unresolvable, even) and fail, complaining loudly. Or maybe oo-admin-ctl-district could detect that early enough and report the issue, rather than waiting for a response for (what seems to be) a long time.
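To illustrate the kind of early detection suggested above, here is a minimal Ruby sketch of a fail-fast hostname resolution check. This is not existing OpenShift code; the `resolvable?` helper name and the 5-second default are assumptions for illustration only.

```ruby
require 'resolv'
require 'timeout'

# Hypothetical pre-flight check: verify a hostname resolves before
# attempting any mcollective/ActiveMQ communication, so a tool can
# fail within seconds instead of hanging.
def resolvable?(hostname, seconds = 5)
  Timeout.timeout(seconds) { Resolv.getaddress(hostname) }
  true
rescue Resolv::ResolvError, Timeout::Error
  false
end
```

A tool like oo-admin-ctl-district could run such a check against the configured ActiveMQ hostname and print an informative error immediately when resolution fails.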
Even though this is in some sense "user error", it has been noted that OpenShift does not deal well with ActiveMQ being unavailable in a number of scenarios. I have no doubt customers will run into this problem, and it will reflect badly on OpenShift. With that in mind, a few improvements to consider:

1. Do some basic sanity testing as part of post_deploy, specifically testing whether known hostnames resolve. We don't want to recreate oo-diagnostics, but some obvious and common puzzlers like this one could be avoided, and DNS in particular is often a pain point. If the hostname at least resolves, then other failure modes, such as resolving to the wrong host or the host being unreachable, at least get better and possibly even produce informative errors.

2. Have oo-admin-ctl-district specify a lower mco timeout. It will not need 3 minutes to get and set district info on the node; more like 10 seconds at the outside. I am not certain that this timeout kicks in if the ActiveMQ host doesn't even resolve (it may require an actual request to be sent); we should check.

3. Longer term: see if there is any way to get the broker mco plugin to catch this condition across the board in a reasonable amount of time, rather than 3 minutes (or forever).
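As a complement to the timeout idea in point 2, a reachability probe against the ActiveMQ port itself can also fail within seconds. The sketch below is illustrative Ruby, not OpenShift code; the `activemq_reachable?` name, the default STOMP port 61613, and the 5-second timeout are assumptions.

```ruby
require 'socket'

# Hypothetical fast probe of the ActiveMQ messaging port: attempt a TCP
# connection with a short connect timeout, returning false on refusal,
# resolution failure, or timeout instead of blocking for minutes.
def activemq_reachable?(host, port = 61613, seconds = 5)
  Socket.tcp(host, port, connect_timeout: seconds) { true }
rescue SystemCallError, SocketError
  false
end
```

A check like this, run before issuing mcollective requests, would let the admin tools report "cannot reach ActiveMQ on host:port" immediately rather than appearing to hang.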
Consolidating all "hangs on activemq down" bugs. *** This bug has been marked as a duplicate of bug 1122872 ***