Bug 1108462 - oo-admin-ctl-district -c add-node hangs when ActiveMQ server hostname does not resolve
Summary: oo-admin-ctl-district -c add-node hangs when ActiveMQ server hostname does not resolve
Keywords:
Status: CLOSED DUPLICATE of bug 1122872
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Luke Meyer
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-12 05:54 UTC by Jan Pazdziora
Modified: 2014-07-24 13:39 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-07-24 13:39:27 UTC
Target Upstream Version:
Embargoed:



Description Jan Pazdziora 2014-06-12 05:54:58 UTC
Description of problem:

In a setup with an external ActiveMQ machine and another machine serving the rest of the functions, deployed using openshift.sh, an attempt to add the broker via

oo-admin-ctl-district -c add-node -n small_district -i broker.example.com

hangs.
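
For reference, a quick pre-flight check that the ActiveMQ hostname resolves from the broker would be something like the following (activemq.example.com is a placeholder, not the actual hostname from this setup):

getent hosts activemq.example.com || echo "activemq hostname does not resolve"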

Version-Release number of selected component (if applicable):

OpenShift Enterprise 2.1

How reproducible:

Deterministic

Steps to Reproduce:
1. Install Enterprise 2.1 via openshift.sh in such a way that the broker also serves as node, datastore, and DNS server, with a separate ActiveMQ machine (see the sketch after this list).
2. Use CONF_NO_SCRAMBLE to make sure the passwords match (should this be needed?).
3. Run oo-admin-ctl-district -c create -n small_district -p small, which passes.
4. Run oo-admin-ctl-district -c add-node -n small_district -i broker.example.com
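
For step 1, the invocation was along these lines (a sketch only; the CONF_* settings shown are assumptions about openshift.sh's interface, not copied from this install):

# on the broker/node/datastore/named machine
CONF_NO_SCRAMBLE=true \
CONF_INSTALL_COMPONENTS=broker,named,datastore,node \
CONF_ACTIVEMQ_HOSTNAME=activemq.example.com \
sh openshift.sh

# on the separate ActiveMQ machine
CONF_NO_SCRAMBLE=true CONF_INSTALL_COMPONENTS=activemq sh openshift.sh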

Actual results:

It blocks indefinitely.

When killed, the process outputs

{"_id"=>"5398a29b6892df0b80000001",
 "uuid"=>"5398a29b6892df0b80000001",
 "available_uids"=>"<6000 uids hidden>",
 "name"=>"small_district",
 "gear_size"=>"small",
 "available_capacity"=>6000,
 "max_uid"=>6999,
 "max_capacity"=>6000,
 "active_servers_size"=>0,
 "updated_at"=>2014-06-11 18:40:27 UTC,
 "created_at"=>2014-06-11 18:40:27 UTC}

ERROR OUTPUT:
Error for node 'broker.example.com': Could not connect to ActiveMQ Server: SIGTERM

Expected results:

Something like this (taken from an all-on-one installation):
{"_id"=>"53985f926892df5c4b000001",
 "active_servers_size"=>1,
 "available_capacity"=>6000,
 "available_uids"=>"<6000 uids hidden>",
 "created_at"=>2014-06-11 13:54:26 UTC,
 "gear_size"=>"small",
 "max_capacity"=>6000,
 "max_uid"=>6999,
 "name"=>"small_district",
 "servers"=>
  [{"_id"=>"53985f9a6892df78f1000001",
    "active"=>true,
    "name"=>"broker.example.com",
    "unresponsive"=>false}],
 "updated_at"=>2014-06-11 13:54:26 UTC,
 "uuid"=>"53985f926892df5c4b000001"}

Additional info:

Comment 3 Luke Meyer 2014-06-12 13:08:28 UTC
(In reply to Jan Pazdziora from comment #0)

> 2. Use CONF_NO_SCRAMBLE to make sure the passwords match (should this be
> needed)?

You need to either do that or configure the mcollective password directly (CONF_MCOLLECTIVE_PASSWORD) on ALL hosts; it's configured into the broker, the node, and activemq as well.
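
One way to verify the passwords actually match is to compare the configured value on each host; the paths below assume the OSE 2.x ruby193 SCL layout and may differ per install:

# broker (mcollective client side)
grep password /opt/rh/ruby193/root/etc/mcollective/client.cfg
# node (mcollective server side)
grep password /opt/rh/ruby193/root/etc/mcollective/server.cfg
# activemq defines the same account
grep -i password /etc/activemq/activemq.xml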

> 3. Run oo-admin-ctl-district -c create -n small_district -p small, that
> passes.

... which doesn't touch mcollective

> 4. Run oo-admin-ctl-district -c add-node -n small_district -i

... which *does* use mcollective.

Check with oo-mco ping whether mcollective is working; probably it isn't. The question is whether the cause is DNS/networking-related or something else. I don't think it would hang if it were just a password mismatch, though I believe the mcollective timeout defaults to 3 minutes and could probably be lowered for this use case.
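
For reference, a healthy oo-mco ping run from the broker looks roughly like this (output illustrative, not taken from this install):

oo-mco ping
node.example.com                         time=41.97 ms

---- ping statistics ----
1 replies max: 41.97 min: 41.97 avg: 41.97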

Comment 9 Jan Pazdziora 2014-06-13 07:00:54 UTC
The cause of the problem was indeed a missing DNS record for the ActiveMQ server.

So this bugzilla can be closed as NOTABUG, unless you feel that openshift.sh (?) should detect that ActiveMQ is unreachable (unresolvable, even) and fail loudly. Or maybe oo-admin-ctl-district could detect that early and report the issue, rather than waiting for a response for (what seems to be) a very long time.

Comment 10 Luke Meyer 2014-06-13 16:55:39 UTC
Even though this is in some sense "user error", it's been noted that OpenShift doesn't deal well with ActiveMQ being unavailable in a number of scenarios. I have no doubt customers will run into this problem, and it will reflect badly on OpenShift.

With that in mind, a few improvements to consider:

1. Do some basic sanity testing as part of post_deploy, specifically to test whether known hostnames resolve. We don't want to recreate oo-diagnostics, but some obvious and common puzzlers like this one could be avoided, and DNS in particular is often a pain point. If a hostname at least resolves, the remaining failure modes (resolving to the wrong host, the host being unreachable) become more tractable and may even yield informative errors. (A minimal sketch follows this list.)

2. Have oo-admin-ctl-district specify a lower mco timeout. It does not need 3 minutes to get and set district info on the node; 10 seconds at the outside is more realistic. I am not certain that this timeout kicks in if the activemq hostname doesn't even resolve (it may require an actual request to be sent); we should check.

3. Longer term, see if there's any way to get the broker mco plugin to catch this condition across the board in a reasonable amount of time, rather than 3 minutes (or forever).
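
By way of illustration for the above, a few sketches; the hostnames, paths, flags, and values here are assumptions, not taken from this install.

For item 1, a post_deploy resolution check could be as simple as:

# hypothetical sanity check; hostnames stand in for the configured ones
for h in broker.example.com activemq.example.com node.example.com; do
    getent hosts "$h" >/dev/null || echo "ERROR: $h does not resolve" >&2
done

For item 2, the standard mco client timeout option should suffice for a manual test, assuming oo-mco passes it through:

oo-mco ping -t 10

For item 3, assuming the stock MCollective ActiveMQ connector, capping its reconnect attempts in client.cfg would make a dead or unresolvable broker fail fast instead of retrying forever:

# /opt/rh/ruby193/root/etc/mcollective/client.cfg (path assumed); values illustrative
plugin.activemq.max_reconnect_attempts = 6
plugin.activemq.initial_reconnect_delay = 0.1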

Comment 11 Luke Meyer 2014-07-24 13:39:27 UTC
Consolidating all "hangs on activemq down" bugs.

*** This bug has been marked as a duplicate of bug 1122872 ***

