Bug 1074553 - [mcollective] Restart ruby193-mcollective results in 2 processes running
Summary: [mcollective] Restart ruby193-mcollective results in 2 processes running
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jhon Honce
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-03-10 14:27 UTC by Kenny Woodson
Modified: 2015-05-14 23:35 UTC
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-10 15:56:45 UTC
Target Upstream Version:
Embargoed:



Description Kenny Woodson 2014-03-10 14:27:47 UTC
Description of problem:

We run scripts to verify that the openshift mcollective agent is up and responsive on the nodes.  To do this we query a single fact from all the nodes from a broker.  On any node that does not respond we attempt a 'service ruby193-mcollective restart'.
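
Roughly, the check works like this (a simplified sketch, not our real script; the expected-nodes file and the plain ssh restart below are placeholders, and the real restart goes through runcon/sudo as shown further down):

#!/bin/bash
# Simplified sketch of the health check, not the real script.
# Assumes a flat file of expected node hostnames (placeholder path)
# and that the broker can ssh to the nodes.
EXPECTED_NODES=/etc/openshift/expected-nodes.txt   # hypothetical path

# 'mco ping' prints one line per responding identity, e.g.
#   node1.example.com    time=41.87 ms
responding=$(mco ping | awk '/time=/ {print $1}' | sort -u)

while read -r node; do
    if ! grep -qx "$node" <<< "$responding"; then
        echo "$node did not respond, restarting ruby193-mcollective"
        ssh "$node" 'service ruby193-mcollective restart'
    fi
done < "$EXPECTED_NODES"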

This has been resulting in two mcollective processes running.  The box still does not respond to our mcollective fact queries, which leaves mcollective in a bad state.

From the mcollective logs I could not determine why the node went unresponsive in the first place, but this has been seen on multiple nodes.

I believe the problem lies in how the new process is started.  As you can see in the following output, two processes are running, started within 3 minutes of each other.

ps -efZ | grep server.cfg
unconfined_u:system_r:openshift_initrc_t:s0-s0:c0.c1023 root 346028 1  0 03:29 ? 00:00:08 ruby /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg

system_u:system_r:openshift_initrc_t:s0-s0:c0.c1023 root 352348 1  0 03:32 ? 00:00:06 ruby /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg
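
A quick way to spot the duplicate-process state from the output above (a minimal check; the pgrep pattern just matches the command line shown):

# Count mcollectived processes pointed at the SCL server.cfg;
# anything other than 1 means a restart left a stale process behind.
count=$(pgrep -f 'mcollectived.*/opt/rh/ruby193/root/etc/mcollective/server.cfg' | wc -l)
if [ "$count" -ne 1 ]; then
    echo "WARNING: $count mcollectived processes running (expected 1)"
fi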


Here is the command that we use to restart the service.  It runs as the zabbix user:
/usr/bin/runcon system_u:system_r:openshift_initrc_t:s0-s0:c0.c1023 /usr/bin/sudo /sbin/service ruby193-mcollective restart
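
For reference, this is the kill-then-start behaviour we expect the restart to have (a sketch only, not the actual ruby193-mcollective init script; paths are taken from the ps output above):

#!/bin/bash
# Sketch of the expected restart: make sure the old daemon is actually
# gone before starting a new one, instead of leaving two running.
PIDFILE=/opt/rh/ruby193/root/var/run/mcollectived.pid

if [ -f "$PIDFILE" ]; then
    oldpid=$(cat "$PIDFILE")
    kill "$oldpid" 2>/dev/null
    # Wait up to 30 seconds for the old daemon to exit rather than
    # racing it by starting a second copy immediately.
    for _ in $(seq 1 30); do
        kill -0 "$oldpid" 2>/dev/null || break
        sleep 1
    done
    kill -0 "$oldpid" 2>/dev/null && kill -9 "$oldpid"
fi

/opt/rh/ruby193/root/usr/sbin/mcollectived \
    --pid="$PIDFILE" \
    --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg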

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-node-1.20.7-1.el6oso.noarch
openshift-origin-msg-node-mcollective-1.20.3-1.el6oso.noarch
openshift-origin-msg-common-1.18.4-1.el6oso.noarch

How reproducible:
We are seeing this quite often. I don't know the exact steps.

Steps to Reproduce:
1.
2.
3.

Actual results:
Mcollective is not responding to queries for facts.  We issue a restart of the service, and two mcollectived processes are left running.

Expected results:
When mcollective misbehaves, we restart the service as our monitoring user; the restart should kill the first process and start a new one, and the query for facts should then succeed.  This worked for us for a long time, and I'm not sure what exactly changed.

Additional info:
This is causing the operations team to lose sleep.  Let's please fix this as it is very, very frustrating.

We'd like to figure out why mcollective is not responding, but it is more important to have the service script actually kill the old process and restart properly.

Comment 1 Jhon Honce 2014-03-10 15:56:45 UTC
PR https://github.com/openshift/origin-server/pull/4843 adds a timeout to the facts call that is shorter than the mcollective agent timeout.  It is hoped this will help with these issues.
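
Conceptually the change bounds the fact-gathering step with a timeout shorter than the agent timeout, so a hung call fails fast instead of leaving the agent wedged. A rough shell analogy of that idea (not the actual Ruby change in the PR; the timeout value and the facter call are stand-ins):

# Rough analogy only: bound the fact-gathering step with a timeout
# kept below the mcollective agent timeout, so a hang returns an
# error quickly instead of wedging the agent.
FACTS_TIMEOUT=10    # hypothetical value, below the agent timeout

if ! timeout "$FACTS_TIMEOUT" facter --yaml > /tmp/facts.yaml; then
    echo "fact gathering timed out after ${FACTS_TIMEOUT}s" >&2
fi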

Please re-open if this is not the case.

