Bug 1074553

Summary: [mcollective] Restart ruby193-mcollective results in 2 processes running
Product: OpenShift Online
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NEXTRELEASE
Severity: medium
Priority: medium
Reporter: Kenny Woodson <kwoodson>
Assignee: Jhon Honce <jhonce>
QA Contact: libra bugs <libra-bugs>
CC: dmace
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-03-10 15:56:45 UTC

Description Kenny Woodson 2014-03-10 14:27:47 UTC
Description of problem:

We run scripts to check and verify that the mcollective openshift agent is up and responsive on the nodes.  To do this, we run a query from a broker to get a single fact from all of the nodes.  On any node that does not respond, we attempt a 'service ruby193-mcollective restart'.
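
The flow is roughly the following. This is only a minimal sketch, not our production script: the inventory file, the 10 second timeout, and the use of 'mco find' in place of our actual single-fact query are assumptions for illustration.

#!/bin/bash
# Sketch of the broker-side health check. Nodes whose mcollective agent answers
# within the timeout are considered healthy; everything else gets a restart.
EXPECTED=/etc/openshift/expected-nodes.txt   # hypothetical inventory of node hostnames

# 'mco find' lists the node identities that respond to discovery, one per line.
mco find --timeout 10 | sort -u > /tmp/responding-nodes.txt

# Restart ruby193-mcollective on any expected node that did not respond.
for node in $(comm -23 <(sort -u "$EXPECTED") /tmp/responding-nodes.txt); do
    echo "No mcollective response from $node; restarting ruby193-mcollective"
    ssh "$node" "/usr/bin/runcon system_u:system_r:openshift_initrc_t:s0-s0:c0.c1023 \
        /usr/bin/sudo /sbin/service ruby193-mcollective restart"
done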

This restart has been resulting in two mcollective processes running.  The box still does not respond to our mcollective fact queries, which leaves mcollective in a bad state.

From the mcollective logs I could not determine why the node became unresponsive in the first place, but we have seen this on multiple nodes.

I believe the problem lies in how the new process is started. As you can see in the output below, there are two processes running that were started within 3 minutes of each other.

ps -efZ | grep server.cfg
unconfined_u:system_r:openshift_initrc_t:s0-s0:c0.c1023 root 346028 1  0 03:29 ? 00:00:08 ruby /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg

system_u:system_r:openshift_initrc_t:s0-s0:c0.c1023 root 352348 1  0 03:32 ? 00:00:06 ruby /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg
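
For what it's worth, this condition is easy to detect on a node. A quick sketch, assuming exactly one daemon is expected per node:

# Count mcollectived daemons started from server.cfg; more than one means the
# previous restart left a stale process behind.
COUNT=$(pgrep -f 'mcollectived.*--config=/opt/rh/ruby193/root/etc/mcollective/server.cfg' | wc -l)
if [ "$COUNT" -gt 1 ]; then
    echo "WARNING: $COUNT mcollectived processes running (expected 1)"
fi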


Here is the command that we use to restart the service.  It runs as the zabbix user:
/usr/bin/runcon system_u:system_r:openshift_initrc_t:s0-s0:c0.c1023 /usr/bin/sudo /sbin/service ruby193-mcollective restart
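
As a stop-gap on our side, a "hard" restart along these lines would make sure the old daemon is really gone before a new one starts. This is only a sketch of a possible workaround, not our deployed tooling; the grace period and the pkill fallback are assumptions:

#!/bin/bash
# Stop the service, make sure the mcollectived named in the pidfile is gone
# (killing it if the init script left it behind), then start fresh.
PIDFILE=/opt/rh/ruby193/root/var/run/mcollectived.pid

sudo /sbin/service ruby193-mcollective stop

if [ -f "$PIDFILE" ]; then
    OLDPID=$(cat "$PIDFILE")
    # Give the daemon a few seconds to exit, then force-kill it.
    for i in 1 2 3 4 5; do
        sudo kill -0 "$OLDPID" 2>/dev/null || break
        sleep 1
    done
    sudo kill -0 "$OLDPID" 2>/dev/null && sudo kill -9 "$OLDPID"
fi

# Clean up any stray mcollectived running against the same server.cfg.
sudo pkill -9 -f 'mcollectived.*--config=/opt/rh/ruby193/root/etc/mcollective/server.cfg' || true

sudo /sbin/service ruby193-mcollective start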

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-node-1.20.7-1.el6oso.noarch
openshift-origin-msg-node-mcollective-1.20.3-1.el6oso.noarch
openshift-origin-msg-common-1.18.4-1.el6oso.noarch

How reproducible:
We are seeing this quite often, but I don't know the exact steps to reproduce it.

Steps to Reproduce:
1.
2.
3.

Actual results:
Mcollective does not respond to queries for facts.  We issue a restart of the service, and two mcollectived processes are then left running.

Expected results:
When mcollective misbehaves, we restart the service as our monitoring user; the restart should kill the first process and start a new one, and the query for facts should then succeed.  This worked for us for a long time, and I'm not sure what exactly changed.

Additional info:
This is causing the operations team to lose sleep.  Let's please fix this, as it is very, very frustrating.

We'd like to figure out why mco is not responding, but it is more important that the service script actually kill the old process and restart cleanly.

Comment 1 Jhon Honce 2014-03-10 15:56:45 UTC
PR https://github.com/openshift/origin-server/pull/4843 provides a timeout for the facts call that is shorter than the mcollective agent timeout.  It is hoped this will help with these issues.

Please re-open if this is not the case.