Bug 968192

Summary: Sometimes, after openshift.sh finishes and the services are restarted, oo-diagnostics reports "No request sent, we did not discover any nodes."
Product: OpenShift Container Platform
Component: Node
Version: 1.2.0
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Reporter: Jan Pazdziora <jpazdziora>
Assignee: Jason DeTiberus <jdetiber>
QA Contact: libra bugs <libra-bugs>
CC: bleanhar, jdetiber, jpazdziora, libra-onpremise-devel, mmasters
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Target Milestone: ---
Target Release: ---
Last Closed: 2013-05-30 17:44:42 UTC

Description Jan Pazdziora 2013-05-29 08:04:50 UTC
Description of problem:

I run openshift.sh to install OpenShift Enterprise, then restart the services (I cannot really reboot the machine after openshift.sh finishes), and then run oo-diagnostics.

Sometimes oo-diagnostics passes on both the broker and the node; sometimes it fails with

INFO: running: test_broker_cache_permissions

No request sent, we did not discover any nodes.
FAIL: test_nodes_public_hostname
No node hosts responded. Run 'mco ping' and troubleshoot if this is unexpected.
INFO: broker application cache permissions appear fine
INFO: running: test_nodes_public_hostname
INFO: checking that each public_hostname resolves properly
INFO: running: test_nodes_public_ip
INFO: checking that public_ip has been set for all nodes

No request sent, we did not discover any nodes.
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective

No request sent, we did not discover any nodes.
FAIL: test_node_profiles_districts_from_broker
          No node hosts found. Please install some,
          or ensure the existing ones respond to 'mco ping'.
          OpenShift cannot host gears without at least one node host responding.
INFO: skipping test_node_profiles_districts_from_broker
INFO: running: test_broker_accept_scripts

Version-Release number of selected component (if applicable):

OpenShiftEnterprise/1.2/2013-05-23.2 installed using https://raw.github.com/openshift/openshift-extras/enterprise-1.2/enterprise/install-scripts/generic/openshift.sh

How reproducible:

Not deterministic.

Steps to Reproduce:
1. Run openshift.sh on two machines, one serving as the node, the other as the broker (plus named, activemq, and datastore).
2. After openshift.sh finishes on both machines, restart the services. I use

for i in activemq mcollective mongod $(cd /etc/init.d && ls openshift-*) cgred oddjobd httpd ; do echo $i ; service $i restart ; done

to hopefully restart everything that needs restarting.

3. Wait for the restart to finish on both machines.
4. Download oo-diagnostics and run ./oo-diagnostics -v -w 1.

Actual results:

INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
INFO: running: test_enterprise_rpms
INFO: Checking that all OpenShift RPMs are actually from OpenShift Enterprise
INFO: running: test_selinux_policy_rpm
INFO: rpm selinux-policy installed with at least version 3.7.19-155.el6_3.8
INFO: running: test_selinux_enabled
INFO: running: test_broker_cache_permissions

No request sent, we did not discover any nodes.
FAIL: test_nodes_public_hostname
No node hosts responded. Run 'mco ping' and troubleshoot if this is unexpected.
INFO: broker application cache permissions appear fine
INFO: running: test_nodes_public_hostname
INFO: checking that each public_hostname resolves properly
INFO: running: test_nodes_public_ip
INFO: checking that public_ip has been set for all nodes

No request sent, we did not discover any nodes.
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective

No request sent, we did not discover any nodes.
FAIL: test_node_profiles_districts_from_broker
          No node hosts found. Please install some,
          or ensure the existing ones respond to 'mco ping'.
          OpenShift cannot host gears without at least one node host responding.
INFO: skipping test_node_profiles_districts_from_broker
INFO: running: test_broker_accept_scripts
INFO: running oo-accept-broker
INFO: oo-accept-broker ran without error:
--BEGIN OUTPUT--
NOTICE: SELinux is Enforcing
NOTICE: SELinux is  Enforcing
PASS

--END oo-accept-broker OUTPUT--
INFO: running oo-accept-systems -w 1.0
INFO: oo-accept-systems -w 1.0 ran without error:
--BEGIN OUTPUT--
PASS

--END oo-accept-systems -w 1.0 OUTPUT--
INFO: running: test_node_accept_scripts
INFO: skipping test_node_accept_scripts
INFO: running: test_broker_httpd_error_log
INFO: running: test_broker_passenger_ps
INFO: checking the broker application process tree
INFO: running: test_for_nonrpm_rubygems
INFO: checking for presence of gem-installed rubygems
INFO: looking in /opt/rh/ruby193/root/usr/local/share/gems/specifications/*.gemspec /opt/rh/ruby193/root/usr/share/gems/specifications/*.gemspec
INFO: running: test_for_multiple_gem_versions
INFO: checking for presence of gem-installed rubygems
INFO: running: test_node_httpd_error_log
INFO: running: test_node_mco_log
INFO: skipping test_node_mco_log
INFO: running: test_pam_openshift
INFO: skipping test_pam_openshift
INFO: running: test_services_enabled
INFO: checking that required services are running now
INFO: checking that required services are enabled at boot
INFO: running: test_node_quota_bug
INFO: skipping test_node_quota_bug
INFO: running: test_vhost_servernames
INFO: checking for vhost interference problems
INFO: running: test_altered_package_owned_configs
INFO: running: test_broken_httpd_version
2 ERRORS

Expected results:

INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
INFO: running: test_enterprise_rpms
INFO: Checking that all OpenShift RPMs are actually from OpenShift Enterprise
INFO: running: test_selinux_policy_rpm
INFO: rpm selinux-policy installed with at least version 3.7.19-155.el6_3.8
INFO: running: test_selinux_enabled
INFO: running: test_broker_cache_permissions
INFO: broker application cache permissions appear fine
INFO: running: test_nodes_public_hostname
INFO: checking that each public_hostname resolves properly
INFO: PUBLIC_HOSTNAME node.example.com for node.example.com resolves to 10.16.46.57
INFO: checking that each public_hostname is unique
INFO: running: test_nodes_public_ip
INFO: checking that public_ip has been set for all nodes
INFO: PUBLIC_IP 10.16.46.57 for node.example.com
INFO: checking that public_ip is unique for all nodes
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective
INFO: profile for node.example.com: small
WARN: test_node_profiles_districts_from_broker
        No districts are defined. Districts should be used in any production installation.
        Please consult the Administration Guide.
INFO: skipping test_node_profiles_districts_from_broker
INFO: running: test_broker_accept_scripts
INFO: running oo-accept-broker
INFO: oo-accept-broker ran without error:
--BEGIN OUTPUT--
NOTICE: SELinux is Enforcing
NOTICE: SELinux is  Enforcing
PASS

--END oo-accept-broker OUTPUT--
INFO: running oo-accept-systems -w 1.0
INFO: oo-accept-systems -w 1.0 ran without error:
--BEGIN OUTPUT--
PASS

--END oo-accept-systems -w 1.0 OUTPUT--
INFO: running: test_node_accept_scripts
INFO: skipping test_node_accept_scripts
INFO: running: test_broker_httpd_error_log
INFO: running: test_broker_passenger_ps
INFO: checking the broker application process tree
INFO: running: test_for_nonrpm_rubygems
INFO: checking for presence of gem-installed rubygems
INFO: looking in /opt/rh/ruby193/root/usr/local/share/gems/specifications/*.gemspec /opt/rh/ruby193/root/usr/share/gems/specifications/*.gemspec
INFO: running: test_for_multiple_gem_versions
INFO: checking for presence of gem-installed rubygems
INFO: running: test_node_httpd_error_log
INFO: running: test_node_mco_log
INFO: skipping test_node_mco_log
INFO: running: test_pam_openshift
INFO: skipping test_pam_openshift
INFO: running: test_services_enabled
INFO: checking that required services are running now
INFO: checking that required services are enabled at boot
INFO: running: test_node_quota_bug
INFO: skipping test_node_quota_bug
INFO: running: test_vhost_servernames
INFO: checking for vhost interference problems
INFO: running: test_altered_package_owned_configs
INFO: running: test_broken_httpd_version
1 WARNINGS
NO ERRORS

Additional info:

Comment 2 Jason DeTiberus 2013-05-29 20:02:31 UTC
Jan,

Could you try adding a 30 second sleep between the services restart and the oo-diagnostics run?

This should allow sufficient time for mcollective to re-establish a connection with activemq following the service restarts.

I believe the problem you are experiencing occurs when the activemq service on the broker is restarted after the mcollective service has already been restarted on the node.
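
For reference, a minimal sketch of the suggested sequence, reusing the restart loop and the oo-diagnostics invocation from the description (the 30 seconds is a heuristic, not a guaranteed bound):

  # Restart everything, then give mcollective time to reconnect to activemq
  # before running the diagnostics (service list taken from the description).
  for i in activemq mcollective mongod $(cd /etc/init.d && ls openshift-*) cgred oddjobd httpd ; do
      service $i restart
  done
  sleep 30
  ./oo-diagnostics -v -w 1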

Comment 5 Jason DeTiberus 2013-05-30 13:07:58 UTC
Adding synchronization between the broker and node so that the broker services are restarted before the node services should be sufficient, yes.
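
A minimal sketch of that ordering, assuming the two-host layout from the description (run each loop on the respective host; the service lists are illustrative):

  # On the broker host (also running named, activemq, and datastore):
  for i in activemq mongod $(cd /etc/init.d && ls openshift-*) httpd ; do
      service $i restart
  done

  # Only after the broker services are back up, on the node host:
  for i in mcollective $(cd /etc/init.d && ls openshift-*) cgred oddjobd httpd ; do
      service $i restart
  done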

Comment 6 Miciah Dashiel Butler Masters 2013-05-30 13:11:22 UTC
The service script for ActiveMQ can return before the daemon is ready to accept connections.  Initialisation of the daemon can take a couple minutes, so there could still be problems.  It would be helpful to have /var/log/activemq/activemq.log to see whether that's involved in the problem reported here.
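
A rough way to wait for the daemon rather than the service script, assuming ActiveMQ's default STOMP port 61613 (a sketch, not part of any shipped script):

  # The init script can return early, so poll until ActiveMQ is actually
  # listening on its STOMP port before restarting anything that depends on it.
  until lsof -iTCP:61613 -sTCP:LISTEN >/dev/null 2>&1 ; do
      sleep 5
  done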

Comment 7 Jan Pazdziora 2013-05-30 13:15:54 UTC
(In reply to Jason DeTiberus from comment #5)
> Adding synchronization between the broker and node so that the broker
> services are restarted before the node services should be sufficient, yes.

Thanks. I will do that and see how it works. I assume this bugzilla can be closed as NOTABUG?

Comment 9 Jan Pazdziora 2013-05-30 13:22:48 UTC
(In reply to Miciah Dashiel Butler Masters from comment #6)
> The service script for ActiveMQ can return before the daemon is ready to
> accept connections.  Initialisation of the daemon can take a couple minutes,
> so there could still be problems.  It would be helpful to have
> /var/log/activemq/activemq.log to see whether that's involved in the problem
> reported here.

I think it would be very helpful to have a script supported by OpenShift developers to start and restart the services, and that could add the necessary waits and guarantee that when it finishes, all services are in a ready state.

For example, in Spacewalk, we have

  https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-service

which calls

  https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-startup-helper

to verify, for example, that tomcat is up and accepting connections (we test with lsof) before starting httpd.

This ensures that users cannot hit Apache and get a 503 because tomcat is not yet ready: once spacewalk-service start finishes (unless it failed badly), the daemons of all components are ready to accept connections and serve.

I probably can create an initial version of such a script if it would be viewed as useful.
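
A minimal sketch of what such a helper's core could look like, assuming bash and the lsof test described above (the function name, timeout, and ports are illustrative):

  # Wait up to ~2 minutes for a local daemon to listen on the given port.
  wait_for_port() {
      local port=$1 tries=60
      while [ $tries -gt 0 ] ; do
          lsof -iTCP:$port -sTCP:LISTEN >/dev/null 2>&1 && return 0
          sleep 2
          tries=$((tries - 1))
      done
      echo "service on port $port did not come up" >&2
      return 1
  }

  service activemq start
  wait_for_port 61613   # ActiveMQ STOMP, used by mcollective
  service mongod start
  wait_for_port 27017   # MongoDB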

Comment 10 Jason DeTiberus 2013-05-30 13:28:28 UTC
(In reply to Jan Pazdziora from comment #9)
> I think it would be very helpful to have a script supported by OpenShift
> developers to start and restart the services, and that could add the
> necessary waits and guarantee that when it finishes, all services are in a
> ready state.
> 
> For example, in Spacewalk, we have
> 
>   https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-service
> 
> which calls
> 
>   https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-startup-helper
> 
> to verify, for example, that tomcat is up and accepting connections (we
> test with lsof) before starting httpd.
> 
> This ensures that users cannot hit Apache and get a 503 because tomcat is
> not yet ready: once spacewalk-service start finishes (unless it failed
> badly), the daemons of all components are ready to accept connections and
> serve.
> 
> I probably can create an initial version of such a script if it would be
> viewed as useful.

I think a script like that could be useful in an all-in-one deployment, but once services are broken out onto multiple hosts (as we recommend for production), the script gets complicated.

Potentially we could have the broker script verify that the services it depends on are up before starting (activemq, mongo, etc), however I don't think we necessarily want those to fail if the services aren't available either (since they will reconnect to those services when they become available).