Bug 968192 - Sometimes after finishing openshift.sh and restarting services, oo-diagnostics reports No request sent, we did not discover any nodes.
Status: CLOSED NOTABUG
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 1.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Jason DeTiberus
QA Contact: libra bugs
Reported: 2013-05-29 04:04 EDT by Jan Pazdziora
Modified: 2017-03-08 12 EST
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2013-05-30 13:44:42 EDT
Type: Bug


Attachments: None
Description Jan Pazdziora 2013-05-29 04:04:50 EDT
Description of problem:

I run openshift.sh to install OpenShift Enterprise, followed by a restart of the services (I cannot really reboot the machine after openshift.sh finishes), and then run oo-diagnostics.

Sometimes oo-diagnostics passes on both the broker and the node; sometimes it fails with

INFO: running: test_broker_cache_permissions

No request sent, we did not discover any nodes.
FAIL: test_nodes_public_hostname
No node hosts responded. Run 'mco ping' and troubleshoot if this is unexpected.
INFO: broker application cache permissions appear fine
INFO: running: test_nodes_public_hostname
INFO: checking that each public_hostname resolves properly
INFO: running: test_nodes_public_ip
INFO: checking that public_ip has been set for all nodes

No request sent, we did not discover any nodes.
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective

No request sent, we did not discover any nodes.
FAIL: test_node_profiles_districts_from_broker
          No node hosts found. Please install some,
          or ensure the existing ones respond to 'mco ping'.
          OpenShift cannot host gears without at least one node host responding.
INFO: skipping test_node_profiles_districts_from_broker
INFO: running: test_broker_accept_scripts

Version-Release number of selected component (if applicable):

OpenShiftEnterprise/1.2/2013-05-23.2 installed using https://raw.github.com/openshift/openshift-extras/enterprise-1.2/enterprise/install-scripts/generic/openshift.sh

How reproducible:

Not deterministic.

Steps to Reproduce:
1. Run openshift.sh on two machines, one serving as the node, the other as the broker (plus named, activemq, and datastore).
2. After openshift.sh finishes on both machines, restart the services. I use

for i in activemq mcollective mongod $(cd /etc/init.d && ls openshift-*) cgred oddjobd httpd ; do echo $i ; service $i restart ; done

to hopefully restart everything that needs restarting.

3. Wait for the restart to finish on both machines.
4. Download oo-diagnostics and run ./oo-diagnostics -v -w 1.

Actual results:

INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
INFO: running: test_enterprise_rpms
INFO: Checking that all OpenShift RPMs are actually from OpenShift Enterprise
INFO: running: test_selinux_policy_rpm
INFO: rpm selinux-policy installed with at least version 3.7.19-155.el6_3.8
INFO: running: test_selinux_enabled
INFO: running: test_broker_cache_permissions

No request sent, we did not discover any nodes.
FAIL: test_nodes_public_hostname
No node hosts responded. Run 'mco ping' and troubleshoot if this is unexpected.
INFO: broker application cache permissions appear fine
INFO: running: test_nodes_public_hostname
INFO: checking that each public_hostname resolves properly
INFO: running: test_nodes_public_ip
INFO: checking that public_ip has been set for all nodes

No request sent, we did not discover any nodes.
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective

No request sent, we did not discover any nodes.
FAIL: test_node_profiles_districts_from_broker
          No node hosts found. Please install some,
          or ensure the existing ones respond to 'mco ping'.
          OpenShift cannot host gears without at least one node host responding.
INFO: skipping test_node_profiles_districts_from_broker
INFO: running: test_broker_accept_scripts
INFO: running oo-accept-broker
INFO: oo-accept-broker ran without error:
--BEGIN OUTPUT--
NOTICE: SELinux is Enforcing
NOTICE: SELinux is  Enforcing
PASS

--END oo-accept-broker OUTPUT--
INFO: running oo-accept-systems -w 1.0
INFO: oo-accept-systems -w 1.0 ran without error:
--BEGIN OUTPUT--
PASS

--END oo-accept-systems -w 1.0 OUTPUT--
INFO: running: test_node_accept_scripts
INFO: skipping test_node_accept_scripts
INFO: running: test_broker_httpd_error_log
INFO: running: test_broker_passenger_ps
INFO: checking the broker application process tree
INFO: running: test_for_nonrpm_rubygems
INFO: checking for presence of gem-installed rubygems
INFO: looking in /opt/rh/ruby193/root/usr/local/share/gems/specifications/*.gemspec /opt/rh/ruby193/root/usr/share/gems/specifications/*.gemspec
INFO: running: test_for_multiple_gem_versions
INFO: checking for presence of gem-installed rubygems
INFO: running: test_node_httpd_error_log
INFO: running: test_node_mco_log
INFO: skipping test_node_mco_log
INFO: running: test_pam_openshift
INFO: skipping test_pam_openshift
INFO: running: test_services_enabled
INFO: checking that required services are running now
INFO: checking that required services are enabled at boot
INFO: running: test_node_quota_bug
INFO: skipping test_node_quota_bug
INFO: running: test_vhost_servernames
INFO: checking for vhost interference problems
INFO: running: test_altered_package_owned_configs
INFO: running: test_broken_httpd_version
2 ERRORS

Expected results:

INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
INFO: running: test_enterprise_rpms
INFO: Checking that all OpenShift RPMs are actually from OpenShift Enterprise
INFO: running: test_selinux_policy_rpm
INFO: rpm selinux-policy installed with at least version 3.7.19-155.el6_3.8
INFO: running: test_selinux_enabled
INFO: running: test_broker_cache_permissions
INFO: broker application cache permissions appear fine
INFO: running: test_nodes_public_hostname
INFO: checking that each public_hostname resolves properly
INFO: PUBLIC_HOSTNAME node.example.com for node.example.com resolves to 10.16.46.57
INFO: checking that each public_hostname is unique
INFO: running: test_nodes_public_ip
INFO: checking that public_ip has been set for all nodes
INFO: PUBLIC_IP 10.16.46.57 for node.example.com
INFO: checking that public_ip is unique for all nodes
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective
INFO: profile for node.example.com: small
WARN: test_node_profiles_districts_from_broker
        No districts are defined. Districts should be used in any production installation.
        Please consult the Administration Guide.
INFO: skipping test_node_profiles_districts_from_broker
INFO: running: test_broker_accept_scripts
INFO: running oo-accept-broker
INFO: oo-accept-broker ran without error:
--BEGIN OUTPUT--
NOTICE: SELinux is Enforcing
NOTICE: SELinux is  Enforcing
PASS

--END oo-accept-broker OUTPUT--
INFO: running oo-accept-systems -w 1.0
INFO: oo-accept-systems -w 1.0 ran without error:
--BEGIN OUTPUT--
PASS

--END oo-accept-systems -w 1.0 OUTPUT--
INFO: running: test_node_accept_scripts
INFO: skipping test_node_accept_scripts
INFO: running: test_broker_httpd_error_log
INFO: running: test_broker_passenger_ps
INFO: checking the broker application process tree
INFO: running: test_for_nonrpm_rubygems
INFO: checking for presence of gem-installed rubygems
INFO: looking in /opt/rh/ruby193/root/usr/local/share/gems/specifications/*.gemspec /opt/rh/ruby193/root/usr/share/gems/specifications/*.gemspec
INFO: running: test_for_multiple_gem_versions
INFO: checking for presence of gem-installed rubygems
INFO: running: test_node_httpd_error_log
INFO: running: test_node_mco_log
INFO: skipping test_node_mco_log
INFO: running: test_pam_openshift
INFO: skipping test_pam_openshift
INFO: running: test_services_enabled
INFO: checking that required services are running now
INFO: checking that required services are enabled at boot
INFO: running: test_node_quota_bug
INFO: skipping test_node_quota_bug
INFO: running: test_vhost_servernames
INFO: checking for vhost interference problems
INFO: running: test_altered_package_owned_configs
INFO: running: test_broken_httpd_version
1 WARNINGS
NO ERRORS

Additional info:
Comment 2 Jason DeTiberus 2013-05-29 16:02:31 EDT
Jan,

Could you try adding a 30 second sleep between the services restart and the oo-diagnostics run?

This should allow sufficient time for mcollective to re-establish a connection with activemq following the service restarts.

I believe the problem you are experiencing occurs when the activemq service on the broker is restarted after the mcollective service has been restarted on the node.
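
For example, roughly (a sketch only; the 30-second figure is a suggested starting point, not a measured value):

  # restart everything as before
  for i in activemq mcollective mongod $(cd /etc/init.d && ls openshift-*) cgred oddjobd httpd ; do echo $i ; service $i restart ; done

  # give mcollective time to reconnect to activemq before diagnosing
  sleep 30
  ./oo-diagnostics -v -w 1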
Comment 5 Jason DeTiberus 2013-05-30 09:07:58 EDT
Adding synchronization between the broker and node so that the broker services are restarted before the node services should be sufficient, yes.
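
A sketch of that ordering, run from the broker (node.example.com is a placeholder and the service lists are illustrative):

  # broker host: bring activemq (and the other broker services) up first
  for i in activemq mongod $(cd /etc/init.d && ls openshift-*) httpd ; do service $i restart ; done

  # only then restart the node services, so mcollective reconnects to a live activemq
  ssh node.example.com 'for i in mcollective $(cd /etc/init.d && ls openshift-*) cgred oddjobd httpd ; do service $i restart ; done'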
Comment 6 Miciah Dashiel Butler Masters 2013-05-30 09:11:22 EDT
The service script for ActiveMQ can return before the daemon is ready to accept connections.  Initialisation of the daemon can take a couple minutes, so there could still be problems.  It would be helpful to have /var/log/activemq/activemq.log to see whether that's involved in the problem reported here.
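
One way to guard against that would be to poll until ActiveMQ actually accepts connections before restarting anything that depends on it. A sketch, assuming the default STOMP port 61613 that mcollective connects to (check /etc/activemq/activemq.xml for the configured port):

  service activemq restart
  # wait up to two minutes for the daemon to start listening
  for i in $(seq 1 120) ; do
      lsof -iTCP:61613 -sTCP:LISTEN > /dev/null 2>&1 && break
      sleep 1
  done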
Comment 7 Jan Pazdziora 2013-05-30 09:15:54 EDT
(In reply to Jason DeTiberus from comment #5)
> Adding synchronization between the broker and node so that the broker
> services are restarted before the node services should be sufficient, yes.

Thanks. I will do that and see how it works. I assume this bugzilla can be closed as NOTABUG?
Comment 9 Jan Pazdziora 2013-05-30 09:22:48 EDT
(In reply to Miciah Dashiel Butler Masters from comment #6)
> The service script for ActiveMQ can return before the daemon is ready to
> accept connections.  Initialisation of the daemon can take a couple minutes,
> so there could still be problems.  It would be helpful to have
> /var/log/activemq/activemq.log to see whether that's involved in the problem
> reported here.

I think it would be very helpful to have a script supported by OpenShift developers to start and restart the services, one that could add the necessary waits and guarantee that when it finishes, all services are in a ready state.

For example, in Spacewalk, we have

  https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-service

which calls

  https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-startup-helper

to check, for example, that tomcat is up and accepting connections (we test with lsof) before starting httpd.

This ensures that users cannot hit Apache and see 503 from tomcat -- once the spacewalk-service start finishes (unless it failed badly), daemons of all components are ready to accept connections and serve.

I probably can create an initial version of such a script if it would be viewed as useful.
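
Such a helper could start out as little more than a port wait, in the spirit of spacewalk-startup-helper (a sketch; the service names and ports would need to match the actual deployment -- 27017 is MongoDB's default port, 61613 ActiveMQ's default STOMP port):

  wait_for_port() {
      local port=$1 timeout=${2:-120}
      for i in $(seq 1 "$timeout") ; do
          # succeed as soon as something is listening on the port
          lsof -iTCP:"$port" -sTCP:LISTEN > /dev/null 2>&1 && return 0
          sleep 1
      done
      echo "nothing listening on port $port after ${timeout}s" >&2
      return 1
  }

  service mongod restart   && wait_for_port 27017
  service activemq restart && wait_for_port 61613
  service mcollective restart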
Comment 10 Jason DeTiberus 2013-05-30 09:28:28 EDT
(In reply to Jan Pazdziora from comment #9)
> (In reply to Miciah Dashiel Butler Masters from comment #6)
> > The service script for ActiveMQ can return before the daemon is ready to
> > accept connections.  Initialisation of the daemon can take a couple minutes,
> > so there could still be problems.  It would be helpful to have
> > /var/log/activemq/activemq.log to see whether that's involved in the problem
> > reported here.
> 
> I think it would be very helpful to have a script supported by OpenShift
> developers to start and restart the services, one that could add the
> necessary waits and guarantee that when it finishes, all services are in a
> ready state.
> 
> For example, in Spacewalk, we have
> 
>   https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-service
> 
> which calls
> 
>   https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-startup-helper
> 
> to check, for example, that tomcat is up and accepting connections (we test
> with lsof) before starting httpd.
> 
> This ensures that users cannot hit Apache and see 503 from tomcat -- once
> the spacewalk-service start finishes (unless it failed badly), daemons of
> all components are ready to accept connections and serve.
> 
> I probably can create an initial version of such a script if it would be
> viewed as useful.

I think a script like that could be useful in an all-in-one type deployment, but once services are broken out onto multiple hosts (as we recommend for production), the script gets complicated.

Potentially we could have the broker script verify that the services it depends on (activemq, mongo, etc.) are up before starting; however, I don't think we necessarily want it to fail when those services aren't available either, since it will reconnect to them once they become available.
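
As a sketch of that "verify but do not fail" idea (the hostnames and the openshift-broker service name are placeholders; bash's /dev/tcp is used so no extra tools are needed):

  check_dep() {  # warn, rather than abort, when a dependency is unreachable
      local host=$1 port=$2 name=$3
      (exec 3<> "/dev/tcp/$host/$port") 2>/dev/null ||
          echo "WARNING: $name ($host:$port) not reachable yet; the broker will reconnect later" >&2
  }

  check_dep datastore.example.com 27017 mongod
  check_dep broker.example.com    61613 activemq
  service openshift-broker restart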
