Bug 1048148 - oo-diagnostics hangs while querying nodes
Summary: oo-diagnostics hangs while querying nodes
Keywords:
Status: CLOSED DUPLICATE of bug 1122872
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 2.0.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Assignee: Luke Meyer
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-01-03 09:48 UTC by Ma xiaoqiang
Modified: 2016-07-04 00:44 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-07-24 13:38:53 UTC
Target Upstream Version:
Embargoed:



Description Ma xiaoqiang 2014-01-03 09:48:48 UTC
Description of problem:
oo-diagnostics hangs with '--abortok' option if the DNS service is down


Version-Release number of selected component (if applicable):
rubygem-openshift-origin-common-1.17.2.5-1.el6op.noarch

How reproducible:
always

Steps to Reproduce:
1. Stop the DNS service:
# /etc/init.d/named stop
2. Run oo-diagnostics:
# oo-diagnostics --abortok -v


Actual results:

oo-diagnostics hangs when checking node profiles via MCollective
Output:
INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: OpenShift node installed.
INFO: Loading the broker rails environment.
INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
FAIL: prereq_dns_server_available
        192.168.59.193 doesn't appear to respond to DNS requests.
        This command:
          host -W 1 example.com. 192.168.59.193
        should have connected to your primary nameserver.
        Instead, it returned:
          ;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

        Please check the following to resolve this issue:
        * Does /etc/resolv.conf have your correct nameserver?
        * Is your nameserver running?
        * Is the firewall on your nameserver open (udp:53)?
        * Can you connect to your nameserver?
        Many OpenShift functions fail without working DNS resolution.

INFO: running: prereq_domain_resolves
INFO: checking that we can resolve our application domain
FAIL: prereq_domain_resolves
        Application domain does not appear to resolve under
        current nameserver configuration. This command:
          host -W 5 -t NS 'ose20-xiama.com.cn'
        should have returned the nameserver(s) for ose20-xiama.com.cn.
        Instead, it returned:
          Host ose20-xiama.com.cn not found: 3(NXDOMAIN)

        Please check the following to resolve this issue:
        * Is CLOUD_DOMAIN=ose20-xiama.com.cn in broker.conf correct?
        * Does /etc/resolv.conf have the right nameserver(s)?
        * Is your OpenShift domain nameserver running?
        * Is the firewall on your nameserver open (udp:53)?
        * Does your nameserver respond to queries via dig/host?
        Many OpenShift functions may fail without application DNS.

INFO: running: test_enterprise_rpms
INFO: Checking that all OpenShift RPMs are actually from OpenShift Enterprise
INFO: running: test_selinux_policy_rpm
INFO: rpm selinux-policy installed with at least version 3.7.19-195.el6_4.4
INFO: running: test_selinux_enabled
INFO: running: test_broker_cache_permissions
INFO: broker application cache permissions appear fine
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective


Expected results:
oo-diagnostics should continue with the remaining tests instead of hanging.

Additional info:

Comment 2 Luke Meyer 2014-01-03 17:51:03 UTC
If the host DNS server is down (requiring --abortok), you should expect problems.

However, I recall that there are some other scenarios (this is just one) where mcollective can't reach activemq and simply hangs, and I agree that oo-diagnostics should not do that. Thanks for making the bug so we can remember to look into this and ensure there is always some kind of timeout enforced.

I can't remember exactly when I've seen this before, but here are some scenarios to test:
1. ActiveMQ host doesn't resolve (like this scenario)
2. Port 61613 is not open on ActiveMQ host
3. ActiveMQ is not started

And for that matter, we may encounter the same kinds of problems when mongodb is unreachable.
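
A minimal Ruby sketch of how the three ActiveMQ scenarios listed above could be told apart without blocking indefinitely. The helper name probe_tcp and the 3-second default are assumptions for illustration; this is not code from oo-diagnostics or MCollective, and which symbol a given firewall setup produces may vary:

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper (not part of oo-diagnostics): try to open a TCP
# connection and classify the outcome, giving up after `seconds`.
def probe_tcp(host, port, seconds = 3)
  Timeout.timeout(seconds) do
    TCPSocket.new(host, port).close
  end
  :reachable
rescue SocketError
  :unresolvable   # scenario 1: the host name does not resolve
rescue Errno::ECONNREFUSED
  :refused        # scenario 3: nothing is listening on the port
rescue Timeout::Error, Errno::ETIMEDOUT, Errno::EHOSTUNREACH, Errno::ENETUNREACH
  :timed_out      # scenario 2: packets are dropped or the host is unreachable
end
```

A caller could run something like probe_tcp(activemq_host, 61613) before the MCollective check and skip that check unless the result is :reachable; activemq_host here is a placeholder for whatever host the MCollective client is configured to use.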

Comment 3 Luke Meyer 2014-01-03 18:13:27 UTC
Actually... I think mcollective does time out normally, and it may be mongodb that's hanging (looking up districts). Check both.
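
To check both suspects without the whole run stalling, each check could be given its own deadline so that a hang is reported by name. A minimal sketch using Ruby's standard Timeout module; run_with_deadline is a hypothetical name, and the checks passed in would be stand-ins for the real MCollective and MongoDB lookups, which are not shown in this bug:

```ruby
require 'timeout'

# Hypothetical runner (not part of oo-diagnostics): call each named check
# with its own deadline and record whether it finished, hung, or raised.
def run_with_deadline(checks, seconds)
  checks.each_with_object({}) do |(name, check), results|
    results[name] = begin
      Timeout.timeout(seconds) { check.call }
      :ok
    rescue Timeout::Error
      :hung
    rescue StandardError => e
      "failed: #{e.class}"
    end
  end
end
```

Passing a hash such as { 'mcollective' => some_lambda, 'mongodb' => another_lambda } returns a hash mapping each name to :ok, :hung, or a failure string. Note that Timeout works by interrupting the running thread, so it may not break out of every blocking native call; a client-level connect or read timeout, where the library offers one, would be the more robust fix.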

Comment 4 Ma xiaoqiang 2014-02-07 13:42:38 UTC
Checked on puddle [2.0.4/2014-02-06.1].
oo-diagnostics hangs only when the ActiveMQ service is unreachable, e.g. DNS is down, ActiveMQ is stopped, or port 61613 is blocked.

Scenario 1:
Stop ActiveMQ or block port 61613. oo-diagnostics hangs.
1. Stop ActiveMQ, or block port 61613:
# /etc/init.d/activemq stop
# iptables -A INPUT -p tcp --dport 61613 -j DROP
# iptables -A OUTPUT -p tcp --dport 61613 -j DROP
2. Run oo-diagnostics:
# oo-diagnostics --abortok -v
Output:
INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: OpenShift node installed.
INFO: Loading the broker rails environment.
INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
INFO: running: prereq_domain_resolves
INFO: checking that we can resolve our application domain
INFO: running: test_enterprise_rpms
INFO: Checking that all OpenShift RPMs are actually from OpenShift Enterprise
INFO: running: test_selinux_policy_rpm
INFO: rpm selinux-policy installed with at least version 3.7.19-195.el6_4.4
INFO: running: test_selinux_enabled
INFO: running: test_broker_cache_permissions
INFO: broker application cache permissions appear fine
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective

Scenario 2:
Stop MongoDB. oo-diagnostics runs normally.
1. Stop MongoDB:
# /etc/init.d/mongod stop
2. Run oo-diagnostics:
# oo-diagnostics --abortok -v
Output:
INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: OpenShift node installed.
INFO: Loading the broker rails environment.
  MOPED: Retrying connection to primary for replica set <Moped::Cluster nodes=[<Moped::Node resolved_address="127.0.0.1:27017">]>
  MOPED: Retrying connection to primary for replica set <Moped::Cluster nodes
<--snip-->
FAIL: rescue in load_broker_rails_env
        Broker application failed to load. This is often a gem dependency problem.
        Updating rubygem RPMs and restarting openshift-broker
        to regenerate the broker Gemfile.lock may fix the problem.
        The actual error encountered was:
        #<Moped::Errors::ConnectionFailure: Could not connect to a primary node for replica set <Moped::Cluster nodes=[<Moped::Node resolved_address="127.0.0.1:27017">]>>
        ***
        THIS PROBLEM NEEDS TO BE RESOLVED FOR THE BROKER TO WORK.
        DISABLING BROKER TESTS.
        ***

INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
INFO: running: prereq_domain_resolves
INFO: checking that we can resolve our application domain
<--snip-->

Scenario 3:
Stop the mcollective service. oo-diagnostics runs normally.
1. Stop mcollective:
# /etc/init.d/ruby193-mcollective stop
2. Run oo-diagnostics:
# oo-diagnostics --abortok -v
Output:
[root@broker ~]# oo-diagnostics -o -v
INFO: loading list of installed packages
INFO: OpenShift broker installed.
INFO: OpenShift node installed.
INFO: Loading the broker rails environment.
INFO: running: prereq_dns_server_available
INFO: checking that the first server in /etc/resolv.conf responds
INFO: running: prereq_domain_resolves
INFO: checking that we can resolve our application domain
INFO: running: test_enterprise_rpms
INFO: Checking that all OpenShift RPMs are actually from OpenShift Enterprise
INFO: running: test_selinux_policy_rpm
INFO: rpm selinux-policy installed with at least version 3.7.19-195.el6_4.4
INFO: running: test_selinux_enabled
INFO: running: test_broker_cache_permissions
INFO: broker application cache permissions appear fine
INFO: running: test_node_profiles_districts_from_broker
INFO: checking node profiles via MCollective

No request sent, we did not discover any nodes.
FAIL: test_node_profiles_districts_from_broker
          No node hosts found. Please install some,
          or ensure the existing ones respond to 'oo-mco ping'.
          OpenShift cannot host gears without at least one node host responding.

INFO: skipping test_node_profiles_districts_from_broker
INFO: running: test_broker_accept_scripts
INFO: running oo-accept-broker
INFO: oo-accept-broker ran without error:
<--snip-->
INFO: running: test_broker_passenger_ps
INFO: checking the broker application process tree
INFO: running: test_for_nonrpm_rubygems
INFO: checking for presence of gem-installed rubygems
INFO: looking in /opt/rh/ruby193/root/usr/local/share/gems/specifications/*.gemspec /opt/rh/ruby193/root/usr/share/gems/specifications/*.gemspec
INFO: running: test_for_multiple_gem_versions
INFO: checking for presence of gem-installed rubygems
INFO: running: test_node_httpd_error_log
INFO: running: test_node_containerization_plugin
INFO: running: test_node_mco_log
INFO: running: test_pam_openshift
INFO: running: test_services_enabled
INFO: checking that required services are running now
FAIL: test_services_enabled
      The following service(s) are not currently started:
      ruby193-mcollective
      These services are required for OpenShift functionality.

INFO: checking that required services are enabled at boot
INFO: running: test_missing_iptables_config
<--snip--> 
test_auth_conf_files
INFO: running: test_broker_certificate
WARN: test_broker_certificate
Using a self-signed certificate for the broker
INFO: running: test_abrt_addon_python
INFO: running: test_node_frontend_clash
INFO: running: test_yum_configuration
4 WARNINGS
5 ERRORS

Comment 5 Ma xiaoqiang 2014-02-07 13:43:31 UTC
(Verbatim repost of comment 4.)

Comment 6 Luke Meyer 2014-07-24 13:38:53 UTC
Consolidating all "hangs on activemq down" bugs.

*** This bug has been marked as a duplicate of bug 1122872 ***

