Where should the timeout value be configured? It looks like /etc/openshift/broker.conf is one place this configuration could occur. Have we confirmed that nsupdate is the cause of the long delay? Do operations on the client side (oo-accept-broker) continue after 300 seconds (5 min)?
Hi Eric, I managed to reproduce the issue, thanks to pep++. The problem happens in the run_ruby_rails_command function while trying to get configuration values out of the rails environment. I tried 3 different approaches:

1. Misconfigure /etc/resolv.conf: the command fails with the same error, but it finishes in about 13s.
2. Stop the named service on the DNS server, or reject DNS traffic with iptables: the command fails, but it finishes in about 48s.
3. Shut down the DNS VM, or drop DNS traffic with iptables: the command fails, but it finishes in about 7m14s.

So there actually is a timeout, but it is really long, more than 7 minutes in my environment. You can also reproduce the issue by running:

cd /var/www/openshift/broker && oo-exec-ruby rails console
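The difference between the three cases above (13s, 48s, 7m14s) comes down to how long each individual DNS lookup takes to fail. A minimal Ruby sketch for measuring that per-attempt cost, assuming nothing from the broker code (the timed_lookup helper and the 5-second cap are illustrative, not part of oo-accept-broker):

```ruby
require 'resolv'
require 'timeout'
require 'benchmark'

# Hypothetical helper: time a single DNS lookup and cap it with an
# explicit limit instead of relying on the resolver's long defaults.
# Returns [address_or_nil, elapsed_seconds].
def timed_lookup(hostname, limit = 5)
  address = nil
  elapsed = Benchmark.realtime do
    begin
      address = Timeout.timeout(limit) { Resolv.getaddress(hostname) }
    rescue Timeout::Error, Resolv::ResolvError
      address = nil # unresolvable, or the resolver hung past the limit
    end
  end
  [address, elapsed]
end
```

A fast NXDOMAIN answer (misconfigured resolv.conf) fails in well under a second per lookup, while a silently dropped packet forces the resolver to wait out its full retry schedule, which is consistent with the 13s-versus-7m spread observed above.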
DNS is an absolutely necessary part of OpenShift. Without it, we do not expect anything to work, so this is pretty much expected behavior. That said, let's take a look at where we are hanging.

As pointed out, we appear to be hanging in the 'run_ruby_rails_command' function, specifically when loading the broker rails environment:

function run_ruby_rails_command() {
    [...]
    pushd /var/www/openshift/broker > /dev/null
    safe_bundle "ruby -I ${OSO_BROKER_ROOT} -r rubygems --" <<EOF
begin
  require 'config/environment'
  require 'config/environments/$OSO_ENVIRONMENT'
  broker_conf = OpenShift::Config.new('$BROKER_CONF')
  console_conf = OpenShift::Config.new('$CONSOLE_CONF') if File.exists? '$CONSOLE_CONF'
rescue Bundler::GemNotFound => gem_error
  \$stderr.puts "Error loading libs or gems: #{gem_error}"
  exit 1
end
$1
exit 0
EOF

We can reproduce this easily by adding the 'EOF' contents to a file (titled ~/ruby_command in this example), replacing the variables, and running the command below:

scl enable ruby193 v8314 "bundle exec ruby -I /var/www/openshift/broker/ -r rubygems -- ~/ruby_command"

The contents of the file are:

begin
  require 'config/environment'
  require 'config/environments/production'
  broker_conf = OpenShift::Config.new('/etc/openshift/broker.conf')
  console_conf = OpenShift::Config.new('/etc/openshift/console.conf') if File.exists? '/etc/openshift/console.conf'
rescue Bundler::GemNotFound => gem_error
  puts "Error loading libs or gems: OH NER GERM PERBLERMS"
  exit 1
end
$1
exit 0

Running this while my named service is off, I can see that we try to load the broker rails environment and connect to the broker's database. The database connection fails because we cannot resolve its hostname. I can see it retry the connection 20 times, taking roughly 7 minutes:

# date; scl enable ruby193 v8314 "bundle exec ruby -I /var/www/openshift/broker/ -r rubygems -- ~/ruby_command"; date
Tue Oct 14 15:54:49 EDT 2014
[.. Error messages redacted ..]
Tue Oct 14 16:01:58 EDT 2014

This is not a timeout that was hit. We retried connecting to the mongo database a specific number of times and failed every time. The amount of time this takes depends on how long each try takes to fail.

We can reduce the number of retries by modifying the /var/www/openshift/broker/config/mongoid.yml file and adding a `max_retries: X` option:

production:
  sessions:
    default:
      <%
        default_config = { :host_port => 'localhost:27017',
                           :user     => 'openshift',
                           :password => 'mooo',
                           :db       => 'openshift_broker' }
        config = defined?(Rails) ? Rails.application.config.datastore : default_config
      %>
      database: <%= config[:db] %>
      hosts:
      <% config[:host_port].split(',').each do |server| %>
        - <%= server %>
      <% end %>
      username: "<%= config[:user] %>"
      password: "<%= config[:password] %>"
      options:
        max_retries: 3
        consistency: :strong
        safe: true
        ssl: <%= config[:ssl] %>
        write:
          w: <%= config[:write_replicas] %>

After setting max_retries to 3, it takes significantly less time for this to fail. Perhaps this should be lowered?
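The arithmetic lines up with that suggestion: the timestamps above span 15:54:49 to 16:01:58, i.e. about 429 seconds across 20 retries. A quick back-of-the-envelope check in Ruby of what `max_retries: 3` would buy, assuming each failed attempt costs roughly the same:

```ruby
# Observed: 20 retries took ~429 seconds total (15:54:49 -> 16:01:58).
total_seconds = 429.0
retries       = 20
per_attempt   = total_seconds / retries # ~21.5s per failed attempt

# With max_retries: 3 the same failure should surface in about a minute.
puts format('per attempt:    ~%.1fs', per_attempt)
puts format('max_retries 3:  ~%.0fs to fail', 3 * per_attempt)
```

Lowering the retry count helps, but the per-attempt cost is still dominated by how long each connection takes to time out, which is the harder number to control.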
Train people to run oo-diagnostics, not oo-accept-broker. If you run oo-diagnostics and DNS is sufficiently broken (DNS server not responding at all), then you find that out before running a lot of other tests that will fail. It runs oo-accept-broker for you, so why do anything else? I don't know that we want to reduce mongo connection retries. I agree it shouldn't take oo-accept-broker 7+ minutes to give up on mongo in this scenario... it's just not clear how best to reduce that in the case where each attempt takes a while to time out.
In the past we've told users to configure oo-accept-broker as part of their standard health monitoring. With that in mind, it may make sense to short-circuit on DNS-related failures to prevent excessive runtimes.
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/adca6d8e0857f0d768587880b003c1005fd7b561

Check MongoDB hosts connectivity prior to loading broker rails environment

Bug 1152524
https://bugzilla.redhat.com/show_bug.cgi?id=1152524

oo-accept-broker: Check for connectivity with MongoDB hosts prior to attempting to load the broker's rails environment. Failing to do so previously resulted in oo-accept-broker waiting ~7 minutes for the broker database connection attempts to finally time out.
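A minimal sketch of what such a pre-flight connectivity check can look like; this is an illustration of the approach, not the code from the commit, and the host, port, and 5-second limit are assumptions:

```ruby
require 'socket'
require 'timeout'

# Try a plain TCP connection to a MongoDB host before doing anything
# expensive like loading the broker rails environment.
def mongo_reachable?(host, port, limit = 5)
  Timeout.timeout(limit) do
    TCPSocket.open(host, port) { |_sock| true }
  end
rescue Timeout::Error, SocketError, SystemCallError
  # SocketError covers DNS resolution failures, SystemCallError covers
  # refusals/unreachable networks, Timeout::Error covers silent hangs.
  false
end
```

Because the failure is detected with a short, bounded TCP attempt rather than by riding out the Mongo driver's retry loop, a broken DNS setup surfaces in seconds instead of minutes, which matches the verification results in the next comment.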
Verified this bug with openshift-origin-broker-util-1.36.2.2-1.el6op.noarch, and it PASSes. When DNS requests are dropped by iptables, we get the following error, and the run time is greatly shortened:

# time oo-accept-broker -v
<--snip-->
INFO: checking firewall settings
INFO: checking services
INFO: checking Mongo host connectivity
FAIL: Unable to connect to Mongo host: "datastore.ose-22.com.cn:27017"
FAIL: Unable to load broker application values, broker database is unreachable.
2 ERRORS

real    0m5.157s
user    0m3.549s
sys     0m0.542s

When the iptables rules are restored to allow DNS access, it works well:

# time oo-accept-broker -v
<--snip-->
INFO: checking firewall settings
INFO: checking services
INFO: checking Mongo host connectivity
INFO: Connected to Mongo host: "datastore.ose-22.com.cn:27017"
INFO: checking mongo datastore configuration
INFO: Datastore Host: datastore.ose-22.com.cn
INFO: Datastore Port: 27017
INFO: Datastore User: openshift
INFO: Datastore SSL: false
INFO: Datastore Password has been set to non-default
INFO: Datastore DB Name: openshift_broker
INFO: Datastore: mongo db service is remote
INFO: checking mongo db login access
<--snip-->
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1844.html