Where should the timeout value be configured? It looks like /etc/openshift/broker.conf is one place this configuration could occur. Have we confirmed that nsupdate is the cause of the long delay? Do operations on the client side (oo-accept-broker) continue after 300 seconds (5 min)?
Hi Eric, I managed to reproduce the issue, thanks to pep++. The problem happens in the run_ruby_rails_command function while trying to get configuration values out of the rails environment. I tried 3 different approaches:

1. Misconfigure /etc/resolv.conf: the command fails with the same error, but it finishes in about 13s.
2. Stop the named service on the DNS server, or reject DNS traffic with iptables: the command fails, but it finishes in about 48s.
3. Shut down the DNS VM, or drop DNS traffic with iptables: the command fails, but it finishes in about 7m14s.

So there actually is a timeout, but it is really long, more than 7 minutes in my environment. You can also reproduce the issue by running:

cd /var/www/openshift/broker && oo-exec-ruby rails console
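The difference between the three cases above (13s, 48s, 7m14s) comes down to how long each individual DNS lookup takes to fail. A minimal Ruby sketch for measuring that per-attempt cost, assuming nothing from the broker code (the timed_lookup helper and the 5-second cap are illustrative, not part of oo-accept-broker):

```ruby
require 'resolv'
require 'timeout'
require 'benchmark'

# Hypothetical helper: time a single DNS lookup and cap it with an
# explicit limit instead of relying on the resolver's long defaults.
# Returns [address_or_nil, elapsed_seconds].
def timed_lookup(hostname, limit = 5)
  address = nil
  elapsed = Benchmark.realtime do
    begin
      address = Timeout.timeout(limit) { Resolv.getaddress(hostname) }
    rescue Timeout::Error, Resolv::ResolvError
      address = nil # unresolvable, or the resolver hung past the limit
    end
  end
  [address, elapsed]
end
```

A fast NXDOMAIN answer (misconfigured resolv.conf) fails in well under a second per lookup, while a silently dropped packet forces the resolver to wait out its full retry schedule, which is consistent with the 13s-versus-7m spread observed above.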
DNS is an absolutely necessary part of OpenShift. Without it, we do not expect anything to work, so this is pretty much expected behavior. That said, let's take a look at where we are hanging.

As pointed out, we appear to be hanging in the 'run_ruby_rails_command' function, specifically when loading the broker rails environment:

function run_ruby_rails_command() {
    [...]
    pushd /var/www/openshift/broker > /dev/null
    safe_bundle "ruby -I ${OSO_BROKER_ROOT} -r rubygems --" <<EOF
begin
  require 'config/environment'
  require 'config/environments/$OSO_ENVIRONMENT'
  broker_conf = OpenShift::Config.new('$BROKER_CONF')
  console_conf = OpenShift::Config.new('$CONSOLE_CONF') if File.exists? '$CONSOLE_CONF'
rescue Bundler::GemNotFound => gem_error
  \$stderr.puts "Error loading libs or gems: #{gem_error}"
  exit 1
end
$1
exit 0
EOF

We can reproduce this easily by adding the 'EOF' contents to a file (titled ~/ruby_command in this example), replacing the variables, and running the command below:

scl enable ruby193 v8314 "bundle exec ruby -I /var/www/openshift/broker/ -r rubygems -- ~/ruby_command"

The contents of the file are:

begin
  require 'config/environment'
  require 'config/environments/production'
  broker_conf = OpenShift::Config.new('/etc/openshift/broker.conf')
  console_conf = OpenShift::Config.new('/etc/openshift/console.conf') if File.exists? '/etc/openshift/console.conf'
rescue Bundler::GemNotFound => gem_error
  puts "Error loading libs or gems: OH NER GERM PERBLERMS"
  exit 1
end
$1
exit 0

Running this while my named service is off, I can see that we try to load the broker rails environment and connect to the broker's database. The database connection fails because we cannot resolve its hostname. I can see it retry the connection 20 times, taking roughly 7 minutes:

# date; scl enable ruby193 v8314 "bundle exec ruby -I /var/www/openshift/broker/ -r rubygems -- ~/ruby_command"; date
Tue Oct 14 15:54:49 EDT 2014
[.. Error messages redacted ..]
Tue Oct 14 16:01:58 EDT 2014

This is not a timeout that was hit. We retried connecting to the mongo database a specific number of times and failed every time. The amount of time this takes depends on how long each try takes to fail.

We can reduce the number of retries by modifying the /var/www/openshift/broker/config/mongoid.yml file and adding a `max_retries: X` option:

production:
  sessions:
    default:
      <%
        default_config = { :host_port => 'localhost:27017',
                           :user     => 'openshift',
                           :password => 'mooo',
                           :db       => 'openshift_broker' }
        config = defined?(Rails) ? Rails.application.config.datastore : default_config
      %>
      database: <%= config[:db] %>
      hosts:
      <% config[:host_port].split(',').each do |server| %>
        - <%= server %>
      <% end %>
      username: "<%= config[:user] %>"
      password: "<%= config[:password] %>"
      options:
        max_retries: 3
        consistency: :strong
        safe: true
        ssl: <%= config[:ssl] %>
        write:
          w: <%= config[:write_replicas] %>

After setting max_retries to 3, it takes significantly less time for this to fail. Perhaps this should be lowered?
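The arithmetic lines up with that suggestion: the timestamps above span 15:54:49 to 16:01:58, i.e. about 429 seconds across 20 retries. A quick back-of-the-envelope check in Ruby of what `max_retries: 3` would buy, assuming each failed attempt costs roughly the same:

```ruby
# Observed: 20 retries took ~429 seconds total (15:54:49 -> 16:01:58).
total_seconds = 429.0
retries       = 20
per_attempt   = total_seconds / retries # ~21.5s per failed attempt

# With max_retries: 3 the same failure should surface in about a minute.
puts format('per attempt:    ~%.1fs', per_attempt)
puts format('max_retries 3:  ~%.0fs to fail', 3 * per_attempt)
```

Lowering the retry count helps, but the per-attempt cost is still dominated by how long each connection takes to time out, which is the harder number to control.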
Train people to run oo-diagnostics, not oo-accept-broker. If you run oo-diagnostics and DNS is sufficiently broken (DNS server not responding at all), then you find that out before running a lot of other tests that will fail. It runs oo-accept-broker for you, so why do anything else? I don't know that we want to reduce mongo connection retries. I agree it shouldn't take oo-accept-broker 7+ minutes to give up on mongo in this scenario... it's just not clear how best to reduce that in the case where each attempt takes a while to time out.
In the past we've told users to configure oo-accept-broker as part of their standard health monitoring. With that in mind, it may make sense to short-circuit on DNS-related failures to prevent excessive runtimes.
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/adca6d8e0857f0d768587880b003c1005fd7b561

Check MongoDB hosts connectivity prior to loading broker rails environment

Bug 1152524
https://bugzilla.redhat.com/show_bug.cgi?id=1152524

oo-accept-broker: Check for connectivity with MongoDB hosts prior to attempting to load the broker's rails environment. Failing to do so previously resulted in oo-accept-broker waiting ~7 minutes for the broker database connection attempts to finally time out.
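A minimal sketch of what such a pre-flight connectivity check can look like; this is an illustration of the approach, not the code from the commit, and the host, port, and 5-second limit are assumptions:

```ruby
require 'socket'
require 'timeout'

# Try a plain TCP connection to a MongoDB host before doing anything
# expensive like loading the broker rails environment.
def mongo_reachable?(host, port, limit = 5)
  Timeout.timeout(limit) do
    TCPSocket.open(host, port) { |_sock| true }
  end
rescue Timeout::Error, SocketError, SystemCallError
  # SocketError covers DNS resolution failures, SystemCallError covers
  # refusals/unreachable networks, Timeout::Error covers silent hangs.
  false
end
```

Because the failure is detected with a short, bounded TCP attempt rather than by riding out the Mongo driver's retry loop, a broken DNS setup surfaces in seconds instead of minutes, which matches the verification results in the next comment.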
Verified this bug with openshift-origin-broker-util-1.36.2.2-1.el6op.noarch, and it PASSes. When DNS requests are dropped by iptables, we get the following error, and the run time is greatly shortened:

# time oo-accept-broker -v
<--snip-->
INFO: checking firewall settings
INFO: checking services
INFO: checking Mongo host connectivity
FAIL: Unable to connect to Mongo host: "datastore.ose-22.com.cn:27017"
FAIL: Unable to load broker application values, broker database is unreachable.
2 ERRORS

real    0m5.157s
user    0m3.549s
sys     0m0.542s

When the iptables rules are restored to allow DNS access, it works well:

# time oo-accept-broker -v
<--snip-->
INFO: checking firewall settings
INFO: checking services
INFO: checking Mongo host connectivity
INFO: Connected to Mongo host: "datastore.ose-22.com.cn:27017"
INFO: checking mongo datastore configuration
INFO: Datastore Host: datastore.ose-22.com.cn
INFO: Datastore Port: 27017
INFO: Datastore User: openshift
INFO: Datastore SSL: false
INFO: Datastore Password has been set to non-default
INFO: Datastore DB Name: openshift_broker
INFO: Datastore: mongo db service is remote
INFO: checking mongo db login access
<--snip-->
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1844.html