Description of problem:
oo-admin-chk level 1 times out

Version-Release number of selected component (if applicable):
openshift-origin-broker-util-1.10.6-1.el6oso.noarch

How reproducible:
always, in openshift.com production

Steps to Reproduce:
1. execute oo-admin-chk --level 1 on a node
2. wait 40-150 minutes

Actual results:
Stack trace:
Started at: 2013-07-17 10:37:49 -0400
/opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/networking.rb:306:in `rescue in receive_message_on_socket': Operation failed with the following exception: Connection timed out (Mongo::ConnectionFailure)
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/networking.rb:298:in `receive_message_on_socket'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/networking.rb:159:in `receive_header'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/networking.rb:150:in `receive'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/networking.rb:117:in `receive_message'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/cursor.rb:529:in `send_get_more'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/cursor.rb:463:in `refresh'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/cursor.rb:124:in `next'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/cursor.rb:285:in `each'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/lib/openshift/data_store.rb:23:in `block in find'
	from /opt/rh/ruby193/root/usr/local/share/gems/gems/mongo-1.8.1/lib/mongo/collection.rb:276:in `find'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/lib/openshift/data_store.rb:22:in `find'
	from /usr/sbin/oo-admin-chk:245:in `<main>'

Expected results:
oo-admin-chk level 1 report, in a more reasonable timeframe

Additional info:
If you take away all the bloat inside the block you get:

>> require "/var/www/openshift/broker/config/environment"
=> true
>>
?> start_time = Time.now
=> 2013-07-17 09:45:00 -0400
>> apps = []
=> []
>> app_selection = {:fields => ["name", "uuid", "created_at", "domain_id", "group_instances.gears._id","group_instances.gears.uuid", "group_instances.gears.uid", "group_instances.gears.server_identity", "group_instances._id", "component_instances._id", "component_instances.cartridge_name", "component_instances.group_instance_id", "group_overrides", "app_ssh_keys.name", "app_ssh_keys.content"], :timeout => false}
=> {:fields=>["name", "uuid", "created_at", "domain_id", "group_instances.gears._id", "group_instances.gears.uuid", "group_instances.gears.uid", "group_instances.gears.server_identity", "group_instances._id", "component_instances._id", "component_instances.cartridge_name", "component_instances.group_instance_id", "group_overrides", "app_ssh_keys.name", "app_ssh_keys.content"], :timeout=>false}
>> app_query = {"group_instances.gears.0" => {"$exists" => true}}
=> {"group_instances.gears.0"=>{"$exists"=>true}}
>> OpenShift::DataStore.find(:applications, app_query, app_selection) do |app|
?>   apps << app
>> end
=> nil
>>
?> puts apps.size
149766
=> nil
>> puts Time.now - start_time
130.756245042
About time we employ multiple threads/processes.
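As a rough illustration of the multiple-threads idea, here is a minimal sketch that splits one large fetch into disjoint slices and runs each slice in its own thread. Everything in it is an assumption made for illustration: RECORDS is an in-memory stand-in for the applications collection, and fetch_slice and parallel_fetch are hypothetical helpers, not part of oo-admin-chk or OpenShift::DataStore. It is not the change made in the commits referenced in this bug.

```ruby
# Illustrative in-memory "collection"; stands in for the applications
# collection (assumption, not real data).
RECORDS = (0...1000).map { |i| { uuid: format("%08x", i), name: "app#{i}" } }

# Hypothetical helper: returns the records whose uuid falls into slice
# `index` out of `count` slices. A real version would issue one datastore
# query per slice, using a partitioning the datastore can apply efficiently.
def fetch_slice(records, index, count)
  records.select { |r| r[:uuid].to_i(16) % count == index }
end

# Runs one fetch per slice in its own thread and merges the results.
# Thread#value joins the thread and returns the value of its block.
def parallel_fetch(records, slice_count)
  threads = (0...slice_count).map do |index|
    Thread.new { fetch_slice(records, index, slice_count) }
  end
  threads.flat_map(&:value)
end

apps = parallel_fetch(RECORDS, 4)
puts apps.size
```

Because the slices are disjoint and together cover every record, the merged result contains each record exactly once. In MRI, threads mainly help while a fetch is waiting on I/O such as a network round trip; CPU-bound work would need separate processes instead.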
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2c3c908560fa6e28588425791605e79d20ee2cfd
Fix for bug 985496
Verified on devenv_3588.
Querying apps now takes much less time, and no timeout is reported when running oo-admin-chk at level 1.

irb(main):001:0> require "/var/www/openshift/broker/config/environment"
=> true
irb(main):002:0> start_time = Time.now
=> 2013-07-30 22:41:11 -0400
irb(main):003:0> apps = []
=> []
irb(main):004:0> app_selection = {:fields => ["name", "uuid", "created_at", "domain_id", "group_instances.gears._id","group_instances.gears.uuid", "group_instances.gears.uid", "group_instances.gears.server_identity", "group_instances._id", "component_instances._id", "component_instances.cartridge_name", "component_instances.group_instance_id", "group_overrides", "app_ssh_keys.name", "app_ssh_keys.content"], :timeout => false}
=> {:fields=>["name", "uuid", "created_at", "domain_id", "group_instances.gears._id", "group_instances.gears.uuid", "group_instances.gears.uid", "group_instances.gears.server_identity", "group_instances._id", "component_instances._id", "component_instances.cartridge_name", "component_instances.group_instance_id", "group_overrides", "app_ssh_keys.name", "app_ssh_keys.content"], :timeout=>false}
irb(main):005:0> app_query = {"group_instances.gears.0" => {"$exists" => true}}
=> {"group_instances.gears.0"=>{"$exists"=>true}}
irb(main):006:0> OpenShift::DataStore.find(:applications, app_query, app_selection) do |app|
irb(main):007:1*   apps << app
irb(main):008:1> end
=> nil
irb(main):009:0> puts apps.size
2
=> nil
irb(main):010:0> puts Time.now - start_time
48.302894661
Marking it as verified again. Will reopen if it's still a problem in PROD.
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/d68e54680165a75b14c60dbf1852d2ec9082c1d2
Fix for bug 985496
This bug is verified on devenv-stage_439.