Description of problem:
------------------------
I'm running into a Proxy error ~48 hours after having built my appliance. I have enabled capacity and utilization collection on the appliance. The appliance is managing two infra providers: RHEV and vSphere.

Memory usage from the output of top:
Mem:  5992864k total, 5682332k used,  310532k free,  20456k buffers
Swap: 9957372k total,  109800k used, 9847572k free, 724636k cached

Version-Release number of selected component (if applicable):
------------------------
5.4.0.0.11

How reproducible:
----------------
Unsure

Steps to Reproduce:
-----------------
1.
2.
3.

Actual results:
---------------

Expected results:
----------------

Additional info:
----------------
1) Top output:

PID   USER VIRT  PR NI RES  SHR  S %CPU %MEM TIME+    SWAP COMMAND
16149 root 4698m 23 3  1.2g 2244 S 0.0  21.6 99:27.29 850m ruby

2) Snippet from evm.log:

[----] I, [2015-03-30T04:03:00.949158 #7424:c7be9c] INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-30 08:02:48 UTC], Process Info: Memory Usage [1347833856], Memory Size [4433350656], Memory % [21.96], CPU Time [594992.0], CPU % [1.69], Priority [23]

Based on the above, the vim broker has swapped out 850 MB of memory and its resident memory has grown to 1.2 GB.
I also see this error in evm.log:

[----] I, [2015-03-30T06:38:27.231344 #24876:74fea0] INFO -- : MIQ(EmsRefresh::Refreshers::RhevmRefresher.post_refresh) EMS: [RHEV 3.4], id: [10000000000002] Performing post-refresh operations for VmOrTemplate instances...
[----] I, [2015-03-30T06:38:27.246208 #24876:74fea0] INFO -- : MIQ(EmsRefresh::Refreshers::RhevmRefresher.post_refresh) EMS: [RHEV 3.4], id: [10000000000002] Performing post-refresh operations for VmOrTemplate instances...Complete
[----] I, [2015-03-30T06:38:27.246379 #24876:74fea0] INFO -- : MIQ(EmsRefresh::Refreshers::RhevmRefresher.refresh) Refreshing all targets...Complete
[----] I, [2015-03-30T06:38:27.246893 #24876:74fea0] INFO -- : MIQ(MiqQueue.delivered) Message id: [10000000245942], State: [ok], Delivered in [6.693636865] seconds
[----] I, [2015-03-30T06:38:27.274594 #24876:74fea0] INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [10000000245947], MiqWorker id: [10000000000782], Zone: [default], Role: [ems_inventory], Server: [], Ident: [ems_10000000000002], Target id: [], Instance id: [], Task id: [], Command: [EmsRefresh.refresh], Timeout: [7200], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [[["VmRedhat", 10000000001181], ["EmsRedhat", 10000000000002]]], Dequeued in: [4.819344409] seconds
[----] I, [2015-03-30T06:38:27.274800 #24876:74fea0] INFO -- : MIQ(MiqQueue.deliver) Message id: [10000000245947], Delivering...
[----] E, [2015-03-30T06:38:27.283882 #24876:74fea0] ERROR -- : MIQ(MiqQueue.deliver) Message id: [10000000245947], Error: [undefined method `emstype' for nil:NilClass]
[----] E, [2015-03-30T06:38:27.284168 #24876:74fea0] ERROR -- : [NoMethodError]: undefined method `emstype' for nil:NilClass  Method:[rescue in deliver]
[----] E, [2015-03-30T06:38:27.284308 #24876:74fea0] ERROR -- : /var/www/miq/vmdb/app/models/ems_refresh.rb:73:in `block in refresh'
/var/www/miq/vmdb/app/models/ems_refresh.rb:67:in `each'
/var/www/miq/vmdb/app/models/ems_refresh.rb:67:in `group_by'
/var/www/miq/vmdb/app/models/ems_refresh.rb:67:in `refresh'
/var/www/miq/vmdb/app/models/miq_queue.rb:356:in `block in deliver'
/opt/rh/ruby200/root/usr/share/ruby/timeout.rb:66:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:352:in `deliver'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:107:in `deliver_queue_message'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:135:in `deliver_message'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:152:in `block in do_work'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:146:in `loop'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:146:in `do_work'
/var/www/miq/vmdb/lib/workers/worker_base.rb:317:in `block in do_work_loop'
/var/www/miq/vmdb/lib/workers/worker_base.rb:314:in `loop'
/var/www/miq/vmdb/lib/workers/worker_base.rb:314:in `do_work_loop'
/var/www/miq/vmdb/lib/workers/worker_base.rb:135:in `run'
/var/www/miq/vmdb/lib/workers/worker_base.rb:123:in `start'
/var/www/miq/vmdb/lib/workers/worker_base.rb:23:in `start_worker'
/var/www/miq/vmdb/lib/workers/bin/worker.rb:5:in `<top (required)>'
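For reference, this NoMethodError is the classic symptom of a refresh target whose provider association went away between enqueue and dequeue: grouping targets by their provider's emstype raises exactly this error when that lookup returns nil. A stand-alone sketch of the failure mode (hypothetical Struct stand-ins, not the actual ManageIQ models):

# Hypothetical illustration only -- not the actual ems_refresh.rb code.
Target = Struct.new(:name, :ext_management_system)
Ems    = Struct.new(:emstype)

targets = [
  Target.new("vm1", Ems.new("rhevm")),
  Target.new("vm2", nil)   # provider association gone after the message was queued
]

# the refresh path groups targets by provider before refreshing; with a nil EMS:
targets.group_by { |t| t.ext_management_system.emstype }
# => NoMethodError: undefined method `emstype' for nil:NilClass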
Nandini,

I ssh'd in and I don't see the logs showing "error reading from remote server." Is there anything special in terms of what the vmware broker is doing? For example, it sounds like you have only 1 vsphere. Are you seeing lots of events coming in (perhaps the broker is leaking when getting updates for events)?

I do see that the virtual memory size of the broker process is growing:

[----] I, [2015-03-28T03:33:04.805308 #7424:c7be9c] INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-28 07:32:57 UTC], Process Info: Memory Usage [549744640], Memory Size [1576398848], Memory % [8.96], CPU Time [587907.0], CPU % [3.23], Priority [23]
[----] I, [2015-03-29T03:22:51.937815 #7424:c7be9c] INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-29 07:22:37 UTC], Process Info: Memory Usage [1241198592], Memory Size [3004997632], Memory % [20.23], CPU Time [591192.0], CPU % [2.21], Priority [23]
[----] I, [2015-03-30T03:07:48.479798 #7424:c7be9c] INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-30 07:07:27 UTC], Process Info: Memory Usage [1369726976], Memory Size [4376395776], Memory % [22.32], CPU Time [594849.0], CPU % [1.7], Priority [23]
Joe,

That's correct. The appliance is managing only 1 vsphere provider. I am not aware of anything special that the vmware broker is doing.

I checked the number of events corresponding to the vsphere provider since 3/29/2015:

vmdb_production=# select count(*) from ems_events where ems_id = 10000000000001;
 count
-------
 29222
(1 row)

Also, I rebooted the appliance on 3/30, after which the proxy error went away.
I looked at the appliance again and now the broker seems stable. I ran this:

grep -E "MiqVimBrokerWorker.+Memory Size" log/evm.log

first line:
[----] I, [2015-03-31T03:34:36.576145 #2235:8cdea8] INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000903], PID [2267], GUID [d2a265a8-d700-11e4-b8f5-00505696bd7e], Last Heartbeat [2015-03-31 07:34:21 UTC], Process Info: Memory Usage [311226368], Memory Size [1164423168], Memory % [5.07], CPU Time [235519.0], CPU % [4.63], Priority [23]

last line (13 hours later):
[----] I, [2015-03-31T16:54:40.223722 #2230:5a3ea8] INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000903], PID [2267], GUID [d2a265a8-d700-11e4-b8f5-00505696bd7e], Last Heartbeat [2015-03-31 20:54:33 UTC], Process Info: Memory Usage [317313024], Memory Size [1170448384], Memory % [5.17], CPU Time [446312.0], CPU % [4.54], Priority [23]

The memory size hasn't changed very much in 13 hours. Can you try to see if you can reliably reproduce it? If we can reproduce it, we might be able to dump the object stats that ruby knows of and narrow down the scope of the growing virtual memory size.
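For what it's worth, a minimal way to dump the object stats that ruby knows of is the standard MRI introspection APIs (this is a generic sketch, not instrumentation that exists in the worker; on an appliance the output could be routed through $log instead of stdout):

require 'objspace'   # needed for ObjectSpace.memsize_of_all

puts "GC stats:             #{GC.stat.inspect}"
puts "Live objects by type: #{ObjectSpace.count_objects.inspect}"
puts "Approx. heap bytes:   #{ObjectSpace.memsize_of_all}"

Comparing two such dumps taken a few hours apart should show which object types (strings, arrays, T_DATA, etc.) account for the growth.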
Joe,

The memory size for the VimBrokerWorker seems to have gone up. The appliance has been up since 2015-03-31.

# grep -E "MiqVimBrokerWorker.+Memory Size" log/evm.log
[----] I, [2015-04-06T12:09:45.000498 #2235:8cdea8] INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000002012], PID [25469], GUID [4f052bfa-db28-11e4-b8f5-00505696bd7e], Last Heartbeat [2015-04-06 16:09:37 UTC], Process Info: Memory Usage [1010511872], Memory Size [2632753152], Memory % [16.47], CPU Time [97269.0], CPU % [0.16], Priority [23]
I've recreated this locally using a very recent nightly centos upstream build. It's interesting to note that we have lots of CLOSE_WAIT file descriptors for the vim broker worker process. 1574 is the vim broker process id, and 36692 is the DRb port the broker uses:

[root@10-16-7-74 vmdb]# lsof -p 1574 | grep 36692
ruby 1574 root 21u IPv4 1942300 0t0 TCP localhost:36692 (LISTEN)
ruby 1574 root 23u IPv4 1942578 0t0 TCP localhost:36692->localhost:55896 (CLOSE_WAIT)
ruby 1574 root 25u IPv4 1990166 0t0 TCP localhost:36692->localhost:36085 (CLOSE_WAIT)
ruby 1574 root 26u IPv4 1960686 0t0 TCP localhost:36692->localhost:59410 (ESTABLISHED)
ruby 1574 root 27u IPv4 1956682 0t0 TCP localhost:36692->localhost:58843 (CLOSE_WAIT)
ruby 1574 root 28u IPv4 1957109 0t0 TCP localhost:36692->localhost:58898 (CLOSE_WAIT)
ruby 1574 root 30u IPv4 1960745 0t0 TCP localhost:36692->localhost:59425 (CLOSE_WAIT)
ruby 1574 root 31u IPv4 1960748 0t0 TCP localhost:36692->localhost:59426 (CLOSE_WAIT)
ruby 1574 root 32u IPv4 1961370 0t0 TCP localhost:36692->localhost:59582 (CLOSE_WAIT)
ruby 1574 root 33u IPv4 1963132 0t0 TCP localhost:36692->localhost:59876 (CLOSE_WAIT)
ruby 1574 root 34u IPv4 1963134 0t0 TCP localhost:36692->localhost:59877 (CLOSE_WAIT)
ruby 1574 root 35u IPv4 1965793 0t0 TCP localhost:36692->localhost:60419 (CLOSE_WAIT)
ruby 1574 root 36u IPv4 1965794 0t0 TCP localhost:36692->localhost:60418 (CLOSE_WAIT)
ruby 1574 root 37u IPv4 1967377 0t0 TCP localhost:36692->localhost:60567 (CLOSE_WAIT)
ruby 1574 root 38u IPv4 1967387 0t0 TCP localhost:36692->localhost:60568 (CLOSE_WAIT)
ruby 1574 root 39u IPv4 1968849 0t0 TCP localhost:36692->localhost:60657 (CLOSE_WAIT)
ruby 1574 root 40u IPv4 1968852 0t0 TCP localhost:36692->localhost:60658 (CLOSE_WAIT)
ruby 1574 root 41u IPv4 1971497 0t0 TCP localhost:36692->localhost:32983 (CLOSE_WAIT)
ruby 1574 root 42u IPv4 1971507 0t0 TCP localhost:36692->localhost:32984 (CLOSE_WAIT)
ruby 1574 root 43u IPv4 1977183 0t0 TCP localhost:36692->localhost:33631 (CLOSE_WAIT)
ruby 1574 root 44u IPv4 1977294 0t0 TCP localhost:36692->localhost:33642 (CLOSE_WAIT)
ruby 1574 root 45u IPv4 1977300 0t0 TCP localhost:36692->localhost:33643 (CLOSE_WAIT)
ruby 1574 root 46u IPv4 1977308 0t0 TCP localhost:36692->localhost:33644 (CLOSE_WAIT)
ruby 1574 root 47u IPv4 1980418 0t0 TCP localhost:36692->localhost:34238 (CLOSE_WAIT)
ruby 1574 root 48u IPv4 1980428 0t0 TCP localhost:36692->localhost:34239 (CLOSE_WAIT)
ruby 1574 root 49u IPv4 1983904 0t0 TCP localhost:36692->localhost:35088 (CLOSE_WAIT)
ruby 1574 root 50u IPv4 1983914 0t0 TCP localhost:36692->localhost:35089 (CLOSE_WAIT)
ruby 1574 root 51u IPv4 1985284 0t0 TCP localhost:36692->localhost:35180 (CLOSE_WAIT)
ruby 1574 root 52u IPv4 1985290 0t0 TCP localhost:36692->localhost:35181 (CLOSE_WAIT)
ruby 1574 root 53u IPv4 1986927 0t0 TCP localhost:36692->localhost:35351 (CLOSE_WAIT)
ruby 1574 root 54u IPv4 1986933 0t0 TCP localhost:36692->localhost:35352 (CLOSE_WAIT)
ruby 1574 root 55u IPv4 1990176 0t0 TCP localhost:36692->localhost:36086 (CLOSE_WAIT)
ruby 1574 root 56u IPv4 1993521 0t0 TCP localhost:36692->localhost:36836 (CLOSE_WAIT)
ruby 1574 root 57u IPv4 1994095 0t0 TCP localhost:36692->localhost:36948 (CLOSE_WAIT)
ruby 1574 root 58u IPv4 1994098 0t0 TCP localhost:36692->localhost:36949 (CLOSE_WAIT)
ruby 1574 root 59u IPv4 1997517 0t0 TCP localhost:36692->localhost:37735 (CLOSE_WAIT)
ruby 1574 root 60u IPv4 1997520 0t0 TCP localhost:36692->localhost:37736 (CLOSE_WAIT)
ruby 1574 root 61u IPv4 2000181 0t0 TCP localhost:36692->localhost:38252 (CLOSE_WAIT)
ruby 1574 root 62u IPv4 2000191 0t0 TCP localhost:36692->localhost:38253 (CLOSE_WAIT)
ruby 1574 root 63u IPv4 2003734 0t0 TCP localhost:36692->localhost:39124 (CLOSE_WAIT)
ruby 1574 root 64u IPv4 2003744 0t0 TCP localhost:36692->localhost:39125 (CLOSE_WAIT)
ruby 1574 root 65u IPv4 2007717 0t0 TCP localhost:36692->localhost:40121 (CLOSE_WAIT)
ruby 1574 root 66u IPv4 2007724 0t0 TCP localhost:36692->localhost:40122 (CLOSE_WAIT)
ruby 1574 root 67u IPv4 2011000 0t0 TCP localhost:36692->localhost:40759 (CLOSE_WAIT)
ruby 1574 root 68u IPv4 2011018 0t0 TCP localhost:36692->localhost:40762 (CLOSE_WAIT)
ruby 1574 root 69u IPv4 2013323 0t0 TCP localhost:36692->localhost:41316 (CLOSE_WAIT)
ruby 1574 root 70u IPv4 2013593 0t0 TCP localhost:36692->localhost:41329 (ESTABLISHED)
ruby 1574 root 71u IPv4 2016131 0t0 TCP localhost:36692->localhost:41603 (CLOSE_WAIT)
ruby 1574 root 72u IPv4 2016134 0t0 TCP localhost:36692->localhost:41604 (CLOSE_WAIT)
ruby 1574 root 73u IPv4 2018926 0t0 TCP localhost:36692->localhost:42167 (CLOSE_WAIT)
ruby 1574 root 74u IPv4 2018936 0t0 TCP localhost:36692->localhost:42168 (CLOSE_WAIT)
ruby 1574 root 75u IPv4 2020485 0t0 TCP localhost:36692->localhost:42252 (ESTABLISHED)
ruby 1574 root 76u IPv4 2020490 0t0 TCP localhost:36692->localhost:42253 (ESTABLISHED)
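One quick way to watch the leak over time (assuming the same broker PID and DRb port as above):

# count CLOSE_WAIT sockets held by the broker every minute
watch -n 60 'lsof -p 1574 | grep -c CLOSE_WAIT'

# or break the broker's DRb sockets down by TCP state
lsof -p 1574 | grep 36692 | awk '{print $NF}' | sort | uniq -c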
After instrumenting an appliance with logging [1], we see connects always followed later by a disconnect for the same process in the miq_fault_tolerant_vim code, until we run into a soap fault error:

[Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct.

After this fault error occurs, we only see logging of the connect at the miq_fault_tolerant_vim level; no disconnects seem to occur. It's not clear this is causing the memory increase, but it's very interesting. We also grep'd QE's problematic appliances exhibiting the increased broker memory usage, and the error "A specified parameter was not correct" occurs there as well. We may not be recovering from this error properly, possibly deadlocking the broker.

[1]
diff --git a/lib/util/miq_fault_tolerant_vim.rb b/lib/util/miq_fault_tolerant_vim.rb
index 388a3c7..3ffd476 100644
--- a/lib/util/miq_fault_tolerant_vim.rb
+++ b/lib/util/miq_fault_tolerant_vim.rb
@@ -80,6 +81,9 @@ class MiqFaultTolerantVim
   end
 
   def _execute_with_broker(state)
+    msg = caller.join("\nXXX ")
+    $log.info("XXX #{__method__} #{state} #{msg}")
+
     log_header = "MIQ(#{self.class.name}._execute_with_broker)"
     retries = 0
     begin
Grep'd from a QE appliance running ruby 2.0.0, the backtrace is nearly identical.

log/evm.log-20150419:[----] E, [2015-04-18T11:56:55.299576 #22535:a31eac] ERROR -- : [Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct.
log/evm.log-20150419-' } Method:[rescue in deliver]
log/evm.log-20150419-[----] E, [2015-04-18T11:56:55.299645 #22535:a31eac] ERROR -- : (druby://127.0.0.1:52244) /opt/rh/cfme-gemset/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:198:in `on_fault'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/cfme-gemset/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:288:in `dispatch'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/cfme-gemset/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:192:in `invoke'
log/evm.log-20150419-(druby://127.0.0.1:52244) /var/www/miq/lib/VMwareWebService/VimService.rb:676:in `queryPerfProviderSummary'
log/evm.log-20150419-(druby://127.0.0.1:52244) /var/www/miq/lib/VMwareWebService/MiqVimPerfHistory.rb:100:in `queryProviderSummary'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1588:in `perform_without_block'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1548:in `perform'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1626:in `block (2 levels) in main_loop'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1622:in `loop'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1622:in `block in main_loop'
log/evm.log-20150419-(druby://127.0.0.1:52244) /var/www/miq/vmdb/lib/extensions/ar_thread.rb:22:in `block in start_with_release'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/capture/vim.rb:40:in `realtime_interval'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:88:in `block in perf_capture_intervals'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:86:in `each'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:86:in `perf_capture_intervals'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:46:in `block in perf_collect_metrics_vim'
log/evm.log-20150419-/var/www/miq/lib/util/extensions/miq-benchmark.rb:12:in `realtime_store'
log/evm.log-20150419-/var/www/miq/lib/util/extensions/miq-benchmark.rb:31:in `realtime_block'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:46:in `perf_collect_metrics_vim'
Created attachment 1017159 [details]
Spreadsheet with broker memory growth as reported by top_output
Created attachment 1017160 [details]
Graph of vim broker memory zoomed out

Shows the slow growth of both RES and VIRT memory for the vim broker process, where the growth appears to start "roughly" at the same time we last see "_execute_with_broker" called with "on_disconnect" [1].

[1] https://github.com/ManageIQ/manageiq/blob/9adecd6b0ebc2b1e4dfb045a97f53c6d2276b3e6/lib/util/miq_fault_tolerant_vim.rb#L83

Prior to this point, we see:
1) "_execute_with_broker" called with "on_connect"
2) zero or more "_execute_with_broker" called with "on_execute"
3) "_execute_with_broker" called with "on_disconnect"

After this point, we see only:
1) "_execute_with_broker" called with "on_connect"
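For context, the normal lifecycle is roughly the pattern below (hypothetical stand-ins, not the actual miq_fault_tolerant_vim code). Once the disconnect step stops firing, every later "on_connect" leaves another session and DRb socket behind in the broker, which is consistent with the CLOSE_WAIT build-up and the VIRT growth in these graphs:

# Rough sketch of the expected connect/execute/disconnect pairing
class FakeBroker
  def connect;     puts "on_connect";        :session; end
  def execute(op); puts "on_execute #{op}";            end
  def disconnect;  puts "on_disconnect";               end
end

def with_broker_session(broker)
  session = broker.connect      # corresponds to the "on_connect" log lines
  yield broker, session         # zero or more "on_execute" calls
ensure
  broker.disconnect             # corresponds to the "on_disconnect" log lines
end

with_broker_session(FakeBroker.new) { |b, _s| b.execute(:queryPerf) }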
Created attachment 1017162 [details]
Zoomed in graph of the vim broker memory usage

This graph is zoomed in, and we can clearly see that the last miq_fault_tolerant_vim "on_disconnect" (with only "on_connect" thereafter) occurs at 19:15:24. Memory stays relatively stable, but then at 19:30:55 we see VIRT memory start to rise as reported by top.

Note, it looks like my appliance has a 15 minute tcp retries timeout, see below:

http://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux
"tcp_retries2 - INTEGER
 The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout."

# cat /proc/sys/net/ipv4/tcp_retries2
15
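For anyone who wants to experiment with this, the retransmission limit can be lowered temporarily with sysctl (this affects every TCP connection on the box, so it is only a diagnostic aid, not a fix):

sysctl net.ipv4.tcp_retries2                 # same value as the /proc file above
sysctl -w net.ipv4.tcp_retries2=8            # takes effect immediately, not persistent
# add "net.ipv4.tcp_retries2 = 8" to /etc/sysctl.conf to persist across reboots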
Other findings:

1) We see the "[Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct." error many times around the "last" time we see miq_fault_tolerant_vim try to disconnect (2015-04-20T19:15:24.095238).

2) We see many Handsoap::Fault errors like the above: a total of 9497 Handsoap::Fault errors in evm.log-20150421, of which 9494 are 'A specified parameter was not correct.' errors. Many thousands occur before 20T19:15:24.
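The counts above can be reproduced with greps along these lines:

grep -c "Handsoap::Fault" log/evm.log-20150421
grep "Handsoap::Fault" log/evm.log-20150421 | grep -c "A specified parameter was not correct"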
Rich,

I can send you any logs or graphs you may need to help. At this point, we're trying (unsuccessfully) to recreate this issue with a MiqVimBrokerClient/MiqVimBrokerServer combination locally on our appliance. Any thoughts on what we can try next?

Jason suspects we might be hitting a broker deadlock, but clearly just running this broker overnight with a single vsphere produces hundreds of CLOSE_WAITs, and VIRT memory doubles from 1000m to 2000m. We'd like to narrow it down outside of the full worker setup but haven't had any luck yet.
Created attachment 1017697 [details]
2015-04-22T19-58-34_zoomed_out (all time samples)
Created attachment 1017698 [details]
2015-04-22T19-58-34_zoomed_in (1 hour before and 2 hours after last good vim broker client connect)
Did you run with SyncDebug?
RE: using httpclient 2.2.7, the problem still exists; the symptoms of growing VIRT and RES memory and CLOSE_WAIT TCP connections on the DRb port are identical with 2.2.7 and with 2.5.3.3. Note, we shipped 2.2.7 with our first release, dating back to the cfme 5.2.0.37 build: https://access.redhat.com/errata/RHEA-2013:1487
At this point, we have two major problems to fix:

1) isolate and fix the broker lock problem
2) improve the broker client code so that the other workers can report the broker as unavailable if their queue task times out while they're trying to connect to vim via the broker (a rough sketch of this idea follows below)

The broker should not remain up and consuming memory if it's not accomplishing any useful work. It should be killed and a new broker started.
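A very rough sketch of idea (2), purely hypothetical (none of these methods exist today): a worker-side probe that gives up quickly and asks for a broker restart instead of queueing indefinitely behind a wedged connection lock.

require 'timeout'

# Hypothetical helper: false if the broker DRb endpoint does not answer in time.
def broker_responsive?(broker, probe_timeout = 30)
  Timeout.timeout(probe_timeout) { broker.ping; true }   # broker.ping is a made-up probe call
rescue Timeout::Error, StandardError
  false
end

# Hypothetical caller: flag the broker worker for restart rather than blocking on it.
# request_broker_restart! unless broker_responsive?(broker_handle)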
Here is Rich's added logging showing a lock timing out:

[----] E, [2015-04-27T17:35:35.153293 #1634:4cba1fc] ERROR -- : VimSyncDebug - Lock timeout: thread 78668700 has held MiqVimBroker#lockHash[vsphere_HOSTNAME_administrator] for 417 seconds

Here is the thread waiting on the lock:

[----] E, [2015-04-27T17:35:35.153405 #1634:4cba1fc] ERROR -- : VimSyncDebug - Lock acquisition: Start backtrace
[----] E, [2015-04-27T17:35:35.153522 #1634:4cba1fc] ERROR -- : /var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:368:in `connSync'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:412:in `getMiqVim'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1588:in `perform_without_block'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1548:in `perform'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1626:in `block (2 levels) in main_loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `block in main_loop'
/var/www/miq/vmdb/lib/extensions/ar_thread.rb:22:in `block in start_with_release'
[----] E, [2015-04-27T17:35:35.153667 #1634:4cba1fc] ERROR -- : VimSyncDebug - Lock acquisition: End backtrace

Here is the locking thread backtrace:

[----] E, [2015-04-27T17:35:35.153782 #1634:4cba1fc] ERROR -- : VimSyncDebug - Locking thread: Start backtrace
[----] E, [2015-04-27T17:35:35.154287 #1634:4cba1fc] ERROR -- : /opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/timeout.rb:70:in `join'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/timeout.rb:70:in `timeout'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/timeout.rb:97:in `timeout'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:823:in `request'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:726:in `post'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:280:in `dispatch'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:192:in `invoke'
/var/www/miq/lib/VMwareWebService/VimService.rb:865:in `retrieveProperties'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:1955:in `getMoProp_local'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:1992:in `getMoProp'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:370:in `currentSession'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:376:in `isAlive?'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:415:in `block in getMiqVim'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:374:in `call'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:374:in `connSync'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:412:in `getMiqVim'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1588:in `perform_without_block'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1548:in `perform'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1626:in `block (2 levels) in main_loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `block in main_loop'
/var/www/miq/vmdb/lib/extensions/ar_thread.rb:22:in `block in start_with_release'
[----] E, [2015-04-27T17:35:35.154429 #1634:4cba1fc] ERROR -- : VimSyncDebug - Locking thread: End backtrace
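A self-contained illustration of the structural problem these backtraces show: the broker holds its per-connection lock (connSync) while doing a blocking network call wrapped in the stdlib Timeout, so if that inner call (or the Timeout machinery itself, stuck here at `join' in timeout.rb) never returns, every other DRb caller piles up behind the lock and the broker appears dead. Hypothetical names only, with a sleep standing in for the wedged connect:

require 'thread'

CONN_LOCK = Mutex.new

def get_miq_vim(label)
  CONN_LOCK.synchronize do          # analogue of MiqVimBroker#connSync
    puts "#{label}: acquired lock, connecting..."
    sleep 5                         # stand-in for an httpclient connect that never returns
    puts "#{label}: connected"
  end
end

holder = Thread.new { get_miq_vim("worker-1") }   # holds the lock, stuck in "connect"
sleep 0.5
waiter = Thread.new { get_miq_vim("worker-2") }   # every later caller queues on the lock
sleep 0.5
puts "worker-2 while worker-1 holds the lock: #{waiter.status}"   # => "sleep" (blocked)
holder.join
waiter.join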
Upstream PR with Timeout.timeout re-implementation for the broker: https://github.com/ManageIQ/manageiq/pull/2759
See upstream PR for discussion.
New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/5a4b8473a45e721a79a2b46325037808e30977c7

commit 5a4b8473a45e721a79a2b46325037808e30977c7
Author:     Richard Oliveri <roliveri>
AuthorDate: Mon Apr 27 14:40:41 2015 -0400
Commit:     Richard Oliveri <roliveri>
CommitDate: Fri May 1 12:35:24 2015 -0400

    Added additional lock debugging in broker.

    Additional lock debugging added in order to resolve broker thread hang issue.

    https://bugzilla.redhat.com/show_bug.cgi?id=1207018

 lib/VMwareWebService/MiqVimBroker.rb        | 46 +++++++++++++---
 lib/VMwareWebService/MiqVimInventory.rb     |  5 +-
 lib/VMwareWebService/VimSyncDebug.rb        | 81 ++++------------------------
 lib/VMwareWebService/broker_sync_debug.rb   | 33 ++++++++++++
 lib/VMwareWebService/vim_base_sync_debug.rb | 68 ++++++++++++++++++++++++
 5 files changed, 155 insertions(+), 78 deletions(-)
 create mode 100644 lib/VMwareWebService/broker_sync_debug.rb
 create mode 100644 lib/VMwareWebService/vim_base_sync_debug.rb
Verified the following:

1) The vim broker doesn't leak DRb TCP file descriptor references.
2) Verified through top -H -p <pid of broker> that the broker process has ~20 threads and that the number of threads doesn't grow to several hundred.
3) My appliance has been running for 5 days; there are no proxy errors and the web UI has remained accessible since then.

Verified in 5.6.0.0.26.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1100.html