1207018 – [perf] vim broker leaking drb tcp file descriptor references

Bug 1207018 - [perf] vim broker leaking drb tcp file descriptor references

Summary: [perf] vim broker leaking drb tcp file descriptor references

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	GA
Target Release:	5.4.0
Assignee:	Joe Rafaniello
QA Contact:	Nandini Chandra
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2015-03-30 05:03 UTC by Nandini Chandra
Modified:	2015-06-17 01:58 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-06-16 12:56:01 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Spreadsheet with broker memory grow as reported by top_output (111.00 KB, application/x-ole-storage) 2015-04-21 22:30 UTC, Joe Rafaniello	no flags	Details
Graph of vim broker memory zoomed out (70.66 KB, image/png) 2015-04-21 22:36 UTC, Joe Rafaniello	no flags	Details
2015-04-22T19-58-34_zoomed_out (all time samples) (70.72 KB, image/png) 2015-04-23 03:46 UTC, Joe Rafaniello	no flags	Details
2015-04-22T19-58-34_zoomed_in (1 hour before and 2 hours after last good vim broker client connect) (65.54 KB, image/png) 2015-04-23 03:47 UTC, Joe Rafaniello	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1210480	high	CLOSED	ESXi host smartstate analysis fails	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1232490	high	CLOSED	[RFE] prevent too many workers on an appliance	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2015:1100	normal	SHIPPED_LIVE	CFME 5.4.0 bug fixes, and enhancement update	2015-06-16 16:28:42 UTC

Internal Links: 1210480 1232490

Description Nandini Chandra 2015-03-30 05:03:49 UTC

Description of problem:
------------------------
I'm running into Proxy error ~48 hours after having built my appliance. 
I have enabled capacity and utilization collection on my appliance.
The appliance is managing two infra providers : RHEV and vsphere.

Memory usage from the output of top :

Mem:   5992864k total,  5682332k used,   310532k free,    20456k buffers
Swap:  9957372k total,   109800k used,  9847572k free,   724636k cached


Version-Release number of selected component (if applicable):
------------------------
5.4.0.0.11


How reproducible:
----------------
Unsure 


Steps to Reproduce:
-----------------
1.
2.
3.


Actual results:
---------------


Expected results:
----------------


Additional info:
----------------

Comment 6 Nandini Chandra 2015-03-30 17:05:10 UTC

1)Top output

PID USER      VIRT  PR  NI  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND                                                                                                        
16149 root     4698m  23   3 1.2g 2244 S  0.0 21.6  99:27.29 850m ruby   


2)Snippet from evm.log,

[----] I, [2015-03-30T04:03:00.949158 #7424:c7be9c]  INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-30 08:02:48 UTC], Process Info: Memory Usage [1347833856], Memory Size [4433350656], Memory % [21.96], CPU Time [594992.0], CPU % [1.69], Prpriority [23]   

Based on the above,vimbroker has swapped 850 mb of memory and its resident memory has grown to 1.2gb.

Comment 7 Nandini Chandra 2015-03-30 17:06:38 UTC

I also see this error in evm.log

[----] I, [2015-03-30T06:38:27.231344 #24876:74fea0]  INFO -- : MIQ(EmsRefresh::Refreshers::RhevmRefresher.post_refresh) EMS: [RHEV 3.4], id: [10000000000002] Performing post-refresh operations for VmOrTemplate instances...
[----] I, [2015-03-30T06:38:27.246208 #24876:74fea0]  INFO -- : MIQ(EmsRefresh::Refreshers::RhevmRefresher.post_refresh) EMS: [RHEV 3.4], id: [10000000000002] Performing post-refresh operations for VmOrTemplate instances...Complete
[----] I, [2015-03-30T06:38:27.246379 #24876:74fea0]  INFO -- : MIQ(EmsRefresh::Refreshers::RhevmRefresher.refresh) Refreshing all targets...Complete
[----] I, [2015-03-30T06:38:27.246893 #24876:74fea0]  INFO -- : MIQ(MiqQueue.delivered)  Message id: [10000000245942], State: [ok], Delivered in [6.693636865] seconds
[----] I, [2015-03-30T06:38:27.274594 #24876:74fea0]  INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [10000000245947], MiqWorker id: [10000000000782], Zone: [default], Role: [ems_inventory], Server: [], Ident: [ems_10000000000002], Target id: [], Instance id: [], Task id: [], Command: [EmsRefresh.refresh], Timeout: [7200], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [[["VmRedhat", 10000000001181], ["EmsRedhat", 10000000000002]]], Dequeued in: [4.819344409] seconds
[----] I, [2015-03-30T06:38:27.274800 #24876:74fea0]  INFO -- : MIQ(MiqQueue.deliver)    Message id: [10000000245947], Delivering...
[----] E, [2015-03-30T06:38:27.283882 #24876:74fea0] ERROR -- : MIQ(MiqQueue.deliver)    Message id: [10000000245947], Error: [undefined method `emstype' for nil:NilClass]
[----] E, [2015-03-30T06:38:27.284168 #24876:74fea0] ERROR -- : [NoMethodError]: undefined method `emstype' for nil:NilClass  Method:[rescue in deliver]
[----] E, [2015-03-30T06:38:27.284308 #24876:74fea0] ERROR -- : /var/www/miq/vmdb/app/models/ems_refresh.rb:73:in `block in refresh'
/var/www/miq/vmdb/app/models/ems_refresh.rb:67:in `each'
/var/www/miq/vmdb/app/models/ems_refresh.rb:67:in `group_by'
/var/www/miq/vmdb/app/models/ems_refresh.rb:67:in `refresh'
/var/www/miq/vmdb/app/models/miq_queue.rb:356:in `block in deliver'
/opt/rh/ruby200/root/usr/share/ruby/timeout.rb:66:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:352:in `deliver'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:107:in `deliver_queue_message'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:135:in `deliver_message'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:152:in `block in do_work'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:146:in `loop'
/var/www/miq/vmdb/lib/workers/queue_worker_base.rb:146:in `do_work'
/var/www/miq/vmdb/lib/workers/worker_base.rb:317:in `block in do_work_loop'
/var/www/miq/vmdb/lib/workers/worker_base.rb:314:in `loop'
/var/www/miq/vmdb/lib/workers/worker_base.rb:314:in `do_work_loop'
/var/www/miq/vmdb/lib/workers/worker_base.rb:135:in `run'
/var/www/miq/vmdb/lib/workers/worker_base.rb:123:in `start'
/var/www/miq/vmdb/lib/workers/worker_base.rb:23:in `start_worker'
/var/www/miq/vmdb/lib/workers/bin/worker.rb:5:in `<top (required)>'

Comment 8 Joe Rafaniello 2015-03-30 19:12:14 UTC

Nandini, I ssh'd in and I don't see the logs showing "error reading from remote server."  

Is there anything special in terms of what the vmware broker is doing?  For example, it sounds like you have only 1 vsphere.  Are you seeing lots of events coming in?  (perhaps the broker is leaking when getting updates for events)?


I do see that we are growing in virtual memory size of the broker process:


[----] I, [2015-03-28T03:33:04.805308 #7424:c7be9c]  INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-28 07:32:57 UTC], Process Info: Memory Usage [549744640], Memory Size [1576398848], Memory % [8.96], CPU Time [587907.0], CPU % [3.23], Priority [23]


[----] I, [2015-03-29T03:22:51.937815 #7424:c7be9c]  INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-29 07:22:37 UTC], Process Info: Memory Usage [1241198592], Memory Size [3004997632], Memory % [20.23], CPU Time [591192.0], CPU % [2.21], Priority [23]

[----] I, [2015-03-30T03:07:48.479798 #7424:c7be9c]  INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000018], PID [16149], GUID [6f5a25bc-d3bd-11e4-bc18-00505696bd7e], Last Heartbeat [2015-03-30 07:07:27 UTC], Process Info: Memory Usage [1369726976], Memory Size [4376395776], Memory % [22.32], CPU Time [594849.0], CPU % [1.7], Priority [23]

Comment 9 Nandini Chandra 2015-03-31 17:44:37 UTC

Joe,

That's correct.The appliance is managing only 1 vsphere provider.I am not aware of anything special that the vmware broker is doing.

I did the check the number of events corresponding to the vsphere provider since 
3/29/2015.

vmdb_production=# select count(*) from ems_events where ems_id = 10000000000001;
 count 
-------
 29222
(1 row)

Also,I rebooted the appliance on 3/30 after which the proxy error went away.

Comment 10 Joe Rafaniello 2015-03-31 21:03:46 UTC

I looked at the appliance again and now the broker seems stable.

I ran this: grep -E "MiqVimBrokerWorker.+Memory Size" log/evm.log

first line:
[----] I, [2015-03-31T03:34:36.576145 #2235:8cdea8]  INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000903], PID [2267], GUID [d2a265a8-d700-11e4-b8f5-00505696bd7e], Last Heartbeat [2015-03-31 07:34:21 UTC], Process Info: Memory Usage [311226368], Memory Size [1164423168], Memory % [5.07], CPU Time [235519.0], CPU % [4.63], Priority [23]

last line (13 hours later)
[----] I, [2015-03-31T16:54:40.223722 #2230:5a3ea8]  INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000000903], PID [2267], GUID [d2a265a8-d700-11e4-b8f5-00505696bd7e], Last Heartbeat [2015-03-31 20:54:33 UTC], Process Info: Memory Usage [317313024], Memory Size [1170448384], Memory % [5.17], CPU Time [446312.0], CPU % [4.54], Priority [23]

The memory size hasn't changed very much in 13 hours.

Can you try to see if you can reliably reproduce it?

If we reproduce it, we might be able to dump the object stats that ruby knows of and narrow down the scope of the growing virtual memory size.

Comment 11 Nandini Chandra 2015-04-06 16:28:31 UTC

Joe,

The memory size for VimBrokerWorker seems to have gone up.The appliance has been up since 2015-03-31.

#grep -E "MiqVimBrokerWorker.+Memory Size" log/evm.log

[----] I, [2015-04-06T12:09:45.000498 #2235:8cdea8]  INFO -- : Q-task_id([log_status]) MIQ(MiqWorker.log_status) MiqVimBrokerWorker: [VMware Session Broker] Worker ID [10000000 002012], PID [25469], GUID [4f052bfa-db28-11e4-b8f5-00505696bd7e], Last Heartbeat [2015-04-06 16:09:37 UTC], Process Info: Memory Usage [1010511872], Memory Size [2632753152], Memory % [16.47], CPU Time [97269.0], CPU % [0.16], Priority [23]

Comment 12 Joe Rafaniello 2015-04-15 19:50:07 UTC

I've recreated this locally using a very recent nightly centos upstream build.

It's interesting to note that we have lots of CLOSE_WAIT file descriptors for the vim broker worker process.  1574 is the vim broker process id, (36692 is the DRb port the broker uses:

[root@10-16-7-74 vmdb]# lsof -p 1574 |grep 36692
ruby    1574 root   21u  IPv4            1942300      0t0     TCP localhost:36692 (LISTEN)
ruby    1574 root   23u  IPv4            1942578      0t0     TCP localhost:36692->localhost:55896 (CLOSE_WAIT)
ruby    1574 root   25u  IPv4            1990166      0t0     TCP localhost:36692->localhost:36085 (CLOSE_WAIT)
ruby    1574 root   26u  IPv4            1960686      0t0     TCP localhost:36692->localhost:59410 (ESTABLISHED)
ruby    1574 root   27u  IPv4            1956682      0t0     TCP localhost:36692->localhost:58843 (CLOSE_WAIT)
ruby    1574 root   28u  IPv4            1957109      0t0     TCP localhost:36692->localhost:58898 (CLOSE_WAIT)
ruby    1574 root   30u  IPv4            1960745      0t0     TCP localhost:36692->localhost:59425 (CLOSE_WAIT)
ruby    1574 root   31u  IPv4            1960748      0t0     TCP localhost:36692->localhost:59426 (CLOSE_WAIT)
ruby    1574 root   32u  IPv4            1961370      0t0     TCP localhost:36692->localhost:59582 (CLOSE_WAIT)
ruby    1574 root   33u  IPv4            1963132      0t0     TCP localhost:36692->localhost:59876 (CLOSE_WAIT)
ruby    1574 root   34u  IPv4            1963134      0t0     TCP localhost:36692->localhost:59877 (CLOSE_WAIT)
ruby    1574 root   35u  IPv4            1965793      0t0     TCP localhost:36692->localhost:60419 (CLOSE_WAIT)
ruby    1574 root   36u  IPv4            1965794      0t0     TCP localhost:36692->localhost:60418 (CLOSE_WAIT)
ruby    1574 root   37u  IPv4            1967377      0t0     TCP localhost:36692->localhost:60567 (CLOSE_WAIT)
ruby    1574 root   38u  IPv4            1967387      0t0     TCP localhost:36692->localhost:60568 (CLOSE_WAIT)
ruby    1574 root   39u  IPv4            1968849      0t0     TCP localhost:36692->localhost:60657 (CLOSE_WAIT)
ruby    1574 root   40u  IPv4            1968852      0t0     TCP localhost:36692->localhost:60658 (CLOSE_WAIT)
ruby    1574 root   41u  IPv4            1971497      0t0     TCP localhost:36692->localhost:32983 (CLOSE_WAIT)
ruby    1574 root   42u  IPv4            1971507      0t0     TCP localhost:36692->localhost:32984 (CLOSE_WAIT)
ruby    1574 root   43u  IPv4            1977183      0t0     TCP localhost:36692->localhost:33631 (CLOSE_WAIT)
ruby    1574 root   44u  IPv4            1977294      0t0     TCP localhost:36692->localhost:33642 (CLOSE_WAIT)
ruby    1574 root   45u  IPv4            1977300      0t0     TCP localhost:36692->localhost:33643 (CLOSE_WAIT)
ruby    1574 root   46u  IPv4            1977308      0t0     TCP localhost:36692->localhost:33644 (CLOSE_WAIT)
ruby    1574 root   47u  IPv4            1980418      0t0     TCP localhost:36692->localhost:34238 (CLOSE_WAIT)
ruby    1574 root   48u  IPv4            1980428      0t0     TCP localhost:36692->localhost:34239 (CLOSE_WAIT)
ruby    1574 root   49u  IPv4            1983904      0t0     TCP localhost:36692->localhost:35088 (CLOSE_WAIT)
ruby    1574 root   50u  IPv4            1983914      0t0     TCP localhost:36692->localhost:35089 (CLOSE_WAIT)
ruby    1574 root   51u  IPv4            1985284      0t0     TCP localhost:36692->localhost:35180 (CLOSE_WAIT)
ruby    1574 root   52u  IPv4            1985290      0t0     TCP localhost:36692->localhost:35181 (CLOSE_WAIT)
ruby    1574 root   53u  IPv4            1986927      0t0     TCP localhost:36692->localhost:35351 (CLOSE_WAIT)
ruby    1574 root   54u  IPv4            1986933      0t0     TCP localhost:36692->localhost:35352 (CLOSE_WAIT)
ruby    1574 root   55u  IPv4            1990176      0t0     TCP localhost:36692->localhost:36086 (CLOSE_WAIT)
ruby    1574 root   56u  IPv4            1993521      0t0     TCP localhost:36692->localhost:36836 (CLOSE_WAIT)
ruby    1574 root   57u  IPv4            1994095      0t0     TCP localhost:36692->localhost:36948 (CLOSE_WAIT)
ruby    1574 root   58u  IPv4            1994098      0t0     TCP localhost:36692->localhost:36949 (CLOSE_WAIT)
ruby    1574 root   59u  IPv4            1997517      0t0     TCP localhost:36692->localhost:37735 (CLOSE_WAIT)
ruby    1574 root   60u  IPv4            1997520      0t0     TCP localhost:36692->localhost:37736 (CLOSE_WAIT)
ruby    1574 root   61u  IPv4            2000181      0t0     TCP localhost:36692->localhost:38252 (CLOSE_WAIT)
ruby    1574 root   62u  IPv4            2000191      0t0     TCP localhost:36692->localhost:38253 (CLOSE_WAIT)
ruby    1574 root   63u  IPv4            2003734      0t0     TCP localhost:36692->localhost:39124 (CLOSE_WAIT)
ruby    1574 root   64u  IPv4            2003744      0t0     TCP localhost:36692->localhost:39125 (CLOSE_WAIT)
ruby    1574 root   65u  IPv4            2007717      0t0     TCP localhost:36692->localhost:40121 (CLOSE_WAIT)
ruby    1574 root   66u  IPv4            2007724      0t0     TCP localhost:36692->localhost:40122 (CLOSE_WAIT)
ruby    1574 root   67u  IPv4            2011000      0t0     TCP localhost:36692->localhost:40759 (CLOSE_WAIT)
ruby    1574 root   68u  IPv4            2011018      0t0     TCP localhost:36692->localhost:40762 (CLOSE_WAIT)
ruby    1574 root   69u  IPv4            2013323      0t0     TCP localhost:36692->localhost:41316 (CLOSE_WAIT)
ruby    1574 root   70u  IPv4            2013593      0t0     TCP localhost:36692->localhost:41329 (ESTABLISHED)
ruby    1574 root   71u  IPv4            2016131      0t0     TCP localhost:36692->localhost:41603 (CLOSE_WAIT)
ruby    1574 root   72u  IPv4            2016134      0t0     TCP localhost:36692->localhost:41604 (CLOSE_WAIT)
ruby    1574 root   73u  IPv4            2018926      0t0     TCP localhost:36692->localhost:42167 (CLOSE_WAIT)
ruby    1574 root   74u  IPv4            2018936      0t0     TCP localhost:36692->localhost:42168 (CLOSE_WAIT)
ruby    1574 root   75u  IPv4            2020485      0t0     TCP localhost:36692->localhost:42252 (ESTABLISHED)
ruby    1574 root   76u  IPv4            2020490      0t0     TCP localhost:36692->localhost:42253 (ESTABLISHED)

Comment 13 Joe Rafaniello 2015-04-21 18:15:35 UTC

After instrumenting an appliance with logging[1], we see connects always followed later by a disconnect for the same process in the miq_fault_tolerant_vim code, until we run into a soap fault error:

[Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct.

After this fault error occurs, we only see logging of the connect at the miq_fault_tolerant_vim level, no disconnect seem to occur.  It's not clear this is causing the memory increase but it's very interesting.

We also grep'd QE's problematic appliances also exhibiting the increased broker memory usage and this error "A specified parameter was not correct" also occurs there.  We may not be recovering from this error properly and possibly deadlocking the broker.

[1]
diff --git a/lib/util/miq_fault_tolerant_vim.rb b/lib/util/miq_fault_tolerant_vim.rb
index 388a3c7..3ffd476 100644
--- a/lib/util/miq_fault_tolerant_vim.rb
+++ b/lib/util/miq_fault_tolerant_vim.rb
@@ -80,6 +81,9 @@ class MiqFaultTolerantVim
   end

   def _execute_with_broker(state)
+    msg = caller.join("\nXXX ")
+    $log.info("XXX #{__method__} #{state} #{msg}")
+
     log_header = "MIQ(#{self.class.name}._execute_with_broker)"
     retries = 0
     begin

Comment 15 Joe Rafaniello 2015-04-21 18:18:51 UTC

Grep'd from a QE appliance running ruby 2.0.0, the backtrace is nearly identical.



log/evm.log-20150419:[----] E, [2015-04-18T11:56:55.299576 #22535:a31eac] ERROR -- : [Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct.
log/evm.log-20150419-' }  Method:[rescue in deliver]
log/evm.log-20150419-[----] E, [2015-04-18T11:56:55.299645 #22535:a31eac] ERROR -- : (druby://127.0.0.1:52244) /opt/rh/cfme-gemset/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:198:in `on_fault'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/cfme-gemset/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:288:in `dispatch'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/cfme-gemset/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:192:in `invoke'
log/evm.log-20150419-(druby://127.0.0.1:52244) /var/www/miq/lib/VMwareWebService/VimService.rb:676:in `queryPerfProviderSummary'
log/evm.log-20150419-(druby://127.0.0.1:52244) /var/www/miq/lib/VMwareWebService/MiqVimPerfHistory.rb:100:in `queryProviderSummary'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1588:in `perform_without_block'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1548:in `perform'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1626:in `block (2 levels) in main_loop'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1622:in `loop'
log/evm.log-20150419-(druby://127.0.0.1:52244) /opt/rh/ruby200/root/usr/share/ruby/drb/drb.rb:1622:in `block in main_loop'
log/evm.log-20150419-(druby://127.0.0.1:52244) /var/www/miq/vmdb/lib/extensions/ar_thread.rb:22:in `block in start_with_release'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/capture/vim.rb:40:in `realtime_interval'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:88:in `block in perf_capture_intervals'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:86:in `each'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:86:in `perf_capture_intervals'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:46:in `block in perf_collect_metrics_vim'
log/evm.log-20150419-/var/www/miq/lib/util/extensions/miq-benchmark.rb:12:in `realtime_store'
log/evm.log-20150419-/var/www/miq/lib/util/extensions/miq-benchmark.rb:31:in `realtime_block'
log/evm.log-20150419-/var/www/miq/vmdb/app/models/metric/ci_mixin/capture/vim.rb:46:in `perf_collect_metrics_vim'

Comment 16 Joe Rafaniello 2015-04-21 22:30:50 UTC

Created attachment 1017159 [details]
Spreadsheet with broker memory grow as reported by top_output

Comment 17 Joe Rafaniello 2015-04-21 22:36:36 UTC

Created attachment 1017160 [details]
Graph of vim broker memory zoomed out

Shows the slow growth of both REST and VIRT memory for the vim broker process, where the growth appears to start "roughly" at the same time we last we see "_execute_with_broker" called with "on_disconnect" [1]

[1] https://github.com/ManageIQ/manageiq/blob/9adecd6b0ebc2b1e4dfb045a97f53c6d2276b3e6/lib/util/miq_fault_tolerant_vim.rb#L83

Prior to this point, we see
1) "_execute_with_broker" called with "on_connect"
2) zero or more "_execute_with_broker" called with "on_execute" 
3) "_execute_with_broker" called with "on_disconnect"

After this point, we see only:
1) "_execute_with_broker" called with "on_connect"

Comment 18 Joe Rafaniello 2015-04-21 22:48:48 UTC

Created attachment 1017162 [details]
Zoomed in graph of the vim broker memory usage

This graph is zoomed in and we can see clearly that we see the last miq_fault_tolerant_vim "on_disconnect" (and only "on_connect thereafter) occurs at 19:15:24, memory stays relatively stable but then at 19:30:55, we see VIRT memory start to rise as reported by top.

Note, it looks like my appliance has a 15 minute tcp retries timeout, see below:

http://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux

"
tcp_retries2 - INTEGER
The default value of 15 yields a hypothetical timeout of 924.6
    seconds and is a lower bound for the effective timeout.
    TCP will effectively time out at the first RTO which exceeds the
    hypothetical timeout."


# cat /proc/sys/net/ipv4/tcp_retries2
15

Comment 19 Joe Rafaniello 2015-04-21 23:01:09 UTC

Other findings:
1) we see "[Handsoap::Fault]: Handsoap::Fault { :code => 'ServerFaultCode', :reason => 'A specified parameter was not correct." error many times around the "last" time we see the miq_fault_tolerant_vim try to disconnect (2015-04-20T19:15:24.095238)

2) We see many Handsoap::Fault errors like above
A total of 9497 Handsoap::Fault in evm.log-20150421.

Of which 9494 are 'A specified parameter was not correct.' Handsoap::Fault errors.

Many thousands are before 20T19:15:24.

Comment 20 Joe Rafaniello 2015-04-21 23:06:36 UTC

Rich, I can send you any logs or graphs you may need to help.  At this point, we're trying unsuccessfully to recreate this issue with  MiqVimBrokerClient/MiqVimBrokerServer combination locally on our appliance.

Any thoughts what we can try next?

Jason is suspecting we might be hitting a broker deadlock but clearly just running this broker over night with a single vsphere causes several CLOSE_WAITs (hundreds) and VIRT memory doubles from 1000m to 2000m.  We'd like to narrow it down outside of the full worker setup but haven't had any luck yet.

Comment 24 Joe Rafaniello 2015-04-23 03:46:59 UTC

Created attachment 1017697 [details]
2015-04-22T19-58-34_zoomed_out (all time samples)

Comment 25 Joe Rafaniello 2015-04-23 03:47:45 UTC

Created attachment 1017698 [details]
2015-04-22T19-58-34_zoomed_in (1 hour before and 2 hours after last good vim broker client connect)

Comment 26 Rich Oliveri 2015-04-23 14:21:37 UTC

Did you run with SyncDebug?

Comment 28 Joe Rafaniello 2015-04-24 13:15:03 UTC

RE: using httpclient 2.2.7, the problem still exists, the symptoms of growing VIRT and RES memory and CLOSE_WAIT TCP connections on the DRb port is identical with 2.2.7 as it was with 2.5.3.3.  Note, we shipped 2.2.7 with our first release dating back to the cfme 5.2.0.37 build: https://access.redhat.com/errata/RHEA-2013:1487

Comment 29 Joe Rafaniello 2015-04-24 14:13:56 UTC

At this point, we have two major problems to fix: 

1) isolate and fix the broker lock problem

2) improve the broker client code so that the other workers can report the broker as unavailable if they're queue task times them out while they're trying to connect to vim via the broker.  The broker should not remain up and consuming memory if it's not accomplishing any useful work.  It should be killed and a new broker started.

Comment 30 Joe Rafaniello 2015-04-29 21:50:33 UTC

Here is Rich's added logging a lock timing out:

[----] E, [2015-04-27T17:35:35.153293 #1634:4cba1fc] ERROR -- : VimSyncDebug - Lock timeout: thread 78668700 has held MiqVimBroker#lockHash[vsphere_HOSTNAME_administrator] for 417 seconds

Here is the thread waiting on the lock:

[----] E, [2015-04-27T17:35:35.153405 #1634:4cba1fc] ERROR -- : VimSyncDebug - Lock acquisition: Start backtrace
[----] E, [2015-04-27T17:35:35.153522 #1634:4cba1fc] ERROR -- : /var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:368:in `connSync'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:412:in `getMiqVim'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1588:in `perform_without_block'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1548:in `perform'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1626:in `block (2 levels) in main_loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `block in main_loop'
/var/www/miq/vmdb/lib/extensions/ar_thread.rb:22:in `block in start_with_release'
[----] E, [2015-04-27T17:35:35.153667 #1634:4cba1fc] ERROR -- : VimSyncDebug - Lock acquisition: End backtrace


Here is the locking thread backtrace:

[----] E, [2015-04-27T17:35:35.153782 #1634:4cba1fc] ERROR -- : VimSyncDebug - Locking thread: Start backtrace
[----] E, [2015-04-27T17:35:35.154287 #1634:4cba1fc] ERROR -- : /opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/timeout.rb:70:in `join'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/timeout.rb:70:in `timeout'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/timeout.rb:97:in `timeout'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:823:in `request'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:726:in `post'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:280:in `dispatch'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/gems/2.0.0/bundler/gems/handsoap-daec1277652f/lib/handsoap/service.rb:192:in `invoke'
/var/www/miq/lib/VMwareWebService/VimService.rb:865:in `retrieveProperties'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:1955:in `getMoProp_local'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:1992:in `getMoProp'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:370:in `currentSession'
/var/www/miq/lib/VMwareWebService/MiqVimInventory.rb:376:in `isAlive?'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:415:in `block in getMiqVim'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:374:in `call'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:374:in `connSync'
/var/www/miq/lib/VMwareWebService/MiqVimBroker.rb:412:in `getMiqVim'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1588:in `perform_without_block'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1548:in `perform'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1626:in `block (2 levels) in main_loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `loop'
/opt/rubies/ruby-2.0.0-p643/lib/ruby/2.0.0/drb/drb.rb:1622:in `block in main_loop'
/var/www/miq/vmdb/lib/extensions/ar_thread.rb:22:in `block in start_with_release'
[----] E, [2015-04-27T17:35:35.154429 #1634:4cba1fc] ERROR -- : VimSyncDebug - Locking thread: End backtrace

Comment 31 Joe Rafaniello 2015-04-30 17:43:58 UTC

Upstream PR with Timeout.timeout re-implementation for the broker:

https://github.com/ManageIQ/manageiq/pull/2759

Comment 32 Joe Rafaniello 2015-05-05 15:58:25 UTC

See upstream PR for discussion.

Comment 33 CFME Bot 2015-05-05 16:08:17 UTC

New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/5a4b8473a45e721a79a2b46325037808e30977c7

commit 5a4b8473a45e721a79a2b46325037808e30977c7
Author:     Richard Oliveri <roliveri>
AuthorDate: Mon Apr 27 14:40:41 2015 -0400
Commit:     Richard Oliveri <roliveri>
CommitDate: Fri May 1 12:35:24 2015 -0400

    Added additional lock debugging in broker.
    
    Additional lock debugging added in order to resolve broker thread hang issue.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1207018

 lib/VMwareWebService/MiqVimBroker.rb        | 46 +++++++++++++---
 lib/VMwareWebService/MiqVimInventory.rb     |  5 +-
 lib/VMwareWebService/VimSyncDebug.rb        | 81 ++++-------------------------
 lib/VMwareWebService/broker_sync_debug.rb   | 33 ++++++++++++
 lib/VMwareWebService/vim_base_sync_debug.rb | 68 ++++++++++++++++++++++++
 5 files changed, 155 insertions(+), 78 deletions(-)
 create mode 100644 lib/VMwareWebService/broker_sync_debug.rb
 create mode 100644 lib/VMwareWebService/vim_base_sync_debug.rb

Comment 35 Nandini Chandra 2015-05-18 20:59:17 UTC

Verified the following: 

1)vim broker doesn't leak drb tcp file descriptor references 

2)Verified through top -H -p <pid of broker>,that the broker process has 20 threads and that the number of threads doesn't grow to several hundreds. 

3)My appliance has been running for 5 days,there are no proxy errors and the web UI has remained accessible since then.

Verified in 5.6.0.0.26

Comment 37 errata-xmlrpc 2015-06-16 12:56:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1100.html

Note You need to log in before you can comment on or make changes to this bug.