Description of problem:
Unable to run smartstate analysis of an ESXi 5.5 host; it gets stuck on refreshing firewall rules and later times out after 1200 seconds.

Version-Release number of selected component (if applicable):
5.4.0.0.14

How reproducible:
Randomly against ESXi 5 or 5.5 (VIM)

Steps to Reproduce:
1. Add vmware 5/5.5 provider
2. Add creds to one of the hosts
3. Run smartstate analysis of that host
4. Check Configuration > Tasks - there is a task stuck at "Refreshing Firewall Rules"

Actual results:
Smartstate analysis task gets stuck and times out after 1200 seconds.

Expected results:
Smartstate analysis of a vsphere 5/5.5 host works.

Additional info:
--IMPORTANT-- I was able to get it to work after running 'service evmserverd restart', but after some time it stops working.

The following message showed up in evm.log when I tried to restart evmserverd and it got stuck waiting for the 1200 second timeout of the worker:

[----] E, [2015-04-09T15:39:03.694511 #11543:badeac] ERROR -- : MIQ(MiqFaultTolerantVim._connect) EMS: [vSphere 5.5] [Broker] Unable to connect to: [<vsphere 5.5 IP>] because Broker is not available (connection error).

When I'm not trying to restart evmserverd, I get just:

[----] E, [2015-04-09T05:32:05.950003 #7847:adbeac] ERROR -- : MIQ(MiqQueue.deliver) Message id: [52810], timed out after 1200.00381906 seconds. Timeout threshold [1200]
This does not happen with 5.3.z running on the same provider / in the same network.
I think this is a symptom of bug 1207018; we need to retest when we have a fix for it.
Assigning to Joe Rafaniello who is investigating bug 1207018.
Jan, I have identified the most common symptoms in bug 1207018. That bug we're still tracking down but I've seen the broker function normally for several hours before it starts leaking. As long as you don't have these symptoms, you can run your test scenario and be sure it's not that bug causing your problem.

Symptoms: CLOSE_WAIT TCP connections on the MiqVimBrokerWorker's DRb port.

To get the DRb port of the broker:

# bin/rake evm:status |grep Broker
 MiqVimBrokerWorker | started | 3554 | 20903 | 21028 | druby://127.0.0.1:47577 | 2015-04-23T17:36:15Z | 2015-04-23T17:39:53Z

The port is 47577 in this case. As long as lsof is only showing ESTABLISHED or LISTEN, it's fine to do your test:

# lsof -iTCP | grep 47577
ruby 20820 root 22u IPv4 5671690 0t0 TCP localhost:46273->localhost:47577 (ESTABLISHED)
ruby 20820 root 23u IPv4 5672454 0t0 TCP localhost:46441->localhost:47577 (ESTABLISHED)
ruby 20824 root 22u IPv4 5672425 0t0 TCP localhost:46439->localhost:47577 (ESTABLISHED)
ruby 20824 root 23u IPv4 5671721 0t0 TCP localhost:46282->localhost:47577 (ESTABLISHED)
ruby 20843 root 22u IPv4 5670427 0t0 TCP localhost:46066->localhost:47577 (ESTABLISHED)
ruby 20843 root 23u IPv4 5670435 0t0 TCP localhost:46068->localhost:47577 (ESTABLISHED)
ruby 20903 root 20u IPv4 5670056 0t0 TCP localhost:47577 (LISTEN)
ruby 20903 root 23u IPv4 5670428 0t0 TCP localhost:47577->localhost:46066 (ESTABLISHED)
ruby 20903 root 24u IPv4 5672426 0t0 TCP localhost:47577->localhost:46439 (ESTABLISHED)
ruby 20903 root 25u IPv4 5670436 0t0 TCP localhost:47577->localhost:46068 (ESTABLISHED)
ruby 20903 root 26u IPv4 5672455 0t0 TCP localhost:47577->localhost:46441 (ESTABLISHED)
ruby 20903 root 28u IPv4 5671691 0t0 TCP localhost:47577->localhost:46273 (ESTABLISHED)
ruby 20903 root 29u IPv4 5671722 0t0 TCP localhost:47577->localhost:46282 (ESTABLISHED)
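The manual check above could be scripted roughly as follows. This is only a sketch: the function names broker_port and count_close_wait are hypothetical and not part of the appliance tooling, and it assumes the evm:status and lsof output look like the samples in this comment. Both functions read from stdin, so they can be tried against saved output.

```shell
#!/bin/sh
# Sketch only: broker_port and count_close_wait are hypothetical helper names,
# not part of the appliance. They assume the output formats shown in this bug.

# Read "bin/rake evm:status" output on stdin and print the DRb port of the
# first MiqVimBrokerWorker line (the number after druby://<host>:).
broker_port() {
    grep MiqVimBrokerWorker \
        | sed -n 's|.*druby://[^:]*:\([0-9][0-9]*\).*|\1|p' \
        | head -n 1
}

# Read "lsof -iTCP" output on stdin and print how many connections involving
# the given port are in CLOSE_WAIT. Prints 0 when there are none.
count_close_wait() {
    awk -v port="$1" '
        # Match ":<port>" followed by a non-digit or end of line, so that
        # port 47577 does not also match 475770.
        $0 ~ (":" port "([^0-9]|$)") && /CLOSE_WAIT/ { n++ }
        END { print n + 0 }
    '
}
```

On an appliance this might be used as: port=$(bin/rake evm:status | broker_port); lsof -iTCP | count_close_wait "$port". A result of 0 would mean only states other than CLOSE_WAIT (such as ESTABLISHED or LISTEN) were seen on the broker port.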
Dave, see comment 5. Note that comment 5 forgot to mention that lsof showing CLOSE_WAIT TCP connections on the broker's DRb (druby) port is the clear sign that you have hit bug 1207018. As long as you don't see this, you should be able to recreate the "ESXi host smartstate analysis fails" issue, provide logs, and get it fixed without concern for the broker bug. Additionally, I have only seen bug 120718 occur when vmware capacity and utilization is enabled, so if you disable cap & u and run your smartstate analysis, you should be able to track down the issue in this bug. I am very confident that "broker is unavailable" would not be related to the CLOSE_WAIT/drb bug if you disable cap & u for your test.
typo, bug 120718, should have been bug 1207018
Working in 5.4.0.0.24.20150427192818_1fd9e49 for vSphere 5 and 5.5. I believe the earlier failure was due to the leaky file descriptor bug.
Clearing needinfo
Awesome, thanks Dave/Thom!