1210480 – ESXi host smartstate analysis fails

Summary: ESXi host smartstate analysis fails

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	SmartState Analysis
Sub Component:
Version:	5.4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.4.0
Assignee:	Joe Rafaniello
QA Contact:	Dave Johnson
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-04-09 20:04 UTC by Jan Krocil
Modified:	2015-04-30 17:42 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-04-30 17:35:57 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1207018	0	urgent	CLOSED	[perf] vim broker leaking drb tcp file descriptor references	2021-02-22 00:41:40 UTC

Internal Links: 1207018

Description Jan Krocil 2015-04-09 20:04:42 UTC

Description of problem:
Unable to run smartstate analysis of an ESXi 5.5 host; the task gets stuck on refreshing firewall rules and times out after 1200 seconds.

Version-Release number of selected component (if applicable):
5.4.0.0.14

How reproducible:
Intermittently, against ESXi 5 or 5.5 (VIM)

Steps to Reproduce:
1. Add a VMware vSphere 5/5.5 provider
2. Add credentials to one of the hosts
3. Run smartstate analysis of that host
4. Check Configuration > Tasks; there is a task stuck at "Refreshing Firewall Rules"

Actual results:
The smartstate analysis task gets stuck and times out after 1200 seconds.

Expected results:
Smartstate analysis of a vSphere 5/5.5 host works.

Additional info:

--IMPORTANT--
I was able to get it to work after running 'service evmserverd restart', but after some time it stops working again.

The following message showed up in evm.log when I tried to restart evmserverd and the restart got stuck waiting for the worker's 1200-second timeout:
[----] E, [2015-04-09T15:39:03.694511 #11543:badeac] ERROR -- : MIQ(MiqFaultTolerantVim._connect)
 EMS: [vSphere 5.5] [Broker] Unable to connect to: [<vsphere 5.5 IP>] because Broker is not available (connection error).

When I'm not restarting evmserverd, I only get:
[----] E, [2015-04-09T05:32:05.950003 #7847:adbeac] ERROR -- : MIQ(MiqQueue.deliver)    Message id: [52810], timed out after 1200.00381906 seconds.  Timeout threshold [1200]
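
To see how often the queue timeout above occurs, a filter along these lines could be run over evm.log. This is only a sketch: the function name queue_timeouts is made up for illustration, and it assumes every timeout entry has the same layout as the MiqQueue.deliver line quoted above.

```shell
# Print "message_id seconds" for each MiqQueue.deliver timeout read from stdin.
# Assumption: entries look like the sample line quoted in this report.
queue_timeouts() {
  sed -n 's/.*MIQ(MiqQueue\.deliver).*Message id: \[\([0-9]*\)], timed out after \([0-9.]*\) seconds.*/\1 \2/p'
}
```

It reads from standard input, so the log can be piped in from wherever evm.log lives on the appliance.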

Comment 2 Jan Krocil 2015-04-09 20:24:19 UTC

This does not happen with 5.3.z running on the same provider / in the same network.

Comment 3 Dave Johnson 2015-04-17 18:40:44 UTC

I think this is a symptom of bug 1207018; we need to retest when we have a fix for it.

Comment 4 Oleg Barenboim 2015-04-20 18:53:43 UTC

Assigning to Joe Rafaniello who is investigating bug 1207018.

Comment 5 Joe Rafaniello 2015-04-23 17:45:38 UTC

Jan, I have identified the most common symptoms of bug 1207018.  We're still tracking that bug down, but I've seen the broker function normally for several hours before it starts leaking.  As long as you don't see these symptoms, you can run your test scenario and be sure that bug isn't causing your problem.

Symptoms:
CLOSE_WAIT TCP connections on the MiqVimBrokerWorker's DRb port.

To get the DRb port of the broker:
# bin/rake evm:status | grep Broker

MiqVimBrokerWorker                 | started | 3554 | 20903 | 21028 | druby://127.0.0.1:47577 | 2015-04-23T17:36:15Z | 2015-04-23T17:39:53Z

The port is 47577 in this case.
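
The port lookup can also be scripted. The helper below is a sketch, not part of the product: broker_drb_port is a hypothetical name, and it assumes the status line carries a druby://HOST:PORT column exactly as in the sample output above.

```shell
# Extract the DRb port of the MiqVimBrokerWorker from `rake evm:status`
# output read on stdin.
# Assumption: the URI column has the form druby://HOST:PORT.
broker_drb_port() {
  grep 'MiqVimBrokerWorker' | grep -o 'druby://[^ |]*' | sed 's/.*://'
}
```

Usage would be something like: bin/rake evm:status | broker_drb_port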

As long as lsof is only showing ESTABLISHED or LISTEN, it's fine to do your test:

# lsof -iTCP | grep 47577
ruby      20820      root   22u  IPv4 5671690      0t0  TCP localhost:46273->localhost:47577 (ESTABLISHED)
ruby      20820      root   23u  IPv4 5672454      0t0  TCP localhost:46441->localhost:47577 (ESTABLISHED)
ruby      20824      root   22u  IPv4 5672425      0t0  TCP localhost:46439->localhost:47577 (ESTABLISHED)
ruby      20824      root   23u  IPv4 5671721      0t0  TCP localhost:46282->localhost:47577 (ESTABLISHED)
ruby      20843      root   22u  IPv4 5670427      0t0  TCP localhost:46066->localhost:47577 (ESTABLISHED)
ruby      20843      root   23u  IPv4 5670435      0t0  TCP localhost:46068->localhost:47577 (ESTABLISHED)
ruby      20903      root   20u  IPv4 5670056      0t0  TCP localhost:47577 (LISTEN)
ruby      20903      root   23u  IPv4 5670428      0t0  TCP localhost:47577->localhost:46066 (ESTABLISHED)
ruby      20903      root   24u  IPv4 5672426      0t0  TCP localhost:47577->localhost:46439 (ESTABLISHED)
ruby      20903      root   25u  IPv4 5670436      0t0  TCP localhost:47577->localhost:46068 (ESTABLISHED)
ruby      20903      root   26u  IPv4 5672455      0t0  TCP localhost:47577->localhost:46441 (ESTABLISHED)
ruby      20903      root   28u  IPv4 5671691      0t0  TCP localhost:47577->localhost:46273 (ESTABLISHED)
ruby      20903      root   29u  IPv4 5671722      0t0  TCP localhost:47577->localhost:46282 (ESTABLISHED)
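
Rather than reading the lsof listing by eye, the states can be tallied. This is a rough sketch under stated assumptions: drb_port_states is a hypothetical helper, and it expects lsof output in the same layout as above, with the TCP state in parentheses as the last field.

```shell
# Tally TCP states for one port from `lsof -iTCP` output read on stdin.
# Prints one "STATE COUNT" line per state (order not guaranteed) and
# returns non-zero if any CLOSE_WAIT connection touches the port.
# Assumption: the last field of each line is the state in parentheses.
drb_port_states() {
  awk -v port="$1" '
    # Match ":PORT" only when it is not followed by another digit.
    $0 ~ (":" port "([^0-9]|$)") && $NF ~ /^\(.*\)$/ {
      state = $NF
      gsub(/[()]/, "", state)
      count[state]++
      if (state == "CLOSE_WAIT") leak = 1
    }
    END {
      for (s in count) print s, count[s]
      exit leak + 0
    }
  '
}
```

For example: lsof -nP -iTCP | drb_port_states 47577. For the listing above this would report only ESTABLISHED and LISTEN and return zero, which is the healthy case.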

Comment 6 Joe Rafaniello 2015-04-24 13:54:15 UTC

Dave, see comment 5... 

Note: comment 5 forgot to mention that lsof showing CLOSE_WAIT TCP connections on the broker's DRb (druby) port is the clear sign that you have hit bug 1207018.  As long as you don't see this, you should be able to recreate the "ESXi host smartstate analysis fails" issue, provide logs, and get it fixed without worrying about the broker bug.

Additionally, I have only seen bug 120718 occur when VMware capacity and utilization (C&U) is enabled, so if you disable C&U and run your smartstate analysis, you should be able to track down the issue in this bug.  I am very confident that "Broker is not available" would not be related to the CLOSE_WAIT/DRb bug if you disable C&U for your test.

Comment 7 Joe Rafaniello 2015-04-24 13:55:33 UTC

Typo: bug 120718 should have been bug 1207018.

Comment 8 Thom Carlin 2015-04-30 17:35:57 UTC

Working in 5.4.0.0.24.20150427192818_1fd9e49 for vSphere 5 and 5.5.  I believe the failure was due to the leaky file descriptor bug.

Comment 9 Dave Johnson 2015-04-30 17:37:48 UTC

Clearing needinfo

Comment 10 Joe Rafaniello 2015-04-30 17:42:05 UTC

Awesome, thanks Dave/Thom!
