Bug 1155679

Summary: virt-who performance issue for large ESX installation
Product: Red Hat Enterprise Linux 6 Reporter: Thom Carlin <tcarlin>
Component: virt-who Assignee: Radek Novacek <rnovacek>
Status: CLOSED ERRATA QA Contact: gaoshang <sgao>
Severity: high Docs Contact:
Priority: high    
Version: 6.6 CC: ayersmj, bkearney, cww, edanilch, gxing, hsun, jkurik, jsvarova, liliu, mmccune, ovasik, rbalakri, rnovacek, shihliu, tcarlin, ttracy, wlehman, xdmoon
Target Milestone: rc Keywords: Reopened, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
URL: http://projects.theforeman.org/issues/8279
Whiteboard:
Fixed In Version: virt-who-0.12-4.el6 Doc Type: Bug Fix
Doc Text:
Previously, the virt-who agent was too slow when reading the association between hosts and guests from VMware ESX systems. As a consequence, when communicating with large ESX (or vCenter) deployments, it took a long time to send updates about virtual guests to the Subscription Asset Manager (SAM) and Red Hat Satellite. With this update, virt-who uses an improved method to obtain the host-guest association, which speeds up this process.
Story Points: ---
Clone Of:
: 1197596 (view as bug list) Environment:
Last Closed: 2015-07-06 08:42:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1075802, 1154684, 1168221, 1197596    
Attachments:
Description Flags
virt-who.0-12-esx-rhsm.log none

Description Thom Carlin 2014-10-22 15:32:41 UTC
Description of problem:

Customer has around 100 ESX hypervisors in 1 vCenter.  Running virt-who --one-shot takes approximately one hour to run.


Version-Release number of selected component (if applicable):

6.6

How reproducible:

Every time

Steps to Reproduce:
1. virt-who --debug --one-shot --esx [and esx parameters]

Actual results:

1 hour passes

Expected results:

Under a minute to run

Additional info:

virt-who is needed for Sat6 Unlimited Guest licensing

Comment 2 Tom McKay 2014-11-05 12:57:11 UTC
Created redmine issue http://projects.theforeman.org/issues/8279 from this bug

Comment 10 Radek Novacek 2015-02-26 09:46:31 UTC
This bug is addressed by upstream and will be fixed by rebase in RHEL-6.7.

Comment 11 Radek Novacek 2015-02-27 19:35:32 UTC
Fixed by rebase to virt-who-0.12-1.el6.

Comment 14 xingge 2015-03-10 09:52:01 UTC
Hi,
  We want to verify this bug, but we don't have an environment for performance testing; we only have two ESX machines. We researched how to do performance testing with two machines but did not succeed. Can someone else help us with the performance testing?

Comment 15 Jan Kurik 2015-03-10 10:26:58 UTC
Performance testing is typically done by the Performance Engineering team https://home.corp.redhat.com/wiki/performance-engineering-rhel . I am not sure whether they can do the sort of testing you need; however, we can at least ask Thomas.

@Thomas: could you please help xingge with the testing, if possible?

Comment 16 Tom Tracy 2015-03-10 12:25:06 UTC
Jan
     We do not have the infrastructure to test an environment like this. I do not know who has esx machines capable of doing this testing. 

Tom

Comment 17 William 2015-04-02 16:36:36 UTC
Created attachment 1010250
virt-who.0-12-esx-rhsm.log

Tested in customer environment. Getting a large amount of noise in the logs now.

Comment 18 Radek Novacek 2015-04-07 05:55:09 UTC
Setting VIRTWHO_DEBUG=0 in /etc/sysconfig/virt-who should reduce the noise. If not, I can remove some of the logging messages.

Comment 19 Radek Novacek 2015-04-07 08:06:56 UTC
I've investigated the error messages more deeply. It looks like they are caused by ESXi/vCenter splitting the guest list into multiple parts when it is large enough; virt-who reads only the first part and therefore produces an incomplete host/guest association. It probably fixes itself during the next iteration, but it should still be corrected.
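
The failure mode described in this comment can be illustrated with a minimal Python sketch. It does not use the real vSphere API or the virt-who code: `Page`, `fetch_page`, `make_fake_server`, and the `token` field are hypothetical stand-ins for whatever paged call the server exposes. The sketch only contrasts reading the first part with following continuation tokens until the result is complete.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Page(NamedTuple):
    """One part of a paged result (hypothetical shape, not a vSphere type)."""
    mapping: Dict[str, List[str]]  # host UUID -> guest UUIDs in this part
    token: Optional[str]           # None when there are no further parts


FetchPage = Callable[[Optional[str]], Page]


def read_first_page_only(fetch_page: FetchPage) -> Dict[str, List[str]]:
    """Problematic pattern: ignores the token, so the mapping may be incomplete."""
    return dict(fetch_page(None).mapping)


def read_all_pages(fetch_page: FetchPage) -> Dict[str, List[str]]:
    """Follow continuation tokens and merge every part into one mapping."""
    result: Dict[str, List[str]] = {}
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        for host, guests in page.mapping.items():
            # A host's guests may be spread over several parts, so extend.
            result.setdefault(host, []).extend(guests)
        token = page.token
        if token is None:
            return result


def make_fake_server(pages: List[Dict[str, List[str]]]) -> FetchPage:
    """Test double that serves the given parts in order (assumption, not real API)."""
    def fetch_page(token: Optional[str]) -> Page:
        index = 0 if token is None else int(token)
        next_token = str(index + 1) if index + 1 < len(pages) else None
        return Page(pages[index], next_token)
    return fetch_page


if __name__ == "__main__":
    server = make_fake_server([
        {"host-a": ["guest-1", "guest-2"]},
        {"host-a": ["guest-3"], "host-b": ["guest-4"]},
    ])
    print(read_first_page_only(server))  # only the first part
    print(read_all_pages(server))        # all parts merged
```

With the two-part fake server above, the first function misses `guest-3` and all of `host-b`, which matches the kind of incomplete association described here.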

I'll make an updated version for testing. Moving the bug to ASSIGNED for now.

Comment 20 Radek Novacek 2015-04-07 10:22:06 UTC
Fixed in virt-who-0.12-4.el6.

Please try the following build; it should fix the "noise" in the logs:

https://brewweb.devel.redhat.com/taskinfo?taskID=8949195

Comment 22 Liushihui 2015-06-12 02:40:57 UTC
Hi William,

Would you mind helping us verify it again in the customer environment, since we still don't have an environment for performance testing? Thanks.

Liushihui

Comment 25 xingge 2015-06-25 11:52:02 UTC
As we do not have an environment to test this bug, I'll close it. If the bug shows up again in the customer's environment, please reopen it.

Comment 26 xingge 2015-06-25 13:08:46 UTC
Thom Carlin reached out to the RHC project managers so they can contact the 2 customers to verify the bug, so I am reopening it to wait for the results.

Comment 27 xingge 2015-07-01 07:54:40 UTC
Hi Thom,
  Since this bug is listed in the virt-who errata [1], which needs to be pushed out as soon as possible along with the RHEL 6.7 release, and it is currently the only bug in the errata [1] that is not verified, could you please help coordinate getting it verified at your earliest convenience so that we can push out the errata on time? Thanks!

Comment 28 Liushihui 2015-07-02 04:31:51 UTC
Summary:
As we don't have an environment with many hosts/guests, we only checked it with three hosts and one guests. We only verified the response time for virt-who to get data from vCenter. The response time in the new version, virt-who-0.12-10.el6.noarch, is much lower than in the old version, virt-who-0.10-8.el6.noarch.
The problem of redundant log noise still needs Thom Carlin's help to verify.

In the older version, virt-who-0.10-8.el6.noarch:
The time virt-who takes to get data from vCenter is about 8s; please see the following log:
2015-07-02 11:15:22,370 [INFO]  @virtwho.py:442 - Using virt-who configuration: virt-who
2015-07-02 11:15:22,370 [DEBUG]  @virtwho.py:170 - Starting infinite loop with 3600 seconds interval
2015-07-02 11:15:22,519 [INFO]  @esx.py:134 - start scan
2015-07-02 11:15:27,898 [INFO]  @esx.py:164 - start get comput resource
2015-07-02 11:15:28,149 [INFO]  @esx.py:196 - start get host
2015-07-02 11:15:28,378 [INFO]  @esx.py:218 - start get vm
2015-07-02 11:15:30,875 [INFO]  @esx.py:325 - end scan
2015-07-02 11:15:30,875 [INFO]  @esx.py:327 - ==== time 8 
========================the response time is 8s============================
2015-07-02 11:15:30,913 [INFO]  @subscriptionmanager.py:119 - Sending update in hosts-to-guests mapping: {564d4b40-300e-98fd-fc79-5ae3c1547e2d: [], aee4ff00-8c33-11e2-994a-6c3be51d959a: [564dab7d-3b72-51a1-eeda-586036106892, 4227d611-0abc-fc5e-7538-07ebd83fa9ba, 421aa84b-a49e-e01c-fa72-0b570372dd9d, 564daa52-c518-b2f0-3c05-a343285910e1]}

In the new version, virt-who-0.12-10.el6.noarch:
The time virt-who takes to get data from vCenter is less than 1s; please see the following log:
2015-07-02 12:17:41,300 [INFO]  @virtwho.py:563 - Using configuration "env/cmdline" ("esx" mode)
2015-07-02 12:17:41,324 [DEBUG]  @virtwho.py:151 - Starting infinite loop with 3600 seconds interval
2015-07-02 12:17:41,463 [DEBUG]  @esx.py:56 - Log into ESX
2015-07-02 12:17:42,554 [DEBUG]  @esx.py:59 - Creating ESX event filter
2015-07-02 12:17:42,903 [INFO]  @esx.py:186 - end scan
2015-07-02 12:17:42,912 [INFO]  @esx.py:138 - ==== time 0 
========================the response time is 0s============================
2015-07-02 12:17:42,912 [DEBUG]  @esx.py:142 - Waiting for ESX changes
2015-07-02 12:17:42,923 [INFO]  @subscriptionmanager.py:124 - Sending update in hosts-to-guests mapping: {564d4b40-300e-98fd-fc79-5ae3c1547e2d: [], aee4ff00-8c33-11e2-994a-6c3be51d959a: [564dab7d-3b72-51a1-eeda-586036106892, 4227d611-0abc-fc5e-7538-07ebd83fa9ba, 421aa84b-a49e-e01c-fa72-0b570372dd9d, 564daa52-c518-b2f0-3c05-a343285910e1], 564d9e7a-4128-92b6-7284-6335f6b399be: []}
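
The "==== time" lines in the logs above report whole seconds only. For a finer comparison, the elapsed time can be computed from the log timestamps themselves. The following is a minimal Python sketch; the helper names (`parse_timestamp`, `elapsed_seconds`) are made up for illustration, and it assumes every line starts with a timestamp in the format shown in these logs. The marker strings passed in are the caller's choice; "start scan" and "end scan" are taken from the older-version log.

```python
from datetime import datetime
from typing import Iterable, Optional

# Timestamp layout used at the start of each log line, e.g. "2015-07-02 11:15:22,370".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TIMESTAMP_LENGTH = len("2015-07-02 11:15:22,370")


def parse_timestamp(line: str) -> datetime:
    """Parse the leading timestamp of one log line."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)


def elapsed_seconds(lines: Iterable[str], start_marker: str,
                    end_marker: str) -> Optional[float]:
    """Seconds between the first line containing start_marker and the next
    line containing end_marker; None if either marker is missing."""
    start = None
    for line in lines:
        if start is None and start_marker in line:
            start = parse_timestamp(line)
        elif start is not None and end_marker in line:
            return (parse_timestamp(line) - start).total_seconds()
    return None


# Lines copied from the older-version log in this comment.
OLD_LOG = [
    "2015-07-02 11:15:22,519 [INFO]  @esx.py:134 - start scan",
    "2015-07-02 11:15:27,898 [INFO]  @esx.py:164 - start get comput resource",
    "2015-07-02 11:15:28,149 [INFO]  @esx.py:196 - start get host",
    "2015-07-02 11:15:28,378 [INFO]  @esx.py:218 - start get vm",
    "2015-07-02 11:15:30,875 [INFO]  @esx.py:325 - end scan",
]

if __name__ == "__main__":
    print(elapsed_seconds(OLD_LOG, "start scan", "end scan"))
```

The newer-version log has no "start scan" line, so a different start marker would have to be chosen there; which line best represents the start of the scan is a judgment call.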

Comment 29 Eko 2015-07-06 08:42:35 UTC
Because we need to push the errata according to the schedule, I am changing this bug to CLOSED status; we will test the performance issue on RHEL 7.2.

Comment 30 Eko 2015-07-06 09:25:46 UTC
We don't have an environment to test the performance issue, but we did some analysis of the source code, and it contains some enhancements for this issue. Closing this bug; we will continue to verify this issue on RHEL 7.2.