Bug 1515906 - Performance regression in 5.9.0 while provisioning 50 VMs on RHEV 3.5
Summary: Performance regression in 5.9.0 while provisioning 50 VMs on RHEV 3.5
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: dmetzger
QA Contact: Pradeep Kumar Surisetty
URL:
Whiteboard: testathon
Depends On:
Blocks:
 
Reported: 2017-11-21 15:14 UTC by Pradeep Kumar Surisetty
Modified: 2017-12-26 14:59 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-26 12:32:03 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments
5.8.23 appliance memory utilization (14.74 KB, image/png), 2017-11-22 15:29 UTC, dmetzger
5.9.0 appliance memory utilization (17.35 KB, image/png), 2017-11-22 15:31 UTC, dmetzger

Description Pradeep Kumar Surisetty 2017-11-21 15:14:06 UTC
Description of problem:

1) Added RHEV 3.5.5 to CFME 5.8.23. Provisioned 50 VMs using a template and measured the time: it took 13 minutes to complete provisioning of the 50 VMs.
2) Added RHEV 3.5.5 to CFME 5.9.0. Provisioned 50 VMs using the same template on the same provider and measured the time to completion: it took more than 20 minutes to provision the 50 VMs.

This looks like a performance regression.

Noticed the following message counts in evm.log during this interval:

5.9:
-----
     50 Command: [ManageIQ::Providers::Redhat::InfraManager::Provision.poll_clone_complete]
    450 Command: [MiqAeEngine.deliver]

5.8:
-----
     50 Command: [ManageIQ::Providers::Redhat::InfraManager::Provision.execute]
     48 Command: [ManageIQ::Providers::Redhat::InfraManager::Provision.poll_clone_complete]
    418 Command: [MiqAeEngine.deliver]
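
For reference, per-command counts like the above can be tallied with a short Ruby script along these lines. This is a minimal sketch that assumes the default appliance log path and the literal "Command: [...]" token shown in the excerpts; adjust both as needed.

  # Tally "Command: [...]" occurrences in evm.log and print them count-first,
  # matching the excerpts above. The log path and pattern are assumptions.
  counts = Hash.new(0)

  File.foreach("/var/www/miq/vmdb/log/evm.log") do |line|
    if (match = line.match(/Command: \[([^\]]+)\]/))
      counts[match[1]] += 1
    end
  end

  counts.sort_by { |_, count| -count }.each do |command, count|
    printf("%7d Command: [%s]\n", count, command)
  end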



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Provision 50 VMs on RHEV from CFME 5.8.23, then repeat from CFME 5.9.0, and compare the provisioning times.

Actual results:

5.9.0 takes more time (over 20 minutes) than 5.8.23 (13 minutes) to provision the 50 VMs.

Expected results:



Additional info:

Comment 3 Dave Johnson 2017-11-21 15:43:50 UTC
Please assess the impact of this issue and update the severity accordingly.  Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition.

If it's something like a tracker bug where it doesn't matter, please set the severity to Low.

Comment 4 Greg McCullough 2017-11-21 16:14:08 UTC
Pradeep - The poll_clone_complete logging you point out in the description would mean that we are waiting longer on the provider.

Can you confirm:
1) You are provisioning the same template to the same provider
2) The provider is under the same basic load during the test
3) The VMs are being cloned to the same storage?
4) Only one CFME appliance is performing provisioning during the test?

Comment 5 Greg McCullough 2017-11-21 16:42:44 UTC
Another suggestion would be to perform the same test on a smaller set of VMs, say 5 at a time, and get timings from that.

I would highly recommend **not** using auto-placement.   You want to use the same host/storage combination for all tests to ensure comparable performance at the provider level.

Comment 6 Dave Johnson 2017-11-21 16:43:58 UTC
Please assess the impact of this issue and update the severity accordingly.  Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition.

If it's something like a tracker bug where it doesn't matter, please set the severity to Low.

Comment 7 Pradeep Kumar Surisetty 2017-11-21 16:44:57 UTC
(In reply to Greg McCullough from comment #4)
> Pradeep - The poll_clone_complete logging you point out in the description
> would mean that we are waiting longer on the provider.
> 
> Can you confirm:
> 1) You are provisioning the same template to the same provider
yes
> 2) The provider is under the same basic load during the test
yes. 
> 3) The VMs are being cloned to the same storage?
yes
> 4) Only one CFME appliance is performing provisioning during the test?
yes

Comment 8 Dave Johnson 2017-11-21 17:02:42 UTC
Please assess the impact of this issue and update the severity accordingly.  Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition.

If it's something like a tracker bug where it doesn't matter, please set the severity to Low.

Comment 9 Pradeep Kumar Surisetty 2017-11-22 04:44:04 UTC
(In reply to Greg McCullough from comment #5)
> Another additional suggestion would be to perform the same test on a smaller
> set of VMs, say 5 at a time, and get timings from that.

With 5 VMs, it took a similar time on both appliances.
> 
> I would highly recommend **not** using auto-placement.   You want to use the
> same host/storage combination for all tests to ensure comparable performance
> at the provider level.

Not choosing auto placement

Comment 10 Pradeep Kumar Surisetty 2017-11-22 09:30:03 UTC
Used the same host for both experiments (with both appliances).

Version         Number of VMs     Time taken for provision   
---------       --------------    -----------------------    
5.8.23             30                    8 min
5.9.0              30                   13.54 min

5.8.23              5                    3 min
5.9.0               5                    3 min 


Generic and C&U Data Collector worker counts were increased to 4, and the memory threshold of these workers was increased to 800 MB.

Comment 11 Greg McCullough 2017-11-22 11:55:50 UTC
Tina - Can you review this and maybe reach out to Dennis for assistance.  One thought would be to disable the server roles that provisioning is not using, mainly C&U roles, Database Operations, Reporting and SmartState Analysis.

My guess at the moment is this is more of an overall appliance issue that is more noticeable in provisioning, but not directly a provisioning problem.  If this is the case I would recommend this ticket be assigned to the performance team and we can assist them as needed.

Comment 12 Tina Fitzgerald 2017-11-22 14:21:04 UTC
Dennis said he's going to do a quick triage and get back to me.

Comment 13 dmetzger 2017-11-22 15:13:45 UTC
Can this be retested after applying the change discussed in https://github.com/ManageIQ/manageiq/issues/16432?

Comment 14 dmetzger 2017-11-22 15:26:46 UTC
Also, can the test be repeated on 5.9.0 with C&U turned off? The 5.9.0 logs show an issue with workers exceeding their memory threshold, which was not happening in 5.8.23 (only 2 exceptions there):

Count Worker
3,918 ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker
1,970 MiqGenericWorker

Comment 15 dmetzger 2017-11-22 15:29:58 UTC
Created attachment 1357619 [details]
5.8.23 appliance memory utilization

Comment 16 dmetzger 2017-11-22 15:31:18 UTC
Created attachment 1357625 [details]
5.9.0 appliance memory utilization

Comment 17 dmetzger 2017-11-22 15:34:24 UTC
The 5.8.23 appliance experienced 2 worker memory threshold exceptions during the test, while the 5.9.0 appliance experienced 5,888 memory threshold exceptions.

5.8.23:
         MiqGenericWorker, 2
         Total Exceptions, 2

5.9.0:
         ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker, 3,918
         MiqGenericWorker, 1,970
         Total Exceptions, 5,888
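
A sketch of how per-worker exception counts like these could be pulled from evm.log. The "exceeded memory" message text and the worker-class pattern matched below are assumptions and may differ between releases, so adjust them to the actual log wording.

  # Count memory-threshold exceptions per worker class in evm.log.
  # The message text and the worker-class regex are assumptions; adjust as needed.
  exceptions = Hash.new(0)

  File.foreach("/var/www/miq/vmdb/log/evm.log") do |line|
    next unless line =~ /exceeded memory/i
    worker = line[/\b(?:ManageIQ::[\w:]+Worker|Miq\w+Worker)\b/] || "unknown"
    exceptions[worker] += 1
  end

  exceptions.sort_by { |_, count| -count }.each do |worker, count|
    puts format("%-66s %7d", worker, count)
  end
  puts format("%-66s %7d", "Total Exceptions", exceptions.values.inject(0, :+))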

Comment 18 dmetzger 2017-11-22 15:54:50 UTC
The logs (workers) seem to indicate the 5.8 and 5.9 tests were not against the same provider; we seem to have RHEV vs. OpenStack:

5.8.23 Max PSS By Worker Type:
    350,048,000  C&U Metrics Collector for RHEV
    348,374,000  C&U Metrics Processor
    650,667,000  Ems 1
    630,314,000  Ems 2
    278,256,000  Event Handler
    800,623,000  Event Monitor for Providers: perfc
    470,741,000  Event Monitor for Providers: RHHI Test One
    438,083,000  Event Monitor for Providers: RHHI Test Two
    839,382,000  Generic Worker
    832,181,000  MIQ Server
    397,308,000  Priority Worker
  1,000,211,000  Refresh Worker for Providers: perfc
    745,980,000  Refresh Worker for Providers: RHHI Test One
    709,486,000  Refresh Worker for Providers: RHHI Test Two
    266,351,000  Reporting Worker
    231,704,000  Schedule Worker
    465,425,000  User Interface Worker
    261,528,000  Web Services Worker
    219,711,000  Websocket Worker

5.9.0 Max PSS By Worker Type:
    383,076,000  C&U Metrics Collector for Openstack
    233,342,000  C&U Metrics Collector for Openstack Network
    920,028,000  C&U Metrics Collector for RHEV
    410,289,000  C&U Metrics Processor
    330,071,000  Event Handler
    609,768,000  Event Monitor for Provider: boston-rhev
    806,575,000  Event Monitor for Provider: boston-rhevm
    230,397,000  Event Monitor for Provider: mgmt-int-osp
    227,684,000  Event Monitor for Provider: mgmt-int-osp Network Manager
    842,681,000  Event Monitor for Provider: perfc
    905,620,000  Generic Worker
    833,942,000  MIQ Server
    487,529,000  Priority Worker
    630,986,000  Refresh Worker for Provider: boston-rhev
    853,628,000  Refresh Worker for Provider: boston-rhevm
  1,029,869,000  Refresh Worker for Provider: mgmt-int-osp
  1,030,563,000  Refresh Worker for Provider: mgmt-int-osp Cinder Manager
    994,106,000  Refresh Worker for Provider: mgmt-int-osp Network Manager
    942,179,000  Refresh Worker for Provider: mgmt-int-osp Swift Manager
  1,408,898,000  Refresh Worker for Provider: perfc
    334,847,000  Reporting Worker
    266,443,000  Schedule Worker
    693,336,000  User Interface Worker
    388,957,000  Web Services Worker
    238,978,000  Websocket Worker
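
For context on how per-worker PSS numbers like these can be gathered: a process's PSS can be approximated by summing the Pss: entries in /proc/<pid>/smaps. Below is a minimal Ruby sketch (run as root on the appliance); this is generic Linux accounting, not necessarily the exact tooling used to produce the tables above, and the process-matching pattern is illustrative only.

  # Approximate per-process PSS (bytes) by summing the "Pss:" lines in
  # /proc/<pid>/smaps; smaps reports values in kB.
  def pss_bytes(pid)
    File.foreach("/proc/#{pid}/smaps").inject(0) do |total, line|
      line.start_with?("Pss:") ? total + line.split[1].to_i * 1024 : total
    end
  end

  # Report PSS for processes whose command line looks like a MIQ server/worker
  # (the /MIQ|Worker/ match below is illustrative).
  Dir.glob("/proc/[0-9]*/cmdline").each do |path|
    pid = path.split("/")[2]
    begin
      cmd = File.read(path).tr("\0", " ").strip
      next unless cmd =~ /MIQ|Worker/
      puts format("%15d  %s", pss_bytes(pid), cmd)
    rescue Errno::ENOENT, Errno::EACCES
      next # process exited or is not readable
    end
  end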

Comment 19 Pradeep Kumar Surisetty 2017-11-22 16:07:45 UTC
(In reply to dmetzger from comment #18)
> The logs (workers) seem to indicate the 5.8 and 5.9 tests were not against
> the same provider, we seem to have a RHEV vs OpenStack:

They are against the same provider, RHV 3.5.

An OSP provider was there earlier and was deleted one week before.

Comment 20 Pradeep Kumar Surisetty 2017-11-22 16:09:51 UTC
(In reply to dmetzger from comment #17)
> The 5.8.23 appliance experienced 2 worker memory threshold exceptions during
> the test, while the 5.9.0 appliance experienced 5,888 memory threshold
> exceptions.
> 
> 5.8.23:
>          MiqGenericWorker, 2
>          Total Exceptions, 2
> 
> 5.9.0:
>          ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker,
> 3,918
>          MiqGenericWorker, 1,970
>          Total Exceptions, 5,888

Have noticed these exceptions.
After the exceptions, increased the memory threshold of the Generic and C&U Data Collector workers.

Comment 21 dmetzger 2017-11-22 16:17:09 UTC
Would like to see 5.9 data on a fresh appliance (though the OpenStack provider had been removed, there are lasting impacts - DB table size / fragmentation / etc.). Both 5.8 and 5.9 numbers should be from clean/fresh appliances for comparison's sake.

Comment 22 dmetzger 2017-12-08 14:44:54 UTC
Do we have data from clean runs to compare against yet (https://bugzilla.redhat.com/show_bug.cgi?id=1515906#c21)?

Comment 24 Sudhir Mallamprabhakara 2017-12-19 18:32:57 UTC
Himanshu,

Can you close this bug if it is resolved?

- Sudhir

