Description of problem:

1) Added RHEV 3.5.5 to CFME 5.8.23. Provisioned 50 VMs from a template and measured the time: it took 13 minutes to complete provisioning of the 50 VMs.

2) Added RHEV 3.5.5 to CFME 5.9.0. Provisioned 50 VMs from the same template and measured the time to completion: it took more than 20 minutes to complete provisioning of the 50 VMs with the same provider.

This looks like a performance regression. Noticed the following message counts in evm.log during this interval:

5.9.0:
-----
 50 Command: [ManageIQ::Providers::Redhat::InfraManager::Provision.poll_clone_complete]
450 Command: [MiqAeEngine.deliver]

5.8.23:
-----
 50 Command: [ManageIQ::Providers::Redhat::InfraManager::Provision.execute]
 48 Command: [ManageIQ::Providers::Redhat::InfraManager::Provision.poll_clone_complete]
418 Command: [MiqAeEngine.deliver]

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Provision 50 VMs on RHEV from CFME 5.8.23, then from CFME 5.9.0, and compare the provisioning times.

Actual results:
5.9.0 takes more time (> 20 min) than 5.8.23 (13 min).

Expected results:
Provisioning time on 5.9.0 should be comparable to 5.8.23.

Additional info:
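For reference, a minimal sketch of how such per-command counts can be tallied; it assumes only the `Command: [...]` text quoted above, and the evm.log path is a placeholder:

```python
# Tally "Command: [<name>]" occurrences in evm.log (log path is a placeholder).
import re
from collections import Counter

COMMAND_RE = re.compile(r"Command: \[([^\]]+)\]")

counts = Counter()
with open("evm.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = COMMAND_RE.search(line)
        if match:
            counts[match.group(1)] += 1

for command, count in counts.most_common():
    print(f"{count:>6}  Command: [{command}]")
```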
Please assess the impact of this issue and update the severity accordingly. Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition. If it's something like a tracker bug where it doesn't matter, please set the severity to Low.
Pradeep - The poll_clone_complete logging you point out in the description would mean that we are waiting longer on the provider.

Can you confirm:
1) You are provisioning the same template to the same provider?
2) The provider is under the same basic load during the test?
3) The VMs are being cloned to the same storage?
4) Only one CFME appliance is performing provisioning during the test?
An additional suggestion would be to perform the same test on a smaller set of VMs, say 5 at a time, and get timings from that. I would highly recommend **not** using auto-placement. You want to use the same host/storage combination for all tests to ensure comparable performance at the provider level.
(In reply to Greg McCullough from comment #4)
> Pradeep - The poll_clone_complete logging you point out in the description
> would mean that we are waiting longer on the provider.
>
> Can you confirm:
> 1) You are provisioning the same template to the same provider?
yes

> 2) The provider is under the same basic load during the test?
yes

> 3) The VMs are being cloned to the same storage?
yes

> 4) Only one CFME appliance is performing provisioning during the test?
yes
(In reply to Greg McCullough from comment #5)
> An additional suggestion would be to perform the same test on a smaller set
> of VMs, say 5 at a time, and get timings from that.

With 5 VMs, it took a similar amount of time on both versions.

> I would highly recommend **not** using auto-placement. You want to use the
> same host/storage combination for all tests to ensure comparable performance
> at the provider level.

Not choosing auto-placement.
Used the same host for both experiments (with both appliances).

Version   Number of VMs   Time taken for provisioning
-------   -------------   ---------------------------
5.8.23    30              8 min
5.9.0     30              13.54 min
5.8.23    5               3 min
5.9.0     5               3 min

The number of Generic and C&U Data Collector workers was increased to 4, and the memory threshold of these workers was increased to 800 MB.
Tina - Can you review this and maybe reach out to Dennis for assistance. One thought would be to disable the server roles that provisioning is not using, mainly C&U roles, Database Operations, Reporting and SmartState Analysis. My guess at the moment is this is more of an overall appliance issue that is more noticeable in provisioning, but not directly a provisioning problem. If this is the case I would recommend this ticket be assigned to the performance team and we can assist them as needed.
Dennis said he's going to do a quick triage and get back to me.
Can this be retested after applying https://github.com/ManageIQ/manageiq/issues/16432?
Also, can the test be repeated on 5.9.0 with C&U turned off? The 5.9.0 logs show an issue with workers exceeding their memory threshold, which was not happening in 5.8.23 (only 2 exceptions there):

Count   Worker
-----   ------
3,918   ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker
1,970   MiqGenericWorker
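A rough way to reproduce these per-worker counts is sketched below; the exact wording of the memory-threshold log message is an assumption here, so the match strings may need adjusting to the actual evm.log text:

```python
# Count memory-threshold exceptions per worker class in evm.log.
# Assumption: each such line mentions "memory threshold" and names the worker class.
import re
from collections import Counter

WORKER_RE = re.compile(r"\b((?:\w+::)*\w*Worker)\b")

counts = Counter()
with open("evm.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "memory threshold" not in line.lower():
            continue
        match = WORKER_RE.search(line)
        if match:
            counts[match.group(1)] += 1

for worker, count in counts.most_common():
    print(f"{count:>6}  {worker}")
print(f"{sum(counts.values()):>6}  Total Exceptions")
```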
Created attachment 1357619 [details] 5.8.23 appliance memory utilization
Created attachment 1357625 [details] 5.9.0 appliance memory utilization
The 5.8.23 appliance experienced 2 worker memory threshold exceptions during the test, while the 5.9.0 appliance experienced 5,888 memory threshold exceptions.

5.8.23:
  MiqGenericWorker, 2
  Total Exceptions, 2

5.9.0:
  ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker, 3,918
  MiqGenericWorker, 1,970
  Total Exceptions, 5,888
The logs (workers) seem to indicate the 5.8 and 5.9 tests were not against the same provider; we seem to have RHEV vs. OpenStack:

5.8.23 Max PSS By Worker Type:
  C&U Metrics Collector for RHEV                 350,048,000
  C&U Metrics Processor                          348,374,000
  Ems 1                                          650,667,000
  Ems 2                                          630,314,000
  Event Handler                                  278,256,000
  Event Monitor for Providers: perfc             800,623,000
  Event Monitor for Providers: RHHI Test One     470,741,000
  Event Monitor for Providers: RHHI Test Two     438,083,000
  Generic Worker                                 839,382,000
  MIQ Server                                     832,181,000
  Priority Worker                                397,308,000
  Refresh Worker for Providers: perfc          1,000,211,000
  Refresh Worker for Providers: RHHI Test One    745,980,000
  Refresh Worker for Providers: RHHI Test Two    709,486,000
  Reporting Worker                               266,351,000
  Schedule Worker                                231,704,000
  User Interface Worker                          465,425,000
  Web Services Worker                            261,528,000
  Websocket Worker                               219,711,000

5.9.0 Max PSS By Worker Type:
  C&U Metrics Collector for Openstack                          383,076,000
  C&U Metrics Collector for Openstack Network                  233,342,000
  C&U Metrics Collector for RHEV                                920,028,000
  C&U Metrics Processor                                         410,289,000
  Event Handler                                                 330,071,000
  Event Monitor for Provider: boston-rhev                       609,768,000
  Event Monitor for Provider: boston-rhevm                      806,575,000
  Event Monitor for Provider: mgmt-int-osp                      230,397,000
  Event Monitor for Provider: mgmt-int-osp Network Manager      227,684,000
  Event Monitor for Provider: perfc                             842,681,000
  Generic Worker                                                905,620,000
  MIQ Server                                                    833,942,000
  Priority Worker                                               487,529,000
  Refresh Worker for Provider: boston-rhev                      630,986,000
  Refresh Worker for Provider: boston-rhevm                     853,628,000
  Refresh Worker for Provider: mgmt-int-osp                   1,029,869,000
  Refresh Worker for Provider: mgmt-int-osp Cinder Manager    1,030,563,000
  Refresh Worker for Provider: mgmt-int-osp Network Manager     994,106,000
  Refresh Worker for Provider: mgmt-int-osp Swift Manager       942,179,000
  Refresh Worker for Provider: perfc                          1,408,898,000
  Reporting Worker                                              334,847,000
  Schedule Worker                                               266,443,000
  User Interface Worker                                         693,336,000
  Web Services Worker                                           388,957,000
  Websocket Worker                                              238,978,000
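For reproducibility, a minimal sketch of the max-PSS-per-worker aggregation; the input file and its two-column CSV layout (worker type, PSS in bytes) are hypothetical placeholders for however the per-process memory samples were collected:

```python
# Aggregate the maximum PSS seen per worker type from a hypothetical CSV of samples.
import csv
from collections import defaultdict

max_pss = defaultdict(int)
with open("worker_pss_samples.csv", newline="") as samples:  # hypothetical input file
    for worker_type, pss_bytes in csv.reader(samples):
        max_pss[worker_type] = max(max_pss[worker_type], int(pss_bytes))

for worker_type, pss in sorted(max_pss.items()):
    print(f"{worker_type:<60} {pss:>15,}")
```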
(In reply to dmetzger from comment #18)
> The logs (workers) seem to indicate the 5.8 and 5.9 tests were not against
> the same provider; we seem to have RHEV vs. OpenStack:

They are against the same provider, RHV 3.5. The OSP provider was added earlier and was deleted one week before the test.
(In reply to dmetzger from comment #17)
> The 5.8.23 appliance experienced 2 worker memory threshold exceptions during
> the test, while the 5.9.0 appliance experienced 5,888 memory threshold
> exceptions.
>
> 5.8.23:
>   MiqGenericWorker, 2
>   Total Exceptions, 2
>
> 5.9.0:
>   ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker, 3,918
>   MiqGenericWorker, 1,970
>   Total Exceptions, 5,888

I have noticed these exceptions. After they appeared, the memory of the Generic and C&U Data Collector workers was increased.
Would like to see 5.9 data on a fresh appliance (even though the OpenStack provider had been removed, there are lasting impacts - DB table size, fragmentation, etc.). Both the 5.8 and 5.9 numbers should come from clean/fresh appliances for comparison's sake.
Do we have data from clean runs to compare against yet (https://bugzilla.redhat.com/show_bug.cgi?id=1515906#c21)?
Himanshu, can you close this bug if it is resolved? - Sudhir