Bug 1769463

Summary: [Scale] Slow performance for api/clusters when many network devices are present
Product: Red Hat Enterprise Virtualization Manager Reporter: mlehrer
Component: ovirt-engine Assignee: eraviv
Status: CLOSED ERRATA QA Contact: mlehrer
Severity: high Docs Contact:
Priority: high    
Version: 4.3.6 CC: bugs, dagur, dholler, eraviv, gveitmic, mtessun, rdlugyhe, rgolan
Target Milestone: ovirt-4.4.0 Keywords: Performance
Target Release: --- Flags: lsvaty: testing_plan_complete-
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rhv-4.4.0-28 Doc Type: Bug Fix
Doc Text:
Previously, in a large environment, the oVirt REST API responded slowly to a request for the cluster list. The slowness was caused by processing a large amount of surplus data from the engine database about out-of-sync hosts in the cluster, data that was ultimately not included in the response. The current release fixes this issue: the query excludes the surplus data, and the API responds quickly.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-04 13:21:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1766815    
Attachments:
Description Flags
engine, postgres, relevant trace reports none

Description mlehrer 2019-11-06 16:49:09 UTC
Created attachment 1633388 [details]
engine, postgres, relevant trace reports

Description of problem:
/ovirt-engine/api/clusters takes 196s to complete when many networks are attached to many hosts.

Env has:
10 Clusters each in its own DC
538 Hosts, with 12 FC domains
4k VMs

Networks have been created and attached in large amounts to the following clusters as described:

1 Cluster has 10 physical hosts, with 150 networks assigned per host
1 Cluster has  7 physical hosts, with 150 networks assigned per host
1 Cluster has  8 physical hosts, with 300 networks assigned per host
1 Cluster has 148 nested hosts, with 150 networks assigned per host

[root@rhev-green-01 ~]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT count(*) from vds_interface_view"
 count
-------
 32582


Issue:
Total time is 196s, of which 144s are spent executing 351,000 queries. Of those 351,000 queries, 1,265 are select * from getqosbyqosid(?); this will likely be addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1754363

Testing was also done with a modified dal.jar that removes 'select * from getqosbyqosid(?)' to measure its impact.
With select * from getqosbyqosid(?) removed, total time drops to 163s, of which 114s are spent querying.
The most significant queries are:


┌───────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬────────────────────────┬────────────────────┐
│                              Query                                │ Totaltime (ms) │ Execution count │ Time per execution(ms) │ Rows per execution │
├───────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼────────────────────────┼────────────────────┤
│ select * from getclusterbyclusterid(?, ?, ?)                      │       22,936.7 │          32,011 │                   0.72 │                1.0 │
│ select * from getvdsdynamicbyvdsid(?)                             │       13,423.1 │          32,011 │                   0.42 │                1.0 │
│ select * from getvdsinterfacebyname(?, ?)                         │       12,496.6 │          31,434 │                   0.40 │                1.0 │
│ select * from getvdsstaticbyvdsid(?)                              │        9,655.7 │          32,011 │                   0.30 │                1.0 │
│ select * from getnetworkattachmentbynicidandnetworkid(?, ?)       │        9,651.1 │          32,011 │                   0.30 │                1.0 │
│ select * from getnetwork_clusterbycluster_idandbynetwork_id(?, ?) │        9,140.0 │          32,011 │                   0.29 │                1.0 │
│ select * from getinterfacesbyclusterid(?)                         │        1,332.7 │              10 │                  133.3 │            3,258.2 │
│ select * from getallnetworkbyclusterid(?, ?, ?)                   │           33.8 │              10 │                    3.4 │              144.7 │
└───────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴────────────────────────┴────────────────────┘
Note the execution count totals per query where only 1 row is returned.
For more details see 'Query stats' in the trace summary attached.
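To make the note above concrete, here is a small sketch that sums the table rows by access pattern. The figures are copied from the table; the function name and the execution-count threshold of 1,000 are assumptions chosen only for illustration and are not part of the trace tooling.

```python
# Rows copied from the query table above: (query, total_ms, executions).
QUERY_STATS = [
    ("getclusterbyclusterid", 22936.7, 32011),
    ("getvdsdynamicbyvdsid", 13423.1, 32011),
    ("getvdsinterfacebyname", 12496.6, 31434),
    ("getvdsstaticbyvdsid", 9655.7, 32011),
    ("getnetworkattachmentbynicidandnetworkid", 9651.1, 32011),
    ("getnetwork_clusterbycluster_idandbynetwork_id", 9140.0, 32011),
    ("getinterfacesbyclusterid", 1332.7, 10),
    ("getallnetworkbyclusterid", 33.8, 10),
]


def split_by_pattern(stats, threshold=1000):
    """Sum time and executions for per-row lookups versus bulk queries.

    A query counts as a per-row lookup when it ran more than `threshold`
    times; the threshold is an arbitrary cut-off for this illustration.
    """
    totals = {
        "per_row": {"total_ms": 0.0, "executions": 0},
        "bulk": {"total_ms": 0.0, "executions": 0},
    }
    for _query, total_ms, executions in stats:
        key = "per_row" if executions > threshold else "bulk"
        totals[key]["total_ms"] += total_ms
        totals[key]["executions"] += executions
    return totals


if __name__ == "__main__":
    for pattern, agg in split_by_pattern(QUERY_STATS).items():
        print(pattern, round(agg["total_ms"] / 1000, 1), "s in", agg["executions"], "executions")
```

With the table's figures, the six single-row lookups account for about 77.3s across 191,489 executions, while the two per-cluster queries account for about 1.4s across 20 executions.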


Version-Release number of selected component (if applicable):
rhv-release-4.3.6-7-001.noarch

How reproducible:
Reproduces in a scale environment.

Steps to Reproduce:
1. Populate 150 networks across 150 hosts
2. Issue an API request to /ovirt-engine/api/clusters
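A minimal sketch of how step 2 could be timed from a client, assuming HTTP Basic authentication against the engine. Only the /ovirt-engine/api/clusters path comes from this report; the function names, the base URL, and the credentials are placeholders, and this is not the tooling used to produce the numbers above.

```python
import base64
import time
import urllib.request


def fetch_clusters(base_url, user, password):
    """GET <base_url>/ovirt-engine/api/clusters using HTTP Basic auth.

    base_url, user and password are placeholders supplied by the caller.
    """
    url = base_url.rstrip("/") + "/ovirt-engine/api/clusters"
    request = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Accept", "application/xml")
    with urllib.request.urlopen(request) as response:
        return response.read()


def measure(fetch):
    """Call fetch() once and return (elapsed_seconds, body_length)."""
    start = time.perf_counter()
    body = fetch()
    elapsed = time.perf_counter() - start
    return elapsed, len(body)
```

Example call, with a hypothetical host name and credentials: `measure(lambda: fetch_clusters("https://engine.example.com", "admin@internal", "password"))`. urllib verifies TLS certificates by default, so the engine's CA certificate has to be trusted on the client for the request to succeed.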


Actual results:
Response takes way too long (100s+).

Expected results:
Response time expected between 10-25s max.

Additional info:
System is idle, cpu usage is in an acceptable range of 1.6 cores for ovirt-engine process.
No CPU, memory or IO saturation.
Please look at the trace reports located in the attached zip under slow_cluster_API/traces_as_html_report. Additional engine and postgres logs are included in the zip.

Comment 1 Dominik Holler 2019-11-08 11:22:49 UTC
The code introduced to address UI bug 1613702 added the expensive calculation of out-of-sync hosts to searchClusters().
The problem in this bug is that searchClusters() is triggered not only by UI code, but also by the REST API implementation.
The fix for this bug would be to avoid the expensive calculation of out-of-sync hosts during GET /ovirt-engine/api/clusters.
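The idea in this comment can be sketched roughly as follows. This is not the engine's Java code and the actual patch may be structured differently; Cluster, search_clusters, is_out_of_sync and include_out_of_sync are hypothetical names used only to show a caller opting out of the per-host calculation.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster entity."""
    name: str
    host_ids: List[str]
    hosts_out_of_sync: Optional[int] = None  # None means "not calculated"


def search_clusters(
    clusters: List[Cluster],
    is_out_of_sync: Callable[[str], bool],
    include_out_of_sync: bool,
) -> List[Cluster]:
    """Return the clusters, running the per-host check only on request.

    is_out_of_sync stands for the expensive per-host calculation. A UI-style
    caller passes include_out_of_sync=True; a REST-style caller passes False
    and the expensive check is never invoked.
    """
    if not include_out_of_sync:
        return list(clusters)
    return [
        replace(c, hosts_out_of_sync=sum(1 for h in c.host_ids if is_out_of_sync(h)))
        for c in clusters
    ]
```

In this sketch the REST path never invokes the per-host check, which matches the description that the out-of-sync data was not included in the API response anyway.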

Comment 5 RHV bug bot 2020-01-08 14:49:49 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Open patch attached]

For more info please contact: infra

Comment 6 RHV bug bot 2020-01-08 15:18:24 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Open patch attached]

For more info please contact: infra

Comment 8 Germano Veit Michel 2020-01-22 03:29:23 UTC
Dominik, seeing the amount of changes here I'm afraid it is unlikely that these can be back-ported to 4.3. Can you comment if there is any chance?

Comment 13 mlehrer 2020-05-11 14:02:08 UTC
#Env: 150 Hosts with 100 networks
4.4.0-0.31.master.el8
HE environment with 200 nested hosts of which 150 hosts have 100 networks per host.
DWH separated and JVM set to 4G, with engine set with 200 pool connections / 250 db connections.



# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT count(*) from vds_interface_view"
 count 
-------
 15825
(1 row)


Cluster API: ovirt-engine/api/clusters takes 0.4s; previously this took minutes.

Response time and engine resources used in acceptable range.

Comment 18 errata-xmlrpc 2020-08-04 13:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247