Bug 1769463 - [Scale] Slow performance for api/clusters when many networks devices are present
Summary: [Scale] Slow performance for api/clusters when many networks devices are present
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: eraviv
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks: 1766815
Reported: 2019-11-06 16:49 UTC by mlehrer
Modified: 2024-03-25 15:29 UTC

Fixed In Version: rhv-4.4.0-28
Doc Type: Bug Fix
Doc Text:
Previously, in a large environment, the oVirt REST API responded slowly to a request for the cluster list. The slowness was caused by processing a large amount of surplus data from the engine database about out-of-sync hosts in the cluster, data that was ultimately not included in the response. The current release fixes this issue: the query excludes the surplus data, and the API responds quickly.
Clone Of:
Environment:
Last Closed: 2020-08-04 13:21:16 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments
engine, postgres, relevant trace reports (1.40 MB, application/gzip)
2019-11-06 16:49 UTC, mlehrer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5057111 0 None None None 2020-05-07 04:16:29 UTC
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:21:37 UTC
oVirt gerrit 105040 0 master ABANDONED core: NetworkImplUtil - replace vdsDynamic retrieval 2021-01-21 18:28:12 UTC
oVirt gerrit 105041 0 master MERGED core: NetworkImplUtil - refactor getCluster call 2021-01-21 18:28:12 UTC
oVirt gerrit 105042 0 master MERGED core: NetworkImplUtil - less cluster dao calls 2021-01-21 18:28:12 UTC
oVirt gerrit 105043 0 master MERGED core: NetworkImpUtil - less cluster dao calls test 2021-01-21 18:28:53 UTC
oVirt gerrit 105044 0 master ABANDONED core: HostSetupNetworksCommand - less cluster dao calls 2021-01-21 18:28:12 UTC
oVirt gerrit 105045 0 master ABANDONED core: vds\network iface query - less cluster dao calls 2021-01-21 18:28:13 UTC
oVirt gerrit 105046 0 master ABANDONED core: HostNicsUtil - less cluster dao calls 2021-01-21 18:28:13 UTC
oVirt gerrit 105047 0 master ABANDONED core: HostNetworkTopologyPersisterImpl - performance enhance (cluster) 2021-01-21 18:28:13 UTC
oVirt gerrit 105048 0 master ABANDONED core: network param transformer - less cluster dao calls 2021-01-21 18:28:13 UTC
oVirt gerrit 105168 0 master MERGED core: NetworkImplUtil - pull QOS retrieval up 2021-01-21 18:28:54 UTC
oVirt gerrit 105169 0 master MERGED core: NetworkImpUtil - less QOS dao calls 2021-01-21 18:28:13 UTC
oVirt gerrit 105196 0 master MERGED Avoid not required out-of-sync calculation 2021-01-21 18:28:13 UTC
oVirt gerrit 105197 0 master ABANDONED core: NetworkImplUtil - refactor for reuse (QOS) 2021-01-21 18:28:13 UTC
oVirt gerrit 105198 0 master ABANDONED core: HostSetupNetworksCommand - less QOS dao calls 2021-01-21 18:28:13 UTC
oVirt gerrit 105199 0 master ABANDONED core: vds\network iface query - less QOS dao calls 2021-01-21 18:28:13 UTC
oVirt gerrit 105200 0 master ABANDONED core: HostNicUtils - less QOS dao calls 2021-01-21 18:28:13 UTC
oVirt gerrit 105201 0 master ABANDONED core: network param transformer - less QOS dao calls 2021-01-21 18:28:14 UTC
oVirt gerrit 105202 0 master ABANDONED core: NetworkImplUtil - refactor get network attachment 2021-01-21 18:28:14 UTC
oVirt gerrit 105203 0 master ABANDONED core: NetworkAttachmentDao - add getAllForCluster API 2021-01-21 18:28:14 UTC
oVirt gerrit 105204 0 master ABANDONED core: NetworkImpUtil - remove singleton annotation 2021-01-21 18:28:14 UTC
oVirt gerrit 105205 0 master ABANDONED core: NetworkImplUtil - less net-attachment dao calls 2021-01-21 18:28:14 UTC
oVirt gerrit 106084 0 master ABANDONED core: remove unnecessary call to networkCluster dao 2021-01-21 18:28:55 UTC
oVirt gerrit 106168 0 master ABANDONED core: qos_sp.sql - insert or update or delete qos 2021-01-21 18:28:14 UTC
oVirt gerrit 106181 0 master MERGED webadmin: Avoid infinite progress indicator 2021-01-21 18:28:14 UTC

Description mlehrer 2019-11-06 16:49:09 UTC
Created attachment 1633388 [details]
engine, postgres, relevant trace reports

Description of problem:
/ovirt-engine/api/clusters takes 196s to complete when many networks are attached to many hosts.

Env has:
10 Clusters each in its own DC
538 Hosts, with 12 FC domains
4k VMs

Networks have been created and attached in large amounts to the following clusters as described:

1 Cluster has 10 physical hosts with 150 networks assigned per host
1 Cluster has  7 physical hosts with 150 networks assigned per host
1 Cluster has  8 physical hosts with 300 networks assigned per host
1 Cluster has 148 nested hosts with 150 networks assigned per host

[root@rhev-green-01 ~]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT count(*) from vds_interface_view"
 count
-------
 32582


Issue:
Total time is 196s, of which 144s are spent executing 351,000 queries. Of those 351,000 queries, 1,265 are select * from getqosbyqosid(?); this will likely be addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1754363

Testing was also done with a modified dal.jar that removes 'select * from getqosbyqosid(?)', to measure its impact.
With that query removed, total time drops to 163s, of which 114s are spent querying.
The most significant queries are:


┌───────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬────────────────────────┬────────────────────┐
│                              Query                                │ Totaltime (ms) │ Execution count │ Time per execution(ms) │ Rows per execution │
├───────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼────────────────────────┼────────────────────┤
│ select * from getclusterbyclusterid(?, ?, ?)                      │       22,936.7 │          32,011 │                   0.72 │                1.0 │
│ select * from getvdsdynamicbyvdsid(?)                             │       13,423.1 │          32,011 │                   0.42 │                1.0 │
│ select * from getvdsinterfacebyname(?, ?)                         │       12,496.6 │          31,434 │                   0.40 │                1.0 │
│ select * from getvdsstaticbyvdsid(?)                              │        9,655.7 │          32,011 │                   0.30 │                1.0 │
│ select * from getnetworkattachmentbynicidandnetworkid(?, ?)       │        9,651.1 │          32,011 │                   0.30 │                1.0 │
│ select * from getnetwork_clusterbycluster_idandbynetwork_id(?, ?) │        9,140.0 │          32,011 │                   0.29 │                1.0 │
│ select * from getinterfacesbyclusterid(?)                         │        1,332.7 │              10 │                  133.3 │            3,258.2 │
│ select * from getallnetworkbyclusterid(?, ?, ?)                   │           33.8 │              10 │                    3.4 │              144.7 │
└───────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴────────────────────────┴────────────────────┘
Note the high execution counts for the queries that return only 1 row per execution.
For more details see 'Query stats' in the trace summary attached.
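The arithmetic behind that note can be checked with a short script. This is only a sketch: the figures are copied from the table above, while the `summarize` helper and its 1,000-execution threshold for separating per-row lookups from bulk queries are assumptions made for illustration.

```python
# Figures copied from the query-stats table: (query, total ms, execution count).
QUERY_STATS = [
    ("getclusterbyclusterid", 22936.7, 32011),
    ("getvdsdynamicbyvdsid", 13423.1, 32011),
    ("getvdsinterfacebyname", 12496.6, 31434),
    ("getvdsstaticbyvdsid", 9655.7, 32011),
    ("getnetworkattachmentbynicidandnetworkid", 9651.1, 32011),
    ("getnetwork_clusterbycluster_idandbynetwork_id", 9140.0, 32011),
    ("getinterfacesbyclusterid", 1332.7, 10),
    ("getallnetworkbyclusterid", 33.8, 10),
]


def summarize(stats, per_row_threshold=1000):
    """Split queries into per-row lookups and bulk queries (threshold is an assumption)."""
    per_row = [s for s in stats if s[2] >= per_row_threshold]
    bulk = [s for s in stats if s[2] < per_row_threshold]
    return {
        "per_row_ms": sum(ms for _, ms, _ in per_row),
        "per_row_calls": sum(n for _, _, n in per_row),
        "bulk_ms": sum(ms for _, ms, _ in bulk),
        "bulk_calls": sum(n for _, _, n in bulk),
    }


summary = summarize(QUERY_STATS)
for name, total_ms, count in QUERY_STATS:
    # Each call is cheap; the cost comes from the number of executions.
    print(f"{name:48s} {total_ms / count:8.2f} ms/call x {count}")
print(summary)
```

Each single-row lookup is fast on its own, so the totals are driven by execution count rather than by any one slow query.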


Version-Release number of selected component (if applicable):
rhv-release-4.3.6-7-001.noarch

How reproducible:
Reproduces in a scale environment

Steps to Reproduce:
1. Populate 150 networks across 150 hosts
2. Issue GET /ovirt-engine/api/clusters
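One way to time step 2 is sketched below. It assumes HTTP basic authentication and an engine certificate trusted by the local Python installation; the `timed` and `fetch_clusters` helpers and the host name in the usage comment are hypothetical, not part of this report.

```python
import time
import urllib.request


def timed(fetch):
    """Run fetch() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fetch()
    return result, time.perf_counter() - start


def fetch_clusters(base_url, auth_header, timeout=600):
    """GET <base_url>/ovirt-engine/api/clusters and return the raw response body."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/ovirt-engine/api/clusters",
        headers={"Accept": "application/xml", "Authorization": auth_header},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Hypothetical usage (host name and credentials are placeholders):
# body, seconds = timed(lambda: fetch_clusters("https://engine.example.com", "Basic <base64 credentials>"))
# print(f"api/clusters answered in {seconds:.1f}s")
```

The generous timeout is deliberate, since the response in this environment takes minutes.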


Actual results:
Response takes way too long (100s+)

Expected results:
Response time expected between 10-25s max.

Additional info:
System is idle, cpu usage is in an acceptable range of 1.6 cores for ovirt-engine process.
No CPU, memory or IO saturation.
Please look at the trace reports in the attached archive under slow_cluster_API/traces_as_html_report; additional engine and postgres logs are included in the archive.

Comment 1 Dominik Holler 2019-11-08 11:22:49 UTC
The code introduced to address UI bug 1613702 added an expensive calculation of out-of-sync hosts to searchClusters().
The problem in this bug is that searchClusters() is triggered not only by UI code, but also by the REST API implementation.
The fix for this bug would be to avoid the expensive calculation of out-of-sync hosts during GET /ovirt-engine/api/clusters.
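The effect of that proposal can be illustrated with a toy model. This is a sketch only, not the engine's Java code: `CountingDao`, `search_clusters`, and the `include_out_of_sync` flag are hypothetical names, and the model merely counts simulated database calls.

```python
class CountingDao:
    """Hypothetical stand-in for the engine DAOs; counts simulated database calls."""

    def __init__(self, hosts_per_cluster, nics_per_host):
        self.hosts_per_cluster = hosts_per_cluster  # cluster name -> host count
        self.nics_per_host = nics_per_host
        self.calls = 0

    def get_clusters(self):
        self.calls += 1
        return list(self.hosts_per_cluster)

    def is_nic_out_of_sync(self, cluster, host, nic):
        self.calls += 1  # one lookup per interface, mirroring the per-row queries
        return False


def search_clusters(dao, include_out_of_sync):
    """Return cluster entries; compute out-of-sync hosts only when asked to."""
    clusters = []
    for cluster in dao.get_clusters():
        entry = {"name": cluster}
        if include_out_of_sync:
            entry["hosts_out_of_sync"] = [
                host
                for host in range(dao.hosts_per_cluster[cluster])
                if any(
                    dao.is_nic_out_of_sync(cluster, host, nic)
                    for nic in range(dao.nics_per_host)
                )
            ]
        clusters.append(entry)
    return clusters


ui_dao = CountingDao({"c1": 10, "c2": 7}, nics_per_host=150)
search_clusters(ui_dao, include_out_of_sync=True)    # UI path: per-interface lookups

rest_dao = CountingDao({"c1": 10, "c2": 7}, nics_per_host=150)
rest_result = search_clusters(rest_dao, include_out_of_sync=False)  # REST path: skipped
print(ui_dao.calls, rest_dao.calls)
```

In this model the call count on the UI path grows with hosts times interfaces, while the REST path issues a single call because its response never uses the out-of-sync data.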

Comment 5 RHV bug bot 2020-01-08 14:49:49 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Open patch attached]

For more info please contact: infra

Comment 6 RHV bug bot 2020-01-08 15:18:24 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Open patch attached]

For more info please contact: infra

Comment 8 Germano Veit Michel 2020-01-22 03:29:23 UTC
Dominik, seeing the amount of changes here I'm afraid it is unlikely that these can be back-ported to 4.3. Can you comment if there is any chance?

Comment 13 mlehrer 2020-05-11 14:02:08 UTC
#Env: 150 Hosts with 100 networks
4.4.0-0.31.master.el8
HE environment with 200 nested hosts of which 150 hosts have 100 networks per host.
DWH separated and JVM set to 4G, with engine set with 200 pool connections / 250 db connections.



# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT count(*) from vds_interface_view"
 count
-------
 15825
(1 row)


Cluster API: ovirt-engine/api/clusters takes 0.4s; previously this took minutes.

Response time and engine resources used in acceptable range.

Comment 18 errata-xmlrpc 2020-08-04 13:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

