Bug 1754363

Summary:	[Scale] Engine generates excessive amount of dns configuration related sql queries
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Germano Veit Michel <gveitmic>
Component:	ovirt-engine	Assignee:	Dominik Holler <dholler>
Status:	CLOSED ERRATA	QA Contact:	mlehrer
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.3.5	CC:	dagur, dholler, khuh, mburman, mlehrer, mtessun, pelauter, rgolan, sgoodman
Target Milestone:	ovirt-4.4.0	Keywords:	Performance
Target Release:	---	Flags:	dagur: needinfo+ dagur: needinfo+ lsvaty: testing_plan_complete-
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	rhv-4.4.0-28	Doc Type:	Bug Fix
Doc Text:	With this release, the number of DNS configuration SQL queries that the Red Hat Virtualization Manager runs is significantly reduced, which improves the Manager's ability to scale.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-08-04 13:20:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Network	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1766815

Description Germano Veit Michel 2019-09-23 03:43:43 UTC

Description of problem:

At any given point in time, a customer postgres stat shows several of DNS/QoS config related queries running, exceeding by far all the others:

~~~
72 select * from getnameserversbydnsresolverconfigurationid($1)
40 select * from getqosbyqosid($1)
25 select * from getdnsresolverconfigurationbydnsresolverconfigurationid($1)
11 select * from getvdsbyvdsid($1,
 6 select * from getvdsstaticbyvdsid($1)
~~~

The customer has 136 Hosts in total and has external monitoring via API, checking networkattachments too for each host.

I've checked this in an idle environment (1 VM, 3 Hosts), and these calls are always at the top:

Over a 3 minute period, the top 5 are the following:
~~~
    597 select * from getnameserversbydnsresolverconfigurationid($1)
    596 select * from getdnsresolverconfigurationbydnsresolverconfigurationid($1)
    300 select * from getstoragedomainidsbystoragepoolidandstatus($1,
    299 select * from getvdsbyvdsid($1,
    180 select * from public.updatevds_interface_statistics(CAST
~~~

Over a 50 minute period:
~~~
   6563 select * from getnameserversbydnsresolverconfigurationid($1)
   6563 select * from getdnsresolverconfigurationbydnsresolverconfigurationid($1)
   3411 select * from getvdsbyvdsid($1,
   3045 select * from getstoragedomainidsbystoragepoolidandstatus($1,
   2448 select * from getqosbyqosid($1)
   1827 select * from public.updatevds_interface_statistics(CAST
   1573 select * from getclusterbyclusterid($1,
   1446 select * from getbrickdetailsbyid($1)
   1188 select * from getglusterserverbyserverid($1)
~~~

The top 2 correspond to ~25% of all SQL executed by the engine. Apparently these were not present on RHV 4.1 and lower and have been introduced by RFE BZ1160667. It looks a bit excessive that with just 3 hosts we have so many of these. I wonder if this is not scaling well as the hosts go up and generating too much load when compared to 4.1. The customer is complaining about 4.2 being slower than 4.1 and pg_stat repeatedly shows these queries being executed. Regardless, this looks like an opportunity for improvement.

Version-Release number of selected component (if applicable):
rhvm-4.3.5.4-0.1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
Have idle environment, enable postgres logging

Actual results:
DNS related SQL are 4x the next on the ranking.

Expected results:
Does the engine really need to read this all the time from the database?

Additional info:
The calls are very regular, and sometimes the engine floods postgres with these, with just 3 hosts, see:

2019-09-23 03:18:16.033 UTC LOG:  execute S_3: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.033 UTC DETAIL:  parameters: $1 = 'd9d14152-fa40-4fb8-b806-4355b47c7fc9'
2019-09-23 03:18:16.033 UTC LOG:  execute S_3: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.033 UTC DETAIL:  parameters: $1 = 'b3da8e69-d592-4d66-8440-244cbf60957a'
--
2019-09-23 03:18:16.038 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.038 UTC DETAIL:  parameters: $1 = 'c68b30b0-25e5-47ee-8b0b-d3bb485bdd17'
--
2019-09-23 03:18:16.092 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.092 UTC DETAIL:  parameters: $1 = 'c68b30b0-25e5-47ee-8b0b-d3bb485bdd17'
--
2019-09-23 03:18:16.114 UTC LOG:  execute S_3: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.114 UTC DETAIL:  parameters: $1 = 'b3da8e69-d592-4d66-8440-244cbf60957a'
--
2019-09-23 03:18:16.275 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.275 UTC DETAIL:  parameters: $1 = 'b3da8e69-d592-4d66-8440-244cbf60957a'
--
2019-09-23 03:18:16.278 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.278 UTC DETAIL:  parameters: $1 = 'c68b30b0-25e5-47ee-8b0b-d3bb485bdd17'
--
2019-09-23 03:18:16.280 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.280 UTC DETAIL:  parameters: $1 = 'd9d14152-fa40-4fb8-b806-4355b47c7fc9'
--
2019-09-23 03:18:16.285 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.285 UTC DETAIL:  parameters: $1 = 'b3da8e69-d592-4d66-8440-244cbf60957a'
--
2019-09-23 03:18:16.287 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.287 UTC DETAIL:  parameters: $1 = 'c68b30b0-25e5-47ee-8b0b-d3bb485bdd17'
--
2019-09-23 03:18:16.290 UTC LOG:  execute S_10: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.290 UTC DETAIL:  parameters: $1 = 'd9d14152-fa40-4fb8-b806-4355b47c7fc9'
--
2019-09-23 03:18:16.750 UTC LOG:  execute S_3: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.750 UTC DETAIL:  parameters: $1 = 'd9d14152-fa40-4fb8-b806-4355b47c7fc9'
--
2019-09-23 03:18:16.752 UTC LOG:  execute S_3: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.752 UTC DETAIL:  parameters: $1 = 'c68b30b0-25e5-47ee-8b0b-d3bb485bdd17'
--
2019-09-23 03:18:16.754 UTC LOG:  execute S_3: select * from  getnameserversbydnsresolverconfigurationid($1)
2019-09-23 03:18:16.754 UTC DETAIL:  parameters: $1 = 'b3da8e69-d592-4d66-8440-244cbf60957a'

Comment 6 mlehrer 2019-11-04 21:31:20 UTC

We are reproducing an enormous amount of small queries that are very similar to the same queries which Germano cites in this bz.

Specifically occurs when:

- the UI host view is shown we see a degradation of 8-10s from our usual 2-4s 
- Host details of the attached networks
- /ovirt-engine/api/clusters #takes over 3 minutes due to these queries.  


The following traces are available here for dev review.
https://drive.google.com/open?id=1d5rF58yLIU4bccZcq1DSaCvBcRx6cnEF

Comment 7 mlehrer 2019-11-06 17:47:27 UTC

Dominik shared a dal.jar version with specific queries removed to test the impact to UI and API behavior.
This resulted in some reduced response times for API calls and some UI behavior.

┌──────────────────────────────────────────────────────┬────────────────┬─────────────────┬
│                        Query                         │reduced queries │    no fix       │
├──────────────────────────────────────────────────────┼────────────────┼─────────────────┼
│ /ovirt-engine/api/clusters/                          │    163.7s      │    196s         │
│ /api/hosts                                           │  1.3 - 1.6s    │    2.8s         │
│ /api/hosts/819854ed-330e-4a42-b4c6-c78ee044dc8c/nics │     1.3s       │    2.4s         │
└──────────────────────────────────────────────────────┴────────────────┴─────────────────┴
Specifically commented out were:  select * from getqosbyqosid and dns queries removed.


In addition to these specific small queries, there are additional queries that occur when the UI is open on a host view or hosts main page from GenericApiGWTService.
These long-running GenericApiGWTService impact UI performance and lead to sluggish engine and postgres performance due to the number of queries being executed.

┌───────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬────────────────────────┬────────────────────┐
│                              Query                                │ Totaltime (ms) │ Execution count │ Time per execution(ms) │ Rows per execution │
├───────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼────────────────────────┼────────────────────┤
│ select * from getclusterbyclusterid(?, ?, ?)                      │       28,395.9 │          32,011 │                   0.89 │                1.0 │
│ select * from getvdsdynamicbyvdsid(?)                             │       16,080.8 │          32,011 │                   0.50 │                1.0 │
│ select * from getvdsinterfacebyname(?, ?)                         │       15,459.5 │          31,434 │                   0.49 │                1.0 │
│ select * from getvdsstaticbyvdsid(?)                              │       13,952.7 │          32,011 │                   0.44 │                1.0 │
│ select * from getnetworkattachmentbynicidandnetworkid(?, ?)       │       12,022.2 │          32,011 │                   0.38 │                1.0 │
│ select * from getnetwork_clusterbycluster_idandbynetwork_id(?, ?) │       11,276.4 │          32,011 │                   0.35 │                1.0 │
└───────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴────────────────────────┴────────────────────┘
Over time these GenericApiGWTService calls take longer, to complete and take more engine resources and have a larger impact as shown in image 'Slow_running_GenericApiGWTService_2.png' which shows the impact of sluggish responses bunching and taking further time and system resources because of the queries.

Can you please review https://drive.google.com/open?id=1d5rF58yLIU4bccZcq1DSaCvBcRx6cnEF 
Click 'reduce_dns_qos_queries' folder and look at query stats and trace entries for
- slow_ovirt-engine_GenericApiGWTService.html
- slow_ovirt-engine_GenericApiGWTService_example2.html

Additionally, I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1769463 for slow api/cluster calls which likely relates to this issue.

Trace files contain traces of queries, and stack traces.

Comment 10 RHV bug bot 2020-01-08 14:48:01 UTC

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 11 RHV bug bot 2020-01-08 15:15:25 UTC

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 12 Germano Veit Michel 2020-01-22 03:31:04 UTC

Dominik, I see the changes on 4.3 were abandoned here. Can you comment if there is any chance of 4.3.z on this one as well?

Thanks!

Comment 14 RHV bug bot 2020-01-24 19:49:46 UTC

WARN: Bug status (ON_QA) wasn't changed but the folowing should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops: Bug status (ON_QA) wasn't changed but the folowing should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 19 mlehrer 2020-05-24 08:34:47 UTC

Customer has 150 Hosts and about 100 networks attached on each host.
Uses external monitoring via API, checking networkattachments for each host.

Issues reported:
 - takes 1 minute to load the hosts
 - too many small queries leading to high load on the engine caused by postgres

Reproduced in 4.3 in Scale lab with 150 Hosts and 100 networks
- UI Host view took 10s+ high load
- UI Host Networks high load
- /ovirt-engine/api/clusters 196s
- /api/hosts 2.8s
- /api/hosts/819854ed-330e-4a42-b4c6-c78ee044dc8c/nics 2.4s
- UI -> browsing hosts 28s 
- UI -> click host detail 25s
- UI -> Host Reload from details screen 5s
- UI -> select host network details 10-12s to update
- Too many small queries causing high load

In 4.4 there have been fixes from network dev which has addressed the above issues except for cluster UI view [1]
Comparing the data from https://bugzilla.redhat.com/show_bug.cgi?id=1811869#c3
We see improved times on all flows related to this bz when comparing 4.4 data shown in this table below and above mentioned 4.3 items:

+-------------------------------------------------------------------------------------+-------------------+------------------+
|         150 Hosts / 100 networks per host                                           |      Firefox      |      Chrome      |
+-------------------------------------------------------------------------------------+-------------------+------------------+
| UI click Hosts                                                                      | 5s                | 6s               |
| UI select specific host                                                             | 1s                | 1s               |
| UI View list of Interface                                                           | 1s                | 2s               |
| UI Expand interfaces of host                                                        | 1s                | 1s               |
| UI Click Events tab on hosts                                                        | 1s                | 2s               |
| UI return to Hosts view from host detail view                                       | 5s                | 6s               |
| UI Load Dashboard Page from Hosts                                                   | 1s                | 4s               |
| UI Setup Host networks                                                              | 12s               | 14s              |
|                                                                                     |                   |                  |
|                                                                                     | client across WAN | client on engine |
| API /ovirt-engine/api/hosts                                                         | 10s               | 0.5s             |
| API /ovirt-engine/api/hosts/2ece4045-d94d-4f25-afd7-f6966e09ec12/networkattachments | 6s                | 0.3s             |
| API /ovirt-engine/api/hosts/2ece4045-d94d-4f25-afd7-f6966e09ec12/nics               | 4s                | 1.1s             |
+-------------------------------------------------------------------------------------+-------------------+------------------+

Measurements shown above used hosts which had 100, and another host with had 200 networks per interface.

Summarizing related BZs


Total # of getqosbyqosid reduced
 1811865 – [Scale] Host Monitoring generates excessive amount of qos related sql queries => idle query check comparison -- Verified

#API query time of clusters with lots of networks reduced from minutes to less than 1s
1769463 – [Scale] Slow performance for api/clusters when many networks devices are present => API call of clusters API  -- Verified


#Acceptable UI  / API performance with 150 hosts with 100 networks each
1811869 – [Scale] Webadmin\REST for host interface list response time is too long because of excessive amount of qos related sql queries => looking at hosts interface UI & API call for host interface api call - Verified

#Able to add work with host with 200 networks
1766193 – [Scale] Attaching many networks to host Interface in single api call fails => network API batches and restart  - Verified

[1]# UI Cluster view - still problematic moved back to 'Post'
1811866 – [Scale] Webadmin clusters list view response time is too long because of excessive amount of qos related sql queries => UI list clusters view


Due to the improvements seen in the above moving this bz to verified.

Comment 23 errata-xmlrpc 2020-08-04 13:20:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247