Bug 1754363
Summary: | [Scale] Engine generates excessive amount of dns configuration related sql queries | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
Component: | ovirt-engine | Assignee: | Dominik Holler <dholler> |
Status: | CLOSED ERRATA | QA Contact: | mlehrer |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.3.5 | CC: | dagur, dholler, khuh, mburman, mlehrer, mtessun, pelauter, rgolan, sgoodman |
Target Milestone: | ovirt-4.4.0 | Keywords: | Performance |
Target Release: | --- | Flags: | dagur: needinfo+ lsvaty: testing_plan_complete- |
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | rhv-4.4.0-28 | Doc Type: | Bug Fix |
Doc Text: |
With this release, the number of DNS configuration SQL queries that the Red Hat Virtualization Manager runs is significantly reduced, which improves the Manager's ability to scale.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-08-04 13:20:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1766815 |
Description
Germano Veit Michel
2019-09-23 03:43:43 UTC
We are reproducing an enormous number of small queries that are very similar to the queries Germano cites in this bz. This specifically occurs when:
- the UI host view is shown (we see a degradation of 8-10s from our usual 2-4s)
- the Host details of the attached networks are shown
- /ovirt-engine/api/clusters is called (takes over 3 minutes due to these queries)

The following traces are available here for dev review:
https://drive.google.com/open?id=1d5rF58yLIU4bccZcq1DSaCvBcRx6cnEF

Dominik shared a dal.jar version with specific queries removed to test the impact on UI and API behavior. This resulted in reduced response times for API calls and for some UI behavior.

┌──────────────────────────────────────────────────────┬─────────────────┬────────┐
│ Query                                                │ reduced queries │ no fix │
├──────────────────────────────────────────────────────┼─────────────────┼────────┤
│ /ovirt-engine/api/clusters/                          │ 163.7s          │ 196s   │
│ /api/hosts                                           │ 1.3 - 1.6s      │ 2.8s   │
│ /api/hosts/819854ed-330e-4a42-b4c6-c78ee044dc8c/nics │ 1.3s            │ 2.4s   │
└──────────────────────────────────────────────────────┴─────────────────┴────────┘

Specifically, select * from getqosbyqosid was commented out and the dns queries were removed.

In addition to these specific small queries, there are additional queries from GenericApiGWTService that occur when the UI is open on a host view or on the hosts main page. These long-running GenericApiGWTService calls impact UI performance and lead to sluggish engine and postgres performance due to the number of queries being executed.

┌───────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬────────────────────────┬────────────────────┐
│ Query                                                             │ Totaltime (ms) │ Execution count │ Time per execution(ms) │ Rows per execution │
├───────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼────────────────────────┼────────────────────┤
│ select * from getclusterbyclusterid(?, ?, ?)                      │ 28,395.9       │ 32,011          │ 0.89                   │ 1.0                │
│ select * from getvdsdynamicbyvdsid(?)                             │ 16,080.8       │ 32,011          │ 0.50                   │ 1.0                │
│ select * from getvdsinterfacebyname(?, ?)                         │ 15,459.5       │ 31,434          │ 0.49                   │ 1.0                │
│ select * from getvdsstaticbyvdsid(?)                              │ 13,952.7       │ 32,011          │ 0.44                   │ 1.0                │
│ select * from getnetworkattachmentbynicidandnetworkid(?, ?)       │ 12,022.2       │ 32,011          │ 0.38                   │ 1.0                │
│ select * from getnetwork_clusterbycluster_idandbynetwork_id(?, ?) │ 11,276.4       │ 32,011          │ 0.35                   │ 1.0                │
└───────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴────────────────────────┴────────────────────┘

Over time these GenericApiGWTService calls take longer to complete, consume more engine resources and have a larger impact, as shown in the image 'Slow_running_GenericApiGWTService_2.png': sluggish responses bunch up and take further time and system resources because of the queries.

Can you please review https://drive.google.com/open?id=1d5rF58yLIU4bccZcq1DSaCvBcRx6cnEF
Click the 'reduce_dns_qos_queries' folder and look at the query stats and trace entries for:
- slow_ovirt-engine_GenericApiGWTService.html
- slow_ovirt-engine_GenericApiGWTService_example2.html

Additionally, I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1769463 for slow api/cluster calls, which likely relates to this issue. The trace files contain traces of queries and stack traces.
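To make the scale of the GenericApiGWTService query statistics above easier to see, here is a minimal sketch that adds up the six rows of that table. The numbers are copied from the table; the names QUERY_STATS and summarize are only illustrative and are not part of ovirt-engine. It is meant to show why queries that each take well under 1 ms still add up to a noticeable load, not to describe how the engine measures anything.

```python
# Rows copied from the GenericApiGWTService query statistics table:
# (stored procedure, total time in ms, execution count).
# QUERY_STATS and summarize() are illustrative names, not engine code.
QUERY_STATS = [
    ("getclusterbyclusterid", 28395.9, 32011),
    ("getvdsdynamicbyvdsid", 16080.8, 32011),
    ("getvdsinterfacebyname", 15459.5, 31434),
    ("getvdsstaticbyvdsid", 13952.7, 32011),
    ("getnetworkattachmentbynicidandnetworkid", 12022.2, 32011),
    ("getnetwork_clusterbycluster_idandbynetwork_id", 11276.4, 32011),
]


def summarize(stats):
    """Return total time (ms), total call count and per-call time (ms)."""
    total_ms = sum(ms for _, ms, _ in stats)
    total_calls = sum(count for _, _, count in stats)
    per_call_ms = {name: round(ms / count, 2) for name, ms, count in stats}
    return total_ms, total_calls, per_call_ms


if __name__ == "__main__":
    total_ms, total_calls, per_call_ms = summarize(QUERY_STATS)
    for name, ms in per_call_ms.items():
        print(f"{name}: {ms:.2f} ms per call")
    print(f"{total_calls} calls, {total_ms / 1000:.1f} s of query time in total")
```

The per-call values reproduce the "Time per execution(ms)" column, and the six queries together account for roughly 97 s of query time over about 191,000 executions, which is consistent with the observation that the problem is the number of queries rather than the cost of any single one.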
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:
[Found non-acked flags: '{}', ]
For more info please contact: rhv-devops

Dominik, I see the changes on 4.3 were abandoned here. Can you comment if there is any chance of 4.3.z on this one as well? Thanks!

WARN: Bug status (ON_QA) wasn't changed but the folowing should be fixed:
[Found non-acked flags: '{}', ]
For more info please contact: rhv-devops

The customer has 150 Hosts with about 100 networks attached to each host, and uses external monitoring via the API, checking networkattachments for each host.
Issues reported:
- takes 1 minute to load the hosts
- too many small queries leading to high load on the engine caused by postgres

Reproduced in 4.3 in the Scale lab with 150 Hosts and 100 networks:
- UI Host view took 10s+, high load
- UI Host Networks: high load
- /ovirt-engine/api/clusters 196s
- /api/hosts 2.8s
- /api/hosts/819854ed-330e-4a42-b4c6-c78ee044dc8c/nics 2.4s
- UI -> browsing hosts 28s
- UI -> click host detail 25s
- UI -> Host Reload from details screen 5s
- UI -> select host network details 10-12s to update
- Too many small queries causing high load

In 4.4 there have been fixes from network dev which have addressed the above issues, except for the cluster UI view [1].

Comparing the data from https://bugzilla.redhat.com/show_bug.cgi?id=1811869#c3, we see improved times on all flows related to this bz when comparing the 4.4 data shown in the tables below with the 4.3 items mentioned above:

+-----------------------------------------------+---------+--------+
| 150 Hosts / 100 networks per host             | Firefox | Chrome |
+-----------------------------------------------+---------+--------+
| UI click Hosts                                | 5s      | 6s     |
| UI select specific host                       | 1s      | 1s     |
| UI View list of Interface                     | 1s      | 2s     |
| UI Expand interfaces of host                  | 1s      | 1s     |
| UI Click Events tab on hosts                  | 1s      | 2s     |
| UI return to Hosts view from host detail view | 5s      | 6s     |
| UI Load Dashboard Page from Hosts             | 1s      | 4s     |
| UI Setup Host networks                        | 12s     | 14s    |
+-----------------------------------------------+---------+--------+

+-------------------------------------------------------------------------------------+-------------------+------------------+
| 150 Hosts / 100 networks per host                                                   | client across WAN | client on engine |
+-------------------------------------------------------------------------------------+-------------------+------------------+
| API /ovirt-engine/api/hosts                                                         | 10s               | 0.5s             |
| API /ovirt-engine/api/hosts/2ece4045-d94d-4f25-afd7-f6966e09ec12/networkattachments | 6s                | 0.3s             |
| API /ovirt-engine/api/hosts/2ece4045-d94d-4f25-afd7-f6966e09ec12/nics               | 4s                | 1.1s             |
+-------------------------------------------------------------------------------------+-------------------+------------------+

The measurements shown above used hosts which had 100 networks per interface, and another host which had 200 networks per interface.

Summarizing related BZs:

# Total number of getqosbyqosid queries reduced
1811865 – [Scale] Host Monitoring generates excessive amount of qos related sql queries => idle query check comparison -- Verified

# API query time of clusters with lots of networks reduced from minutes to less than 1s
1769463 – [Scale] Slow performance for api/clusters when many networks devices are present => API call of clusters API -- Verified

# Acceptable UI / API performance with 150 hosts with 100 networks each
1811869 – [Scale] Webadmin\REST for host interface list response time is too long because of excessive amount of qos related sql queries => looking at hosts interface UI & API call for host interface api call -- Verified

# Able to add/work with host with 200 networks
1766193 – [Scale] Attaching many networks to host Interface in single api call fails => network API batches and restart -- Verified

[1] # UI Cluster view - still problematic, moved back to 'Post'
1811866 – [Scale] Webadmin clusters list view response time is too long because of excessive amount of qos related sql queries => UI list clusters view

Due to the improvements seen in the above, moving this bz to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247 |