Bug 1345944

Summary: [scale] - dashboard landing page generate inefficient query - COALESCE(AVG(cpu_per_socket)
Product: [oVirt] ovirt-engine-dashboard Reporter: Eldad Marciano <emarcian>
Component: CoreAssignee: Nobody <nobody>
Status: CLOSED DEFERRED QA Contact: Eldad Marciano <emarcian>
Severity: high Docs Contact:
Priority: high    
Version: unspecifiedCC: amureini, bugs, emarcian, gklein, mperina, oourfali, sradco, ylavi
Target Milestone: ---Flags: rule-engine: planning_ack+
rule-engine: devel_ack+
pnovotny: testing_ack+
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-20 07:08:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Metrics RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1365443    
Bug Blocks:    

Description Eldad Marciano 2016-06-13 13:20:45 UTC
Description of problem:
dashboard landing generate the following long query

exec time: 46.2

SQL:
SELECT vm_name AS name, COALESCE(AVG(cpu_per_socket), 0) AS cpu_per_socket, COALESCE(AVG(number_of_sockets), 0) AS number_of_sockets, COALESCE(AVG(cpu_usage_percent), 0) AS cpu_usage_percent, MAX(previous_cpu_percent) AS previous_cpu_percent FROM v4_0_statistics_vms_resources_usage_samples samples INNER JOIN v4_0_configuration_history_vms vms ON samples.vm_id = vms.vm_id LEFT JOIN (SELECT samples.vm_id, COALESCE(AVG(cpu_usage_percent), 0) AS previous_cpu_percent FROM v4_0_statistics_vms_resources_usage_samples samples INNER JOIN v4_0_configuration_history_vms vms ON samples.vm_id = vms.vm_id WHERE samples.vm_status = 1 AND history_datetime >= (CURRENT_TIMESTAMP - INTERVAL '10 minute') AND history_datetime < (CURRENT_TIMESTAMP - INTERVAL '5 minute') AND vms.history_id IN (SELECT MAX(a.history_id) FROM v4_0_configuration_history_vms a GROUP BY a.vm_id) GROUP BY samples.vm_id) AS previous_trend ON vms.vm_id = previous_trend.vm_id WHERE samples.vm_status = 1 AND history_datetime >= (CURRENT_TIMESTAMP - INTERVAL '5 minute') AND history_datetime < CURRENT_TIMESTAMP AND vms.history_id IN (SELECT MAX(a.history_id) FROM v4_0_configuration_history_vms a GROUP BY a.vm_id) GROUP BY vm_name ORDER BY cpu_usage_percent DESC LIMIT 10

detailed report:
https://mojo.redhat.com/docs/DOC-1083326


Version-Release number of selected component (if applicable):
4.0.7

How reproducible:
100%

Steps to Reproduce:
1. large scale env (39 hosts, 3K vms)
2. fetch the dashboard.
3

Actual results:
very long response time and sometimes the UI get stuck 

Expected results:
fast response time, stable UI

Additional info:

Comment 1 Martin Perina 2016-07-04 13:10:41 UTC
Patch is included in ovirt-engine-4.0.1 branch, moving to ON_QA

Comment 2 Pavel Novotny 2016-07-07 08:23:01 UTC
Moving verification to Scale team, as they have the appropriate environment for this.

Comment 3 Eldad Marciano 2016-08-09 11:38:13 UTC
verified failed.

this query runs for 7.5 sec.
LOG:  duration: 7529.091 ms  execute <unnamed>: SELECT vm_name AS name, COALESCE(AVG(cpu_per_socket), 0) AS cpu_per_socket, COALESCE(AVG(number_of_sockets), 
0) AS number_of_sockets, COALESCE(AVG(cpu_usage_percent), 0) AS cpu_usage_percent, MAX(previous_cpu_percent) AS previous_cpu_percent FROM v4_0_statistics_vms
_resources_usage_samples AS samples INNER JOIN v4_0_configuration_history_vms AS vms ON samples.vm_id = vms.vm_id LEFT JOIN (SELECT samples.vm_id, COALESCE(A
VG(cpu_usage_percent), 0) AS previous_cpu_percent FROM v4_0_statistics_vms_resources_usage_samples AS samples INNER JOIN v4_0_configuration_history_vms AS vm
s ON samples.vm_id = vms.vm_id WHERE samples.vm_status = 1 AND samples.history_id IN (SELECT MAX(history_id) FROM v4_0_statistics_vms_resources_usage_samples
 AS a WHERE history_datetime >= CURRENT_TIMESTAMP - INTERVAL '10 minute' AND history_datetime <= CURRENT_TIMESTAMP - INTERVAL '5 minute' GROUP BY a.vm_id) GR
OUP BY samples.vm_id ) AS previous_trend ON vms.vm_id = previous_trend.vm_id WHERE samples.vm_status = 1 AND samples.history_id IN ( SELECT MAX(history_id) F
ROM v4_0_statistics_vms_resources_usage_samples AS a WHERE history_datetime >= CURRENT_TIMESTAMP - INTERVAL '5 minute' AND history_datetime <= CURRENT_TIMEST
AMP GROUP BY a.vm_id ) GROUP BY vm_name ORDER BY cpu_usage_percent DESC LIMIT 10

Comment 4 Oved Ourfali 2016-08-10 12:28:28 UTC
I don't think we can improve this further. 
Are you running with the original number of objects reported in the bug description? 
How long does it take the dashboard to load?

Comment 5 Eldad Marciano 2016-08-10 19:58:44 UTC
(In reply to Oved Ourfali from comment #4)
> I don't think we can improve this further. 
> Are you running with the original number of objects reported in the bug
> description? 
> How long does it take the dashboard to load?

it takes around 40sec to load.
we should re test it with 3.1K vms

Comment 6 Eldad Marciano 2016-09-04 10:41:31 UTC
still reproduced on 4.0.3-1

Comment 7 Oved Ourfali 2016-11-09 10:02:20 UTC
As I understand it is a lot of DWH efforts, it will be fixed by caching.
Moving to future for now.

Comment 8 Oved Ourfali 2016-11-23 09:24:00 UTC
Moving to DWH for their decision.
For now we will focus on caching on the UX side.

Comment 9 Shirly Radco 2016-12-20 07:08:37 UTC
This issue was addressed by caching the queries every 5 minutes.

Comment 10 Oved Ourfali 2016-12-20 07:13:15 UTC
I must admit I think the root cause should be fixed, as this pre-fetching can cause load on the DB if the query is inefficient.