Bug 1345944

Summary:	[scale] - dashboard landing page generate inefficient query - COALESCE(AVG(cpu_per_socket)
Product:	[oVirt] ovirt-engine-dashboard	Reporter:	Eldad Marciano <emarcian>
Component:	Core	Assignee:	Nobody <nobody>
Status:	CLOSED DEFERRED	QA Contact:	Eldad Marciano <emarcian>
Severity:	high	Docs Contact:
Priority:	high
Version:	unspecified	CC:	amureini, bugs, emarcian, gklein, mperina, oourfali, sradco, ylavi
Target Milestone:	---	Flags:	rule-engine: planning_ack+ rule-engine: devel_ack+ pnovotny: testing_ack+
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-12-20 07:08:37 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Metrics	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1365443
Bug Blocks:

Description Eldad Marciano 2016-06-13 13:20:45 UTC

Description of problem:
dashboard landing generate the following long query

exec time: 46.2

SQL:
SELECT vm_name AS name, COALESCE(AVG(cpu_per_socket), 0) AS cpu_per_socket, COALESCE(AVG(number_of_sockets), 0) AS number_of_sockets, COALESCE(AVG(cpu_usage_percent), 0) AS cpu_usage_percent, MAX(previous_cpu_percent) AS previous_cpu_percent FROM v4_0_statistics_vms_resources_usage_samples samples INNER JOIN v4_0_configuration_history_vms vms ON samples.vm_id = vms.vm_id LEFT JOIN (SELECT samples.vm_id, COALESCE(AVG(cpu_usage_percent), 0) AS previous_cpu_percent FROM v4_0_statistics_vms_resources_usage_samples samples INNER JOIN v4_0_configuration_history_vms vms ON samples.vm_id = vms.vm_id WHERE samples.vm_status = 1 AND history_datetime >= (CURRENT_TIMESTAMP - INTERVAL '10 minute') AND history_datetime < (CURRENT_TIMESTAMP - INTERVAL '5 minute') AND vms.history_id IN (SELECT MAX(a.history_id) FROM v4_0_configuration_history_vms a GROUP BY a.vm_id) GROUP BY samples.vm_id) AS previous_trend ON vms.vm_id = previous_trend.vm_id WHERE samples.vm_status = 1 AND history_datetime >= (CURRENT_TIMESTAMP - INTERVAL '5 minute') AND history_datetime < CURRENT_TIMESTAMP AND vms.history_id IN (SELECT MAX(a.history_id) FROM v4_0_configuration_history_vms a GROUP BY a.vm_id) GROUP BY vm_name ORDER BY cpu_usage_percent DESC LIMIT 10

detailed report:
https://mojo.redhat.com/docs/DOC-1083326


Version-Release number of selected component (if applicable):
4.0.7

How reproducible:
100%

Steps to Reproduce:
1. large scale env (39 hosts, 3K vms)
2. fetch the dashboard.
3

Actual results:
very long response time and sometimes the UI get stuck 

Expected results:
fast response time, stable UI

Additional info:

Comment 1 Martin Perina 2016-07-04 13:10:41 UTC

Patch is included in ovirt-engine-4.0.1 branch, moving to ON_QA

Comment 2 Pavel Novotny 2016-07-07 08:23:01 UTC

Moving verification to Scale team, as they have the appropriate environment for this.

Comment 3 Eldad Marciano 2016-08-09 11:38:13 UTC

verified failed.

this query runs for 7.5 sec.
LOG:  duration: 7529.091 ms  execute <unnamed>: SELECT vm_name AS name, COALESCE(AVG(cpu_per_socket), 0) AS cpu_per_socket, COALESCE(AVG(number_of_sockets), 
0) AS number_of_sockets, COALESCE(AVG(cpu_usage_percent), 0) AS cpu_usage_percent, MAX(previous_cpu_percent) AS previous_cpu_percent FROM v4_0_statistics_vms
_resources_usage_samples AS samples INNER JOIN v4_0_configuration_history_vms AS vms ON samples.vm_id = vms.vm_id LEFT JOIN (SELECT samples.vm_id, COALESCE(A
VG(cpu_usage_percent), 0) AS previous_cpu_percent FROM v4_0_statistics_vms_resources_usage_samples AS samples INNER JOIN v4_0_configuration_history_vms AS vm
s ON samples.vm_id = vms.vm_id WHERE samples.vm_status = 1 AND samples.history_id IN (SELECT MAX(history_id) FROM v4_0_statistics_vms_resources_usage_samples
 AS a WHERE history_datetime >= CURRENT_TIMESTAMP - INTERVAL '10 minute' AND history_datetime <= CURRENT_TIMESTAMP - INTERVAL '5 minute' GROUP BY a.vm_id) GR
OUP BY samples.vm_id ) AS previous_trend ON vms.vm_id = previous_trend.vm_id WHERE samples.vm_status = 1 AND samples.history_id IN ( SELECT MAX(history_id) F
ROM v4_0_statistics_vms_resources_usage_samples AS a WHERE history_datetime >= CURRENT_TIMESTAMP - INTERVAL '5 minute' AND history_datetime <= CURRENT_TIMEST
AMP GROUP BY a.vm_id ) GROUP BY vm_name ORDER BY cpu_usage_percent DESC LIMIT 10

Comment 4 Oved Ourfali 2016-08-10 12:28:28 UTC

I don't think we can improve this further. 
Are you running with the original number of objects reported in the bug description? 
How long does it take the dashboard to load?

Comment 5 Eldad Marciano 2016-08-10 19:58:44 UTC

(In reply to Oved Ourfali from comment #4)
> I don't think we can improve this further. 
> Are you running with the original number of objects reported in the bug
> description? 
> How long does it take the dashboard to load?

it takes around 40sec to load.
we should re test it with 3.1K vms

Comment 6 Eldad Marciano 2016-09-04 10:41:31 UTC

still reproduced on 4.0.3-1

Comment 7 Oved Ourfali 2016-11-09 10:02:20 UTC

As I understand it is a lot of DWH efforts, it will be fixed by caching.
Moving to future for now.

Comment 8 Oved Ourfali 2016-11-23 09:24:00 UTC

Moving to DWH for their decision.
For now we will focus on caching on the UX side.

Comment 9 Shirly Radco 2016-12-20 07:08:37 UTC

This issue was addressed by caching the queries every 5 minutes.

Comment 10 Oved Ourfali 2016-12-20 07:13:15 UTC

I must admit I think the root cause should be fixed, as this pre-fetching can cause load on the DB if the query is inefficient.