Bug 1880698

Summary: Performance of the `QueryBrowser` graphs has degraded since 4.5
Product: OpenShift Container Platform
Reporter: Andrew Pickering <anpicker>
Component: Monitoring
Assignee: Andrew Pickering <anpicker>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Priority: medium
Version: 4.6
CC: alegrand, anpicker, bpeterse, erooth, juzhao, kakkoyun, lcosic, mloibl, pkrupa, spadgett, surbania
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2020-10-27 16:42:58 UTC
Type: Bug
Bug Blocks: 1804922    

Description Andrew Pickering 2020-09-19 03:39:32 UTC
Graphs on the Alerting, Metrics, and Dashboards pages in both the Dev and Admin perspectives use the `QueryBrowser` component to render their graphs. The render time of this component has increased since the 4.5 release.

Comment 2 hongyan li 2020-09-24 02:28:47 UTC
I tested with the Performance tab of Chrome developer tools. Your fix is in build 4.6.0-0.nightly-2020-09-21-230455.

1. Installed an OCP cluster with payload 4.6.0-0.nightly-2020-09-21-093308, which doesn't include your fix.
2. Installed an OCP cluster with payload 4.6.0-0.nightly-2020-09-23-022756, which includes your fix.
3. Ran an example query and collected performance data; saw no performance improvement:
       sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by (alertname))
4. Ran a query that returns a large amount of data and collected performance data; filed product bug https://bugzilla.redhat.com/show_bug.cgi?id=1880698
       cluster_quantile:apiserver_request_duration_seconds:histogram_quantile


I collected the following data five times for each environment. I expected the Painting or Rendering time to improve, but saw no improvement in any of the runs:
38 ms Loading
3204 ms Scripting
422 ms Rendering
24 ms Painting
561 ms System
4622 ms Idle
8871 ms Total
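As a quick arithmetic check on the summary above, here is a minimal Python sketch (values copied from this comment; the variable names are illustrative). It confirms that the categories add up to the reported Total and shows how the non-idle time is split. It treats "non-idle" as Total minus Idle, which is an assumption about how to read the DevTools summary.

```python
# Chrome DevTools Performance summary for one run, copied from Comment 2 (ms).
summary_ms = {
    "Loading": 38,
    "Scripting": 3204,
    "Rendering": 422,
    "Painting": 24,
    "System": 561,
    "Idle": 4622,
}

total_ms = sum(summary_ms.values())       # should equal the reported Total
busy_ms = total_ms - summary_ms["Idle"]   # assumption: non-idle = Total - Idle
scripting_share = summary_ms["Scripting"] / busy_ms
render_paint_share = (summary_ms["Rendering"] + summary_ms["Painting"]) / busy_ms

print(f"Total: {total_ms} ms, non-idle: {busy_ms} ms")
print(f"Scripting: {scripting_share:.1%} of non-idle time")
print(f"Rendering + Painting: {render_paint_share:.1%} of non-idle time")
```

Scripting accounts for roughly 75% of the non-idle time, while Rendering and Painting together are about 10%. That may explain why the later comments compare Scripting time rather than Painting or Rendering time.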

Comment 3 hongyan li 2020-09-24 03:37:08 UTC
Full test results:
https://files.slack.com/files-pri/T027F3GAJ-F01BZTTSYJC/image.png

Comment 4 hongyan li 2020-09-24 03:42:31 UTC
The following bug is not caused by the current fix. 
https://bugzilla.redhat.com/show_bug.cgi?id=1880698

I am reopening this bug because my tests show no performance improvement.

Comment 5 hongyan li 2020-09-24 04:46:21 UTC
I will run the performance tests again on two clusters with the same data series.

Comment 6 hongyan li 2020-09-24 07:55:59 UTC
I launched Chrome in incognito mode and collected performance data with Chrome developer tools. The test results are below; they show that performance improves with the fix for this bug.

sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by (alertname))

Fix not included (2 time series):
3204 ms Scripting
3160 ms Scripting
3007 ms Scripting
2863 ms Scripting
2806 ms Scripting
Total: 15040 ms

Fix included (2 time series):
2044 ms Scripting
2224 ms Scripting
2314 ms Scripting
2359 ms Scripting
2468 ms Scripting
Total: 11409 ms

topk(5, cluster_quantile:apiserver_request_duration_seconds:histogram_quantile)

Fix not included (634 time series):
13479 ms Scripting
4164 ms Scripting
11556 ms Scripting
5993 ms Scripting
5756 ms Scripting
Total: 40948 ms

Fix included (742 time series):
8457 ms Scripting
9596 ms Scripting
3134 ms Scripting
3856 ms Scripting
8953 ms Scripting
Total: 33996 ms
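To make the run-to-run spread in these numbers visible, here is a minimal Python sketch that summarizes each group of five runs (mean, min, max) and computes the drop in mean Scripting time. The data is copied from this comment; the helper names `summarize` and `mean_reduction` are hypothetical.

```python
from statistics import mean

# Scripting time (ms) for five runs per build, copied from Comment 6.
ALERTS_NO_FIX = [3204, 3160, 3007, 2863, 2806]
ALERTS_FIX = [2044, 2224, 2314, 2359, 2468]
TOPK_NO_FIX = [13479, 4164, 11556, 5993, 5756]  # 634 time series
TOPK_FIX = [8457, 9596, 3134, 3856, 8953]       # 742 time series


def summarize(runs):
    """Return (mean, min, max) Scripting time for a list of runs."""
    return mean(runs), min(runs), max(runs)


def mean_reduction(before, after):
    """Fractional drop in mean Scripting time from `before` to `after`."""
    return 1 - mean(after) / mean(before)


for label, before, after in [
    ("ALERTS query", ALERTS_NO_FIX, ALERTS_FIX),
    ("topk query", TOPK_NO_FIX, TOPK_FIX),
]:
    b_mean, b_min, b_max = summarize(before)
    a_mean, a_min, a_max = summarize(after)
    print(
        f"{label}: mean {b_mean:.0f} -> {a_mean:.0f} ms "
        f"(range {b_min}-{b_max} -> {a_min}-{a_max}), "
        f"{mean_reduction(before, after):.0%} lower"
    )
```

The mean drops by about 24% for the ALERTS query and about 17% for the topk query. However, the topk runs vary by more than a factor of three within each build, and the two builds returned different numbers of time series (634 vs. 742). Five runs therefore give only a rough comparison.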

Comment 7 Samuel Padgett 2020-09-25 13:33:15 UTC
We have a follow-on fix that improves performance a bit more. Moving back to ASSIGNED.

Comment 9 hongyan li 2020-09-29 07:15:09 UTC
Tested with payload 4.6.0-0.nightly-2020-09-28-171716

sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by (alertname)) (5 time series):
2062 ms Scripting
2117 ms Scripting
2187 ms Scripting
2251 ms Scripting
1756 ms Scripting
Total: 10373 ms

cluster_quantile:apiserver_request_duration_seconds:histogram_quantile (858 time series):
2161 ms Scripting
1550 ms Scripting
2761 ms Scripting
2350 ms Scripting
2587 ms Scripting
Total: 11409 ms

Comment 10 hongyan li 2020-09-29 07:18:27 UTC
The results show that performance improves greatly when the query returns a large amount of data.
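The claim in Comment 10 can be roughly quantified by comparing the five-run Scripting totals from the 2020-09-28 payload (Comment 9) against the no-fix totals from Comment 6. This minimal Python sketch assumes that the 2020-09-28 payload includes the follow-on fix mentioned in Comment 7. The dictionary keys and the `reduction` helper are hypothetical names.

```python
# Total Scripting time (ms) over five runs, copied from Comments 6 and 9.
# Assumption: the 2020-09-28 payload (Comment 9) includes the follow-on fix.
TOTALS_MS = {
    "ALERTS query": {"no fix": 15040, "follow-on fix": 10373},
    "apiserver quantile query": {"no fix": 40948, "follow-on fix": 11409},
}


def reduction(totals, query):
    """Fractional drop in total Scripting time from the no-fix build to the follow-on build."""
    by_build = totals[query]
    return 1 - by_build["follow-on fix"] / by_build["no fix"]


for query in TOTALS_MS:
    print(f"{query}: {reduction(TOTALS_MS, query):.0%} less Scripting time")
```

This gives about 31% for the ALERTS query and about 72% for the apiserver quantile query, which is consistent with the larger query benefiting more. The comparison is only indicative. Comment 9 ran the recording-rule query without `topk(5, ...)` and returned 858 time series, whereas Comment 6 returned 634, and the clusters differed.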

Comment 11 Samuel Padgett 2020-10-01 12:18:18 UTC
*** Bug 1795401 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2020-10-27 16:42:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196