Bug 1859921

Summary:

GenericApiGWTService causing additional load on engine

Product:

[oVirt] ovirt-engine

Reporter:

mlehrer

Component:

General

Assignee:

Hilda Stastna <hstastna>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Pavel Novotny <pnovotny>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

4.4.1.8

CC:

bugs, dfodor, gdeolive, pnovotny, sgratch

Target Milestone:

ovirt-4.4.6

Keywords:

Performance

Target Release:

---

Flags:

sgratch: ovirt-4.4?
sgratch: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

ovirt-engine-4.4.6.6

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-05-14 07:28:20 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1171924

Bug Blocks:

Attachments:

Description	Flags
Collection of trace html reports for idle vm search page	none

Description mlehrer 2020-07-23 10:12:15 UTC

Description of problem:
Open the VM page, the host page, or any other populated view (hosts / VMs) in a large environment, perform one query (any query), and let the page idle. In the background, GenericApiGWTService will continually issue POST requests to check for updates on asset changes at the interval set by the user.

The problem is that most users will not change this interval, and GenericApiGWTService should not be continually checking for updates on page assets to be refreshed. Additionally, and the reason for this bz, it puts an increasing load on postgres for no real need. If the user performs a moderately costly UI query and simply idles on the page, the query is re-run in the background at full cost to postgres at every UI refresh interval. Worse, if the query takes longer than the refresh interval to execute, the same query can be executing simultaneously, sometimes even in triplicate, leading to a bunching effect on the database.

Postgres becomes affected and then the engine slows down, leading to high CPU utilization.
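
To make the "triplicate" effect concrete, here is a rough back-of-the-envelope sketch. It assumes the UI fires a new poll every refresh interval without waiting for the previous one to finish, which is what the description above implies; the function name and the example durations are made up for illustration.

```python
import math


def max_concurrent_runs(query_seconds: float, refresh_seconds: float) -> int:
    """Upper bound on copies of the same query running at the same time.

    Assumes a new run starts every refresh_seconds and each run lasts
    query_seconds, regardless of whether earlier runs have finished.
    """
    if query_seconds <= 0:
        return 0
    return math.ceil(query_seconds / refresh_seconds)


# Hypothetical durations, not measured values from this bug.
for query_seconds in (4, 10, 12):
    print(query_seconds, max_concurrent_runs(query_seconds, refresh_seconds=5))
```

Under these assumptions a 12s query on a 5s refresh interval can have up to three copies in flight, while the same query on a 10s interval has at most two.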

The problem is the following:
1) GenericApiGWTService runs by default every 5s.
The default should be 30s, maybe higher?
2) Continually polling for asset changes in the view is ineffective.
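
For a sense of scale, the arithmetic behind point 1 can be sketched as below. It only counts how many times one idle browser tab would re-run its search per hour at a given refresh interval; the function name is illustrative, and the real cost depends on the query itself.

```python
def polls_per_hour(refresh_seconds: int) -> int:
    """Number of background refreshes one idle view issues in an hour."""
    return 3600 // refresh_seconds


for interval in (5, 10, 30, 60):
    print(f"{interval:>2}s interval: {polls_per_hour(interval)} polls per hour")
```

So one idle view at 5s re-runs its search 720 times an hour, versus 120 times at 30s.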

Version-Release number of selected component (if applicable):
rhv-4.4.1-11

How reproducible:
Reproducible

Steps to Reproduce:
1. Open UI view
2. Perform Query
3. Idle

Actual results:

In the background, GenericApiGWTService will continually issue POST requests to check for updates on asset changes at the interval set by the user.

Postgres will continue to run the original search at every interval; the impact on postgres depends on the complexity of the query and on the other concurrent work postgres is doing. The basic result is a busier engine.
Look at vmstat 1 to see the impact on CPU utilization.

Expected results:
Less impact on postgres; GenericApiGWTService should query postgres less frequently. The default refresh interval of GenericApiGWTService updates should be greatly increased; even consider removing the 5s and 10s refresh options.

Additional info:
This bug may at first glance seem trivial, but on a loaded system doing work we do not have cores to spare to re-run queries because a user has an idle browser; it creates unnecessary load. Please consider improving this or reducing the load caused by GenericApiGWTService refreshes.

BZ http://bugzilla.redhat.com/show_bug.cgi?id=1858638 is an example of how a poorly performing query left idle can have a very big impact on a system simply from UI refreshes.

Comment 1 Sharon Gratch 2020-08-02 18:20:02 UTC

There are a few options to handle this:
We need to dig into the code in order to understand why there are too many queries with an idle browser. We are pretty sure that no changes were made to the 4.4 frontend UI, so I guess it's not a regression. Also, AFAIK no customer has complained about that.
There might be changes in backend query complexity that influenced the postgres load. This requires investigation anyway.

Was there any filter/search query used for reproducing the issue? Or is it reproduced even when there is no filter used (search field is empty)? If there was a query used then maybe it was a heavy one that made postgres more loaded.
 
Anyway, we can consider supporting a "dynamic refresh interval" solution:
we can gradually increase the refresh interval above the 5 sec default only when the browser has been idle for a period of time. We don't want to increase it unconditionally, since the user wants an up-to-date and accurate view, and waiting more than 5 secs with an active UI seems too long.
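
A minimal sketch of what such a dynamic interval could look like, purely as an illustration and not the webadmin implementation. The function name, the 60s idle threshold, the doubling rule, and the 60s cap are all assumptions chosen for the example.

```python
def next_refresh_interval(idle_seconds: float,
                          base: int = 5,
                          maximum: int = 60,
                          idle_threshold: int = 60) -> int:
    """Return the refresh interval (seconds) for a view idle for idle_seconds.

    While the user is active (idle for less than idle_threshold) the base
    interval is kept. After that the interval doubles once per full
    idle_threshold of inactivity, capped at maximum.
    """
    if idle_seconds < idle_threshold:
        return base
    # Cap the exponent so a tab left open for days does not build a huge number.
    doublings = min(int(idle_seconds // idle_threshold), 10)
    return min(maximum, base * 2 ** doublings)


for idle in (0, 60, 120, 180, 240):
    print(idle, next_refresh_interval(idle))
```

Any user activity would reset idle_seconds to 0, so an active UI keeps the 5 sec refresh.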

Another solution is to consider increasing the refresh interval default to more than 5 secs only in case a (heavy) search/query is used.
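
One possible way to decide what counts as "heavy" is to derive the interval from how long the last run of the search took. The sketch below is only an illustration of that idea; the function name, the safety factor of 2, and the 60s cap are assumptions, not anything taken from the engine code.

```python
def interval_for_query(last_query_seconds: float,
                       base: float = 5.0,
                       safety_factor: float = 2.0,
                       maximum: float = 60.0) -> float:
    """Pick a refresh interval based on the cost of the previous run.

    Cheap queries keep the base interval. Expensive ones are refreshed no
    more often than safety_factor times their own duration, so a new run
    does not start while the previous one is still executing.
    """
    return min(maximum, max(base, safety_factor * last_query_seconds))


# Hypothetical query durations in seconds.
for duration in (0.2, 4.0, 12.0, 100.0):
    print(duration, interval_for_query(duration))
```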

Comment 2 Sharon Gratch 2020-08-04 11:28:01 UTC

(In reply to Sharon Gratch from comment #1)
> Was there any filter/search query used for reproducing the issue? Or is it
> reproduced even when there is no filter used (search field is empty)? If
> there was a query used then maybe it was a heavy one that made postgres more
> loaded.

@Mordechai, can you please reply to the above? Or maybe just send a screenshot of the browser view? Thanks.

Comment 3 mlehrer 2020-08-04 12:22:38 UTC

(In reply to Sharon Gratch from comment #1)
> There are few options to handle this:
> We need to dig into the code in order to understand why there are too many
> queries with an idle browser. We are pretty sure that no changes were done
> on 4.4 frontend UI so I guess it's not a regression.  And also AFAIK no
> customer complained about that. 

Not saying this is a regression based on current data.

> There might be changes on backend queries complexity that influenced the
> postgres load. This anyway requires investigation.
> 
> Was there any filter/search query used for reproducing the issue? Or is it
> reproduced even when there is no filter used (search field is empty)? If

It should reproduce regardless of whether the query filter is empty or used; the more painful the query, the bigger the impact.
When our scale system is back up (should be in a few days), I will provide updated info to this bz.

> there was a query used then maybe it was a heavy one that made postgres more
> loaded.

The total impact of constant querying from GenericApiGWTService is correlated with the cost of the query being run.

>  
> Anyway, we can consider supporting a "dynamic refresh interval" solution:
> we can gradually increasing the refresh interval default to more than 5 secs
> only in case the browser was idle for a period of time. We don't want to
> increase it anyway since the user wants to get an up to date and accurate
> view and it seems that waiting for more than 5 secs with an active UI is too
> long.
Agreed, I would prefer a fix to the issue rather than forcing users to have to wait longer for UI refreshes.

> 
> Another solution is to consider increasing the refresh interval default to
> more than 5 secs only in case a (heavy) search/query is used.

Once we have the scale lab back up, I will compile some traces and ping you offline to give you the full picture; then you'll be able to suggest what works best.
Leaving the needinfo on me until the traces are supplied.

Comment 4 Sharon Gratch 2020-10-12 10:03:35 UTC

(In reply to mlehrer from comment #3)

> > 
> > Another solution is to consider increasing the refresh interval default to
> > more than 5 secs only in case a (heavy) search/query is used.
> 
> once we have the scale lab back up I will compile some traces and ping you
> offline to give you the full picture, then you'll be able to suggest what
> works best.
> Leaving the needinfo on me until the traces are supplied.

Mordechai, any update on this?

Comment 5 mlehrer 2020-10-13 08:42:05 UTC

Created attachment 1721143 [details]
Collection of trace html reports for idle vm search page

The uploaded attachment contains several (single-page) HTML reports that are individually zipped. Each report corresponds to a 'slow trace' event. Each slow trace event corresponds to the API call listed in the report, in this case:
/ovirt-engine/webadmin/GenericApiGWTService

Open the html report in any browser and note the following:

The Breakdown section shows how much time is spent in HTTP, in JDBC query time, or getting a connection; the count row shows how many times each was executed.

Click on or expand "Query Stats" to see a list of the unique queries that were run and how long they took, all initiated by this specific instance of the /ovirt-engine/webadmin/GenericApiGWTService API call.

In a larger setup the response times and query durations are far worse than what's shown in these examples, as we have fewer assets loaded here; what's important is to show which queries are being run and how many, and these reports show that.

Lastly, there's a png file which shows an overview of the traces happening while the VM window is open and a search was made.

In this example the search was: 'cluster = L0_Group_0 and host = f0*'
Traces were generated on a system with 3122 VMs and 272 hosts; on 5000 VMs and 500 hosts the same traces simply take longer, as there are more assets.

Please reach out if you have any questions about the reports.

Comment 6 Sharon Gratch 2021-01-13 14:03:04 UTC

After discussing this issue, we suggest that as a first-phase solution we start by increasing the default refresh interval from 5 sec to 10 sec for all grid tables, regardless of whether a filtering query exists.
The user will be able to change the default back to 5 sec manually or via user settings configuration (should be implemented as part of user settings https://bugzilla.redhat.com/show_bug.cgi?id=1171924)
This is an easy solution that might decrease the load without too much effect on the user experience.
We can start with that and check whether the other solutions suggested above are required.

Comment 8 Sharon Gratch 2021-04-27 22:51:23 UTC

Please note that the fix is as detailed in comment 6: increasing the default refresh interval from 5 sec to 10 sec.

Comment 9 Pavel Novotny 2021-05-10 17:05:05 UTC

Verified in
ovirt-engine-4.4.6.6-0.10.el8ev.noarch
ovirt-engine-webadmin-portal-4.4.6.6-0.10.el8ev.noarch

The default data table refresh interval is now 10 seconds (changing this value is not permanent, but this issue is handled in a separate task).
The GenericApiGWTService API calls reflect this value. I tried all the refresh interval options (5, 10, 20, 30, 60 seconds), and the API calls were fired at the matching interval.
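
For anyone wanting to repeat this kind of check from captured request times (for example, copied from the browser's network tab), a small sketch follows. It is not the procedure used for the verification above; the function names, the 1 second tolerance, and the sample timestamps are made up for illustration.

```python
def observed_intervals(timestamps: list[float]) -> list[float]:
    """Gaps in seconds between consecutive request timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def fires_at_interval(timestamps: list[float],
                      expected_seconds: float,
                      tolerance: float = 1.0) -> bool:
    """True if every gap between requests is within tolerance of the expected interval."""
    gaps = observed_intervals(timestamps)
    return bool(gaps) and all(abs(gap - expected_seconds) <= tolerance for gap in gaps)


# Hypothetical capture: seconds since the first GenericApiGWTService POST.
sample = [0.0, 10.1, 20.0, 30.2]
print(fires_at_interval(sample, expected_seconds=10))
print(fires_at_interval(sample, expected_seconds=5))
```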