Description of problem:

The customer recently upgraded from RHVM 4.1 to 4.2. After the upgrade they observed many postgresql processes, and CPU load spiked abnormally to around 60 percent. The customer was able to restore performance on the Manager by disabling the vacuum option in postgresql.conf and by stopping DWH along with the Manager; these two accounted for most of the postgresql processes observed after the upgrade.

In the engine log, the same signature as in https://bugzilla.redhat.com/show_bug.cgi?id=1666610 is found, occurring several times a second:

~~~~
2020-01-25 07:45:59,521-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,522-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126'
2020-01-25 07:45:59,522-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,564-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b' is missing 1 prestarted VMs, attempting to prestart 1 VMs
2020-01-25 07:45:59,565-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b'
2020-01-25 07:45:59,565-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,661-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,661-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191'
~~~~

The customer has disabled DWH, and the Manager has 16 CPUs:

~~~~
cat cpuinfo | grep processor | wc -l
16
~~~~

https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000je8n4QAA&sfdc.override=1

Logs:
RHVM sosreport-20200127-195113
0020-engine_backup

I understand a fix is being sought in BZ# 1666610. I would like to know whether the recommendation of setting "VdsRefreshRate to 10" can be applied here too, to try to reduce the query volume.
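To quantify how often this signature fires, one option is to bucket the "Failed to prestart" warnings by second. A minimal sketch, assuming engine.log lines in the format shown above (the function name is illustrative, not part of any oVirt tooling):

```python
import re
from collections import Counter

# Matches the timestamp (to the second) of VmPoolMonitor "Failed to prestart" lines.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\S* WARN.*VmPoolMonitor.*Failed to prestart"
)

def prestart_failures_per_second(lines):
    """Count 'Failed to prestart' warnings grouped by second of occurrence."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Example with lines taken from the engine log excerpt above:
sample = [
    "2020-01-25 07:45:59,522-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] "
    "(DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool "
    "'2a253524-3e07-4b00-ae6c-742f58ad9126'",
    "2020-01-25 07:45:59,565-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] "
    "(DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool "
    "'3a5968e2-ee7b-41c5-8bc5-4568c08e858b'",
]
print(prestart_failures_per_second(sample))
```

Running this over the full engine.log would show how many warnings per second the pool monitor emits, which helps correlate the log spam with the CPU spikes.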
As a first step in dealing with VM Portal generating a lot of REST calls, a few changes have been made: https://github.com/oVirt/ovirt-web-ui/pull/1238
For webadmin this is most likely bug 1845747, which has been fixed for 4.4 and 4.3.11.
For 4.4.1 there's a partial VM Portal fix too (https://github.com/oVirt/ovirt-web-ui/issues/1240). We haven't measured the difference; it could be significant and solve the problem already, or maybe not. We are targeting further improvement in 4.4.2, so we'll keep the bug open.
Tested version:
rhv-release-4.4.2-4
redhat-release-8.2-25.0
ovirt-engine-4.4.2.3-0.6
ovirt-web-ui-1.6.4-1

Flow:
Open 50 sessions of VM Portal and scroll down. After login to VM Portal, 20 VMs are loaded; each scroll-down triggers loading of another 20 VMs. When tested on the older version (4.4.1), VMs kept loading until the end of the list was reached.

Results:
hosted-engine (16 cores & 32GB) usage: 95% CPU and 20GB RAM
engine usage: 7% CPU, 5GB memory
postgres usage: 90% CPU, 6.5GB memory
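The scroll flow above implies a simple request-volume model: one VM-list API request per 20-VM page, per session. A rough sketch, where the 50-session and 20-per-scroll figures come from the flow above and the total VM count is an illustrative parameter:

```python
import math

def portal_scroll_requests(total_vms, sessions=50, page_size=20):
    """Rough count of VM-list API requests generated when every session
    scrolls until all VMs have been loaded (one request per page per session)."""
    pages_per_session = math.ceil(total_vms / page_size)
    return sessions * pages_per_session

# Illustrative: with 1000 VMs, each session issues 50 page requests,
# so 50 concurrent sessions generate 2500 list requests in total.
print(portal_scroll_requests(1000))
```

Even before considering the per-request cost on the database, this shows why 50 concurrently scrolling sessions multiply the load so quickly.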
Just adding to the previous comment 18: the number of VMs requested is reduced by the dev fix, but the main issue is the following SQL query:

select * from getdisksvmguid

which takes about 19s and is executed 200 times from the single API call '/ovirt-engine/api/vms;max=100 follow=graphics_consoles' issued by the VM Portal. When multiple instances of this query are generated from multiple VM Portal calls, PostgreSQL gets saturated by the concurrent API requests.
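A back-of-the-envelope sketch of why this saturates the database, using the figures from the comment above (19 s per query, 200 executions per API call); the concurrency level is an illustrative parameter:

```python
def db_query_seconds(api_calls, queries_per_call=200, seconds_per_query=19):
    """Total database work (in query-seconds) generated by VM Portal API
    calls that each fan out into many getdisksvmguid executions."""
    return api_calls * queries_per_call * seconds_per_query

# A single portal call alone demands 200 * 19 = 3800 query-seconds of DB time;
# 50 concurrent portal sessions push that to 190000 query-seconds, far beyond
# what a 16-core manager can absorb without PostgreSQL saturating.
print(db_query_seconds(1), db_query_seconds(50))
```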
We should eliminate the whole follow=graphics_consoles; its whole reason for existence is just the SPICE/VNC console selection, and it would be best to simplify that into no choice, i.e. just select one internally (or via system/user settings). That would take care of both the getdisksvmguid query and an extra API call for each VM.
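For illustration, a sketch of the two request shapes involved. The base path and `max` form follow the call quoted in the previous comment; the exact `follow` parameter syntax on the wire may differ, and the helper function itself is hypothetical:

```python
def vms_list_url(max_results=100, follow=None):
    """Build an engine REST path for listing VMs, optionally asking the
    server to follow a linked sub-collection such as graphics_consoles."""
    url = f"/ovirt-engine/api/vms;max={max_results}"
    if follow:
        url += f"?follow={follow}"
    return url

# Current portal behaviour: the follow triggers the per-VM getdisksvmguid work.
print(vms_list_url(follow="graphics_consoles"))
# Proposed simplification: no follow, console type chosen internally.
print(vms_list_url())
```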
(In reply to Michal Skrivanek from comment #20)
> we should eliminate the whole follow=graphics_consoles, it's whole reason
> for existence is just the spice/vnc console selection and that would be best
> to simplify into no choice, just select one internally (or via system/user
> settings)
>
> that would take care of both the getdisksvmguid query but also an extra API
> call for each VM

I'm re-targeting this for 4.4.4 since we won't have time to complete that for 4.4.3.
Target milestone is set to 4.4.4; is this still accurate?
This bug is in NEW status for ovirt 4.4.4. We are now in the blocker-only phase; please either mark this as a blocker or re-target.
(In reply to Sandro Bonazzola from comment #25) > This bug is in NEW status for ovirt 4.4.4. We are now in blocker only phase, > please either mark this as a blocker or please re-target. re-targeted
(In reply to mlehrer from comment #24)
> Target milestone is set to 4.4.4 is this still accurate?

This is still in progress, so postponing to 4.4.5.
#Summary
50 users continually scrolling concurrently creates a moderate load, but both PostgreSQL and overall CPU utilization are reduced compared with the previous version's testing.

#How was this tested
A Puppeteer script simulates 50 users, each with a browser that logs into VM Portal and keeps scrolling down the page every few seconds. Users are loaded every few seconds and continue to scroll down the page. Backend actions taking over 1 second are collected by Glowroot; resources are monitored by nmon. Basic webadmin functionality was checked during the peak 50-user scroll-load scenario.

#Env
4215 VMs and 260 hosts
rhv-release-4.4.5-11-001.noarch
ovirt-web-ui-1.6.8-1.el8ev.noarch

#Findings
PostgreSQL CPU utilization reduced by about 30% compared to the previous test.
Overall engine CPU utilization reduced, with average run-queue length cut in half compared to the previous test.
Webadmin actions may degrade by a few seconds during peak load.

Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] 0-day security, bug fix, enhance), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1186