Bug 1795457 - RHV-M causing high load on PostgreSQL DB after upgrade to 4.2
Summary: RHV-M causing high load on PostgreSQL DB after upgrade to 4.2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-web-ui
Version: 4.2.8
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.5-1
Assignee: Ben Amsalem
QA Contact: David Vaanunu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-28 02:00 UTC by hhaberma
Modified: 2023-10-06 19:04 UTC
CC List: 11 users

Fixed In Version: ovirt-web-ui-1.6.8-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-14 11:43:08 UTC
oVirt Team: UX
Target Upstream Version:
Embargoed:




Links
GitHub oVirt ovirt-web-ui issue 1240 (closed): Change VM/Pool fetch page size and optimize refresh scheduling (last updated 2021-01-31 11:39:32 UTC)
GitHub oVirt ovirt-web-ui issue 1242 (closed): Fix VM and Pool pagination handling and have infinite scroll use it properly (last updated 2021-01-31 11:39:32 UTC)
GitHub oVirt ovirt-web-ui pull 1238 (closed): Update fetch page size, refactor background refresh sagas (last updated 2021-01-31 11:39:33 UTC)
GitHub oVirt ovirt-web-ui pull 1342 (open): Change Consoles data fetching time - fetch Consoles data on demand instead of fetch on login (last updated 2021-01-31 11:39:33 UTC)
Red Hat Knowledge Base (Solution) 4937011 (last updated 2020-03-30 10:41:07 UTC)
Red Hat Product Errata RHSA-2021:1186 (last updated 2021-04-14 11:43:43 UTC)
oVirt gerrit 112536 (master, MERGED): packaging: Add new value ClientModeConsoleDefault to config DB table (last updated 2021-02-02 13:47:35 UTC)
oVirt gerrit 114302 (master, MERGED): restapi: retrieve current graphic consoles (last updated 2021-05-15 15:51:36 UTC)

Description hhaberma 2020-01-28 02:00:42 UTC
Description of problem:

The customer recently upgraded from RHV-M 4.1 to 4.2. After the upgrade they observed many PostgreSQL processes, and CPU load rose abnormally to around 60 percent.

The customer was able to restore Manager performance by disabling the vacuum option in postgresql.conf and by stopping the DWH service along with the Manager; these two accounted for most of the PostgreSQL processes observed after the upgrade.
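For reference, the workaround roughly corresponds to the sketch below (the service name, postgresql.conf path, and the assumption that "vacuum" means the autovacuum setting are mine and may differ per RHV version):

~~~~
# Stop the Data Warehouse service that shares the Manager's database.
systemctl stop ovirt-engine-dwhd

# Inspect the autovacuum setting (path is an assumption; RHV 4.2 ships an SCL PostgreSQL).
grep -E '^#?\s*autovacuum' /var/lib/pgsql/data/postgresql.conf

# After setting "autovacuum = off" in postgresql.conf, restart PostgreSQL.
systemctl restart postgresql
~~~~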

In the engine log, the same signature as in https://bugzilla.redhat.com/show_bug.cgi?id=1666610 appears, occurring several times per second.

~~~~
2020-01-25 07:45:59,521-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,522-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126'
2020-01-25 07:45:59,522-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,564-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b' is missing 1 prestarted VMs, attempting to prestart 1 VMs
2020-01-25 07:45:59,565-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b'
2020-01-25 07:45:59,565-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,661-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,661-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191'
~~~~

The customer has disabled the DWH; the Manager machine has 16 CPUs.
~~~~
cat /proc/cpuinfo | grep processor | wc -l
16
~~~~

https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000je8n4QAA&sfdc.override=1

Logs:
RHVM sosreport-20200127-195113
0020-engine_backup

I understand a fix is being sought in BZ#1666610. I would like to know whether the recommendation to set "VdsRefreshRate to 10" can be applied here too, to try to reduce the query volume?
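For reference, engine options like this are normally inspected and changed with engine-config; a minimal sketch (whether VdsRefreshRate=10 actually helps here is exactly the open question):

~~~~
# Show the current VDSM refresh interval (in seconds), set it to 10,
# then restart the engine so the change takes effect.
engine-config -g VdsRefreshRate
engine-config -s VdsRefreshRate=10
systemctl restart ovirt-engine
~~~~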

Comment 5 Scott Dickerson 2020-06-18 14:13:05 UTC
As a first step in dealing with VM Portal generating a lot of REST calls, a few changes have been made:
  https://github.com/oVirt/ovirt-web-ui/pull/1238

Comment 8 Michal Skrivanek 2020-07-07 12:29:28 UTC
For webadmin, this is most likely bug 1845747, which has been fixed for 4.4 and 4.3.11.

Comment 9 Michal Skrivanek 2020-07-07 12:33:18 UTC
For 4.4.1 there's a partial VM Portal fix too (https://github.com/oVirt/ovirt-web-ui/issues/1240). We haven't measured the difference; it could be significant and solve the problem already, or maybe not. We are targeting further improvement in 4.4.2, so we'll keep the bug open.

Comment 18 David Vaanunu 2020-09-23 14:40:16 UTC
Tested version:

rhv-release-4.4.2-4
redhat-release-8.2-25.0
ovirt-engine-4.4.2.3-0.6
ovirt-web-ui-1.6.4-1



Flow:
Open 50 sessions of the VM Portal and scroll down.

After logging in to the VM Portal, 20 VMs are loaded.
Each scroll down triggers loading of another 20 VMs.

When testing the older version (4.4.1), all VMs were loaded until the end of the list was reached.
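For illustration, the page-by-page fetching can be approximated directly against the REST API (hostname and credentials are placeholders; max and the "page N" search term are the API's paging parameters):

~~~~
# Fetch VMs 20 at a time, mimicking the infinite-scroll behaviour,
# and count how many VMs come back on each page.
for page in 1 2 3; do
  curl -ks -u admin@internal:PASSWORD \
    "https://engine.example.com/ovirt-engine/api/vms?search=page%20${page}&max=20" \
    | grep -o "<vm " | wc -l
done
~~~~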


Results:
hosted-engine (16 cores & 32 GB) usage: 95% CPU and 20 GB RAM

engine usage: 7% CPU, 5 GB memory
postgres usage: 90% CPU, 6.5 GB memory

Comment 19 mlehrer 2020-09-23 14:59:22 UTC
Adding to the previous comment 18:

The number of VMs requested is reduced by the dev fix, but the main issue is the following SQL query: select * from getdisksvmguid, which takes about 19 s and is executed 200 times from the single API call '/ovirt-engine/api/vms;max=100 follow=graphics_consoles' coming from the VM Portal.

Once multiple instances of this query are generated by multiple VM Portal calls, PostgreSQL gets saturated by the concurrent API requests from the VM Portal.
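To make such statements visible on the database side while reproducing the portal call, PostgreSQL slow-query logging can be enabled; a sketch (database/user names, hostname, credentials, and the 1 s threshold are assumptions):

~~~~
# Log every statement slower than 1 second, then reload the configuration.
psql -U postgres -d engine -c "ALTER SYSTEM SET log_min_duration_statement = '1s';"
psql -U postgres -d engine -c "SELECT pg_reload_conf();"

# Reproduce an equivalent of the API call quoted above and time it.
time curl -ks -u admin@internal:PASSWORD \
  "https://engine.example.com/ovirt-engine/api/vms?max=100&follow=graphics_consoles" \
  -o /dev/null
~~~~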

Comment 20 Michal Skrivanek 2020-09-24 12:13:21 UTC
We should eliminate follow=graphics_consoles altogether; its whole reason for existence is the SPICE/VNC console selection, and it would be best to simplify that into no choice at all and just select one internally (or via system/user settings).

That would take care of both the getdisksvmguid query and the extra API call for each VM.
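For comparison, the console information is already exposed per VM by the API, so it could be fetched only for the VM a user actually opens instead of being followed for the whole list; a sketch (VM id, hostname, and credentials are placeholders):

~~~~
# Fetch graphics consoles on demand for a single VM.
curl -ks -u admin@internal:PASSWORD \
  "https://engine.example.com/ovirt-engine/api/vms/123e4567-e89b-42d3-a456-556642440000/graphicsconsoles"
~~~~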

Comment 21 Sharon Gratch 2020-09-29 14:55:57 UTC
(In reply to Michal Skrivanek from comment #20)
> We should eliminate follow=graphics_consoles altogether; its whole reason for
> existence is the SPICE/VNC console selection, and it would be best to simplify
> that into no choice at all and just select one internally (or via system/user
> settings).
> 
> That would take care of both the getdisksvmguid query and the extra API call
> for each VM.

I'm re-targeting this for 4.4.4 since we won't have time to complete it for 4.4.3.

Comment 24 mlehrer 2020-11-29 11:21:47 UTC
The target milestone is set to 4.4.4; is this still accurate?

Comment 25 Sandro Bonazzola 2020-12-18 15:47:44 UTC
This bug is in NEW status for ovirt 4.4.4. We are now in the blocker-only phase; please either mark this as a blocker or re-target it.

Comment 26 Sharon Gratch 2020-12-28 16:51:14 UTC
(In reply to Sandro Bonazzola from comment #25)
> This bug is in NEW status for ovirt 4.4.4. We are now in the blocker-only
> phase; please either mark this as a blocker or re-target it.

re-targeted

Comment 27 Sharon Gratch 2020-12-28 16:53:36 UTC
(In reply to mlehrer from comment #24)
> The target milestone is set to 4.4.4; is this still accurate?

This is still in progress, so it has been postponed to 4.4.5.

Comment 37 mlehrer 2021-04-08 10:00:22 UTC
#Summary 
50 users continuously scrolling concurrently create a moderate load, but both PostgreSQL and overall CPU utilization are reduced compared with the previous version's test.


#How was this tested
A Puppeteer script simulates 50 users, each with their own browser, logging into the VM Portal and scrolling down the page every few seconds.
Users are ramped up every few seconds and continue to scroll down the page.
Backend actions taking over 1 second are collected by Glowroot; resources are monitored by nmon.
Basic webadmin functionality was checked during the peak 50-user scroll load scenario.
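A rough sketch of the monitoring side of such a run (the nmon sampling values are illustrative, and the load-script invocation is hypothetical, not part of any shipped tooling):

~~~~
# Sample system resources every 15 seconds, 960 samples (about 4 hours), into an nmon file.
nmon -f -s 15 -c 960

# Hypothetical Puppeteer driver invocation; the script name and flags are illustrative only.
node vm-portal-scroll-load.js --users 50 &
~~~~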

#Env
4215 VMs and 260 hosts
rhv-release-4.4.5-11-001.noarch
ovirt-web-ui-1.6.8-1.el8ev.noarch

#Findings 
PostgreSQL CPU utilization is reduced by about 30% compared with the previous test.
Overall engine CPU utilization is reduced, and the average run queue length is cut in half compared with the previous test.
Webadmin actions may degrade by a few seconds during peak load.

Moving to verified.

Comment 42 errata-xmlrpc 2021-04-14 11:43:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] 0-day security, bug fix, enhance), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1186

