Bug 1795457

Summary: RHV-M causing high load on PostgreSQL DB after upgrade to 4.2
Product: Red Hat Enterprise Virtualization Manager
Reporter: hhaberma
Component: ovirt-web-ui
Assignee: Ben Amsalem <bamsalem>
Status: CLOSED ERRATA
QA Contact: David Vaanunu <dvaanunu>
Severity: high
Docs Contact:
Priority: high
Version: 4.2.8
CC: achareka, bamsalem, dagur, fgarciad, lleistne, michal.skrivanek, mlehrer, nashok, pelauter, sdickers, sgratch
Target Milestone: ovirt-4.4.5-1
Keywords: Performance
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: ovirt-web-ui-1.6.8-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-14 11:43:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: UX
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description hhaberma 2020-01-28 02:00:42 UTC
Description of problem:

The customer recently upgraded from RHVM 4.1 to 4.2. After the upgrade, the customer observed many postgresql processes, and CPU load rose abnormally to 60 percent.

The customer was able to restore performance on the manager by disabling the vacuum option in postgresql.conf and stopping DWH on the manager. These two accounted for most of the postgresql processes observed after the upgrade.

The engine log shows the same signature as in https://bugzilla.redhat.com/show_bug.cgi?id=1666610, occurring several times a second.

~~~~
2020-01-25 07:45:59,521-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,522-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126'
2020-01-25 07:45:59,522-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,564-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b' is missing 1 prestarted VMs, attempting to prestart 1 VMs
2020-01-25 07:45:59,565-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b'
2020-01-25 07:45:59,565-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,661-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,661-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191'
~~~~
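To check how often this signature occurs, the warnings can be counted per second. The following is a minimal sketch, not an official tool: the regular expression is derived only from the log lines quoted above, the function name failures_per_second is hypothetical, and the log path in the usage note is an assumption.

```python
import re
from collections import Counter

# Captures the timestamp at one-second resolution and the pool id from the
# VmPoolMonitor "Failed to prestart" warning. The pattern is an assumption
# based solely on the log lines quoted in this report.
FAILED_PRESTART = re.compile(
    r"^(?P<second>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\S*\s+WARN\s+"
    r"\[org\.ovirt\.engine\.core\.bll\.VmPoolMonitor\].*"
    r"Failed to prestart any VMs for VmPool '(?P<pool>[0-9a-f-]+)'"
)


def failures_per_second(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM:SS' to the number of warnings."""
    counts = Counter()
    for line in lines:
        match = FAILED_PRESTART.match(line)
        if match:
            counts[match.group("second")] += 1
    return counts


# Sample input: the three WARN lines and one INFO line from the excerpt above.
SAMPLE = [
    "2020-01-25 07:45:59,522-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126'",
    "2020-01-25 07:45:59,522-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting",
    "2020-01-25 07:45:59,565-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b'",
    "2020-01-25 07:45:59,661-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191'",
]

if __name__ == "__main__":
    for second, count in sorted(failures_per_second(SAMPLE).items()):
        print(second, count)
```

To run it against a real log, pass an open file instead of SAMPLE, for example failures_per_second(open("engine.log")), where the file name depends on the sosreport layout.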

The customer has disabled the DWH, and the manager has 16 CPUs:
~~~~
cat cpuinfo | grep processor | wc -l
16
~~~~

https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000je8n4QAA&sfdc.override=1

Logs:
RHVM sosreport-20200127-195113
0020-engine_backup

I understand a fix is being pursued in BZ#1666610. I would like to know whether the recommendation to set "VdsRefreshRate to 10" can also be applied here to try to reduce the query volume.

Comment 5 Scott Dickerson 2020-06-18 14:13:05 UTC
As a first step in dealing with VM Portal generating a lot of REST calls, a few changes have been made:
  https://github.com/oVirt/ovirt-web-ui/pull/1238

Comment 8 Michal Skrivanek 2020-07-07 12:29:28 UTC
For webadmin this is most likely bug 1845747 which has been fixed for 4.4 and 4.3.11

Comment 9 Michal Skrivanek 2020-07-07 12:33:18 UTC
For 4.4.1 there's a partial VM Portal fix too (https://github.com/oVirt/ovirt-web-ui/issues/1240). We haven't measured the difference; it could be significant and already solve the problem, or maybe not. We are targeting further improvement in 4.4.2, so we'll keep the bug open.

Comment 18 David Vaanunu 2020-09-23 14:40:16 UTC
Tested version:

rhv-release-4.4.2-4
redhat-release-8.2-25.0
ovirt-engine-4.4.2.3-0.6
ovirt-web-ui-1.6.4-1



Flow:
Open 50 sessions of VM portal and scroll down.

After logging in to the VM Portal, 20 VMs are loaded.
Each scroll down triggers loading of another 20 VMs.

When tested on the older version (4.4.1), all the VMs kept loading until the end of the list was reached.
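The paging behavior in the flow above can be put into numbers with a short sketch. It assumes only what the flow states (20 VMs on login, 20 more per scroll); the function names and the total VM counts are hypothetical and used for illustration.

```python
import math

PAGE_SIZE = 20  # VMs loaded per request, as described in the flow above


def requests_to_load(total_vms, page_size=PAGE_SIZE):
    """Page requests (initial load plus scrolls) needed to list total_vms VMs."""
    if total_vms <= 0:
        return 0
    return math.ceil(total_vms / page_size)


def vms_loaded_after(scrolls, total_vms, page_size=PAGE_SIZE):
    """VMs visible after the initial load plus the given number of scroll-downs."""
    return min(total_vms, (scrolls + 1) * page_size)


if __name__ == "__main__":
    # Hypothetical environment with 1000 VMs visible to the user.
    print(requests_to_load(1000))
    print(vms_loaded_after(3, 1000))
```

With paging on scroll, a session that never scrolls issues one request for 20 VMs, whereas a client that keeps loading until the end issues requests_to_load(total) requests per session.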


Results:
hosted-engine (16 cores & 32GB) usage: 95% CPU and 20GB RAM

engine usage: 7% CPU, 5GB memory
postgres usage: 90% CPU, 6.5GB memory

Comment 19 mlehrer 2020-09-23 14:59:22 UTC
Just adding to the previous comment 18

The number of VMs requested is reduced by the dev fix, but the main issue is the following SQL query: select * from getdisksvmguid. It takes about 19s and is executed 200 times from a single API call to '/ovirt-engine/api/vms;max=100 follow=graphics_consoles', which comes from the VM Portal.

When multiple VM Portal calls each generate many instances of this query, PostgreSQL becomes saturated by the concurrent API requests from the VM Portal.
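A rough back-of-the-envelope sketch of the query volume follows. It uses only the figures reported above (200 executions per API call, max=100) and the 50 sessions from comment 18. It assumes each session has exactly one such API call in flight, which is an upper-bound style estimate and not a measurement; the function names are hypothetical.

```python
QUERIES_PER_CALL = 200  # getdisksvmguid executions per API call, as reported
VMS_PER_CALL = 100      # max=100 in the API request


def queries_per_vm(queries=QUERIES_PER_CALL, vms=VMS_PER_CALL):
    """Average getdisksvmguid executions per VM returned by one call."""
    return queries / vms


def total_queries(sessions, queries=QUERIES_PER_CALL):
    """Query executions if every session issues one such API call at once."""
    return sessions * queries


if __name__ == "__main__":
    print(queries_per_vm())   # executions per VM
    print(total_queries(50))  # executions for 50 concurrent sessions
```

Under these assumptions the query runs about twice per VM returned, and 50 concurrent sessions issue 10000 executions of it.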

Comment 20 Michal Skrivanek 2020-09-24 12:13:21 UTC
We should eliminate the whole follow=graphics_consoles. Its whole reason for existence is just the spice/vnc console selection, and that would be best simplified into no choice: just select one internally (or via system/user settings).

That would take care of both the getdisksvmguid query and the extra API call for each VM.

Comment 21 Sharon Gratch 2020-09-29 14:55:57 UTC
(In reply to Michal Skrivanek from comment #20)
> we should eliminate the whole follow=graphics_consoles, its whole reason
> for existence is just the spice/vnc console selection and that would be best
> to simplify into no choice, just select one internally (or via system/user
> settings)
> 
> that would take care of both the getdisksvmguid query but also an extra API
> call for each VM

I'm retargeting this to 4.4.4 since we won't have time to complete it for 4.4.3.

Comment 24 mlehrer 2020-11-29 11:21:47 UTC
Target milestone is set to 4.4.4; is this still accurate?

Comment 25 Sandro Bonazzola 2020-12-18 15:47:44 UTC
This bug is in NEW status for ovirt 4.4.4. We are now in blocker only phase, please either mark this as a blocker or please re-target.

Comment 26 Sharon Gratch 2020-12-28 16:51:14 UTC
(In reply to Sandro Bonazzola from comment #25)
> This bug is in NEW status for ovirt 4.4.4. We are now in blocker only phase,
> please either mark this as a blocker or please re-target.

re-targeted

Comment 27 Sharon Gratch 2020-12-28 16:53:36 UTC
(In reply to mlehrer from comment #24)
> Target milestone is set to 4.4.4; is this still accurate?

This is still in-progress so postponed to 4.4.5.

Comment 37 mlehrer 2021-04-08 10:00:22 UTC
#Summary 
50 users scrolling continuously and concurrently create a moderate load, but PostgreSQL CPU utilization, as well as overall CPU utilization, is reduced compared to testing of the previous version.


#How was this tested
A Puppeteer script simulates 50 users, each with its own browser, who log in to the VM Portal and keep scrolling down the page every few seconds.
Users are added every few seconds and continue to scroll down the page.
Backend actions taking over 1 second are collected by Glowroot; resources are monitored by nmon.
Basic Webadmin functionality was checked during the peak 50-user scroll load scenario.

#Env
4215 Vms and 260 Hosts
rhv-release-4.4.5-11-001.noarch
ovirt-web-ui-1.6.8-1.el8ev.noarch

#Findings 
PostgreSQL CPU utilization is reduced by about 30% compared to the previous test.
Overall engine CPU utilization is reduced, and average run queue length is cut in half, compared to the previous test.
Webadmin actions may degrade by a few seconds during peak load.

Moving to verified.

Comment 42 errata-xmlrpc 2021-04-14 11:43:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] 0-day security, bug fix, enhance), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1186