Description of problem:

The customer recently upgraded from RHVM 4.1 to 4.2. After the upgrade they observed many postgresql processes, and CPU load spiked abnormally to around 60 percent. The customer was able to restore performance on the Manager by disabling the vacuum option in postgresql.conf and by stopping DWH along with the Manager; these two accounted for most of the postgresql processes observed after the upgrade.

In the engine log, the same signature as in https://bugzilla.redhat.com/show_bug.cgi?id=1666610 is found, occurring several times a second:

~~~~
2020-01-25 07:45:59,521-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,522-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '2a253524-3e07-4b00-ae6c-742f58ad9126'
2020-01-25 07:45:59,522-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,564-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b' is missing 1 prestarted VMs, attempting to prestart 1 VMs
2020-01-25 07:45:59,565-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '3a5968e2-ee7b-41c5-8bc5-4568c08e858b'
2020-01-25 07:45:59,565-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] No VMs available for prestarting
2020-01-25 07:45:59,661-03 INFO  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191' is missing 2 prestarted VMs, attempting to prestart 2 VMs
2020-01-25 07:45:59,661-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool '8f86a8ab-edc9-417b-bfa3-5bada5818191'
~~~~

The customer has disabled DWH, and the Manager has 16 CPUs:

~~~~
cat cpuinfo | grep processor | wc -l
16
~~~~

https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000je8n4QAA&sfdc.override=1

Logs:
RHVM sosreport-20200127-195113
0020-engine_backup

I understand a fix is being sought in BZ# 1666610. I would like to know whether the recommendation of setting "VdsRefreshRate to 10" can be applied here too, to try to reduce the query volume.
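To quantify how often this signature fires, one option is to bucket the "Failed to prestart" warnings by second. A minimal sketch, assuming engine.log lines in the format shown above (the function name is illustrative, not part of any oVirt tooling):

```python
import re
from collections import Counter

# Matches the timestamp (to the second) of VmPoolMonitor "Failed to prestart" lines.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\S* WARN.*VmPoolMonitor.*Failed to prestart"
)

def prestart_failures_per_second(lines):
    """Count 'Failed to prestart' warnings grouped by second of occurrence."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Example with lines taken from the engine log excerpt above:
sample = [
    "2020-01-25 07:45:59,522-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] "
    "(DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool "
    "'2a253524-3e07-4b00-ae6c-742f58ad9126'",
    "2020-01-25 07:45:59,565-03 WARN  [org.ovirt.engine.core.bll.VmPoolMonitor] "
    "(DefaultQuartzScheduler8) [1baca962] Failed to prestart any VMs for VmPool "
    "'3a5968e2-ee7b-41c5-8bc5-4568c08e858b'",
]
print(prestart_failures_per_second(sample))
```

Running this over the full engine.log would show how many warnings per second the pool monitor emits, which helps correlate the log spam with the CPU spikes.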
As a first step in dealing with VM Portal generating a lot of REST calls, a few changes have been made: https://github.com/oVirt/ovirt-web-ui/pull/1238
For webadmin this is most likely bug 1845747, which has been fixed for 4.4 and 4.3.11.
For 4.4.1 there's a partial VM Portal fix too (https://github.com/oVirt/ovirt-web-ui/issues/1240). We haven't measured the difference; it could be significant and solve the problem already, or maybe not. We are targeting further improvement in 4.4.2, so we'll keep the bug open.
Tested version:
rhv-release-4.4.2-4
redhat-release-8.2-25.0
ovirt-engine-4.4.2.3-0.6
ovirt-web-ui-1.6.4-1

Flow:
Open 50 sessions of VM Portal and scroll down. After login to VM Portal, 20 VMs are loaded; each scroll-down triggers loading of another 20 VMs. When tested on the older version (4.4.1), VMs kept loading until the end of the list was reached.

Results:
hosted-engine (16 cores & 32GB) usage: 95% CPU and 20GB RAM
engine usage: 7% CPU, 5GB memory
postgres usage: 90% CPU, 6.5GB memory
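The scroll flow above implies a simple request-volume model: one VM-list API request per 20-VM page, per session. A rough sketch, where the 50-session and 20-per-scroll figures come from the flow above and the total VM count is an illustrative parameter:

```python
import math

def portal_scroll_requests(total_vms, sessions=50, page_size=20):
    """Rough count of VM-list API requests generated when every session
    scrolls until all VMs have been loaded (one request per page per session)."""
    pages_per_session = math.ceil(total_vms / page_size)
    return sessions * pages_per_session

# Illustrative: with 1000 VMs, each session issues 50 page requests,
# so 50 concurrent sessions generate 2500 list requests in total.
print(portal_scroll_requests(1000))
```

Even before considering the per-request cost on the database, this shows why 50 concurrently scrolling sessions multiply the load so quickly.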
Just adding to the previous comment 18: the number of VMs requested is reduced by the dev fix, but the main issue is the following SQL query:

select * from getdisksvmguid

which takes about 19s and is executed 200 times from the single API call '/ovirt-engine/api/vms;max=100 follow=graphics_consoles' issued by the VM Portal. When multiple instances of this query are generated from multiple VM Portal calls, PostgreSQL gets saturated by the concurrent API requests.
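A back-of-the-envelope sketch of why this saturates the database, using the figures from the comment above (19 s per query, 200 executions per API call); the concurrency level is an illustrative parameter:

```python
def db_query_seconds(api_calls, queries_per_call=200, seconds_per_query=19):
    """Total database work (in query-seconds) generated by VM Portal API
    calls that each fan out into many getdisksvmguid executions."""
    return api_calls * queries_per_call * seconds_per_query

# A single portal call alone demands 200 * 19 = 3800 query-seconds of DB time;
# 50 concurrent portal sessions push that to 190000 query-seconds, far beyond
# what a 16-core manager can absorb without PostgreSQL saturating.
print(db_query_seconds(1), db_query_seconds(50))
```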
We should eliminate the whole follow=graphics_consoles; its whole reason for existence is just the SPICE/VNC console selection, and it would be best to simplify that into no choice, i.e. just select one internally (or via system/user settings). That would take care of both the getdisksvmguid query and an extra API call for each VM.
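For illustration, a sketch of the two request shapes involved. The base path and `max` form follow the call quoted in the previous comment; the exact `follow` parameter syntax on the wire may differ, and the helper function itself is hypothetical:

```python
def vms_list_url(max_results=100, follow=None):
    """Build an engine REST path for listing VMs, optionally asking the
    server to follow a linked sub-collection such as graphics_consoles."""
    url = f"/ovirt-engine/api/vms;max={max_results}"
    if follow:
        url += f"?follow={follow}"
    return url

# Current portal behaviour: the follow triggers the per-VM getdisksvmguid work.
print(vms_list_url(follow="graphics_consoles"))
# Proposed simplification: no follow, console type chosen internally.
print(vms_list_url())
```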
(In reply to Michal Skrivanek from comment #20)
> we should eliminate the whole follow=graphics_consoles, it's whole reason
> for existence is just the spice/vnc console selection and that would be best
> to simplify into no choice, just select one internally (or via system/user
> settings)
>
> that would take care of both the getdisksvmguid query but also an extra API
> call for each VM

I'm re-targeting this for 4.4.4 since we won't have time to complete that for 4.4.3.
Target milestone is set to 4.4.4; is this still accurate?
This bug is in NEW status for ovirt 4.4.4. We are now in the blocker-only phase; please either mark this as a blocker or re-target.
(In reply to Sandro Bonazzola from comment #25) > This bug is in NEW status for ovirt 4.4.4. We are now in blocker only phase, > please either mark this as a blocker or please re-target. re-targeted
(In reply to mlehrer from comment #24)
> Target milestone is set to 4.4.4 is this still accurate?

This is still in progress, so postponing to 4.4.5.
#Summary
50 users continually scrolling concurrently creates a moderate load, but both PostgreSQL and overall CPU utilization are reduced compared with the previous version's testing.

#How was this tested
A Puppeteer script simulates 50 users, each with a browser that logs into VM Portal and keeps scrolling down the page every few seconds. Users are loaded every few seconds and continue to scroll down the page. Backend actions taking over 1 second are collected by Glowroot; resources are monitored by nmon. Basic webadmin functionality was checked during the peak 50-user scroll-load scenario.

#Env
4215 VMs and 260 hosts
rhv-release-4.4.5-11-001.noarch
ovirt-web-ui-1.6.8-1.el8ev.noarch

#Findings
PostgreSQL CPU utilization reduced by about 30% compared to the previous test.
Overall engine CPU utilization reduced, with average run-queue length cut in half compared to the previous test.
Webadmin actions may degrade by a few seconds during peak load.

Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] 0-day security, bug fix, enhance), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1186