1992686 – [Scale] Slow performing vm api follow query impacting engine health

Bug 1992686 - [Scale] Slow performing vm api follow query impacting engine health

Summary: [Scale] Slow performing vm api follow query impacting engine health

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	RestAPI
Sub Component:
Version:	4.4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.4.9
Target Release:	---
Assignee:	Arik
QA Contact:	David Vaanunu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-08-11 14:54 UTC by mlehrer
Modified:	2023-09-15 01:13 UTC (History)
CC List:	12 users (show)
Fixed In Version:	ovirt-engine-4.4.9-1
Clone Of:
Environment:
Last Closed:	2021-10-21 07:27:14 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	pm-rhel: ovirt-4.4+ pm-rhel: devel_ack+

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHV-43004	None	None	None	2021-08-11 14:58:37 UTC
oVirt gerrit	116656	master	MERGED	rest-api: accelerate /api/vms?follow=cdroms	2021-10-05 15:43:04 UTC
oVirt gerrit	116657	master	MERGED	rest-api: accelerate /api/vms?follow=host_devices	2021-10-06 16:49:27 UTC
oVirt gerrit	116658	master	MERGED	rest-api: accelerate /api/vms?follow=watchdogs	2021-10-06 16:49:30 UTC
oVirt gerrit	116659	master	MERGED	rest-api: accelerate /api/vms?follow=snapshots	2021-10-06 16:49:32 UTC
oVirt gerrit	116979	ovirt-engine-4.4	MERGED	rest-api: accelerate /api/vms?follow=cdroms	2021-10-06 09:37:56 UTC
oVirt gerrit	117006	ovirt-engine-4.4	MERGED	rest-api: accelerate /api/vms?follow=host_devices	2021-10-06 19:11:59 UTC
oVirt gerrit	117007	ovirt-engine-4.4	MERGED	rest-api: accelerate /api/vms?follow=watchdogs	2021-10-06 19:12:02 UTC
oVirt gerrit	117008	ovirt-engine-4.4	MERGED	rest-api: accelerate /api/vms?follow=snapshots	2021-10-06 19:12:12 UTC

Description mlehrer 2021-08-11 14:54:35 UTC

Description of problem:
RHV API VMs with follows takes too long: 60s to return 500 results.
Impacts Postgresql health and engine stability.


Example use case:
MTV connects to RHV via API to create an inventory of VMs.
A series of queries is issued each one taking around 60s in total.


Transaction name: 	GET /ovirt-engine/api/vms/
Start: 	2021-08-11 7:19:45.977 am (-05:00)
Duration: 	60,267.6 milliseconds
Request http method: 	GET
Request query string: 	follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2



Version-Release number of selected component (if applicable):
4.4.7.4-0.9

How reproducible:
very

Steps to Reproduce:
1. Load 4000 Vms + 250 hosts
2. Issue GET /ovirt-engine/api/vms/follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2
3. Repeat 4 times

Actual results:
The request takes 60s of which 58.5s is spent on 6,514 postgres queries.

Problematic query is
select * from getdisksvmguid(?, ?, ?, ?)
Which is executed 500 times taking 103ms per execution returning only 1 row.

Expected results:
API result returned in 1.5s max

Additional info:

Comment 1 mlehrer 2021-08-11 15:01:50 UTC

Since this is really a query issue I'm attaching the db dump of the environment to allow for dev to reproduce the issue.
https://drive.google.com/drive/folders/1A0uA3WNymWvwUZAvFdDY1lLEdTZRUJle?usp=sharing

Comment 2 mlehrer 2021-08-11 15:06:01 UTC

Marking severity to HIGH due to impact on postgres and thus the engine itself.

Comment 6 Michal Skrivanek 2021-08-24 12:40:46 UTC

is it possible to test without disk_attachments, and without snapshots, and without both?
We suspect the disks query to be the only one that's so heavy.

Comment 7 mlehrer 2021-08-30 11:39:19 UTC

No disk_attachment 69s
No snapshots 7.5s
No Disk & No snapshot: 7s

Comment 12 David Vaanunu 2021-10-12 15:44:06 UTC

Verify with RHV-4.4.9-4
using RHV-RED-01 , ~6300VMs



[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify1.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+1"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6368k    0 6368k    0     0   485k      0 --:--:--  0:00:13 --:--:-- 1634k

real	0m13.113s
user	0m0.028s
sys	0m0.018s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify2.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6113k    0 6113k    0     0   441k      0 --:--:--  0:00:13 --:--:-- 1318k

real	0m13.858s
user	0m0.026s
sys	0m0.018s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify3.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+3"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6120k    0 6120k    0     0   459k      0 --:--:--  0:00:13 --:--:-- 1489k

real	0m13.329s
user	0m0.023s
sys	0m0.021s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify4.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+4"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6093k    0 6093k    0     0   448k      0 --:--:--  0:00:13 --:--:-- 1388k

real	0m13.617s
user	0m0.029s
sys	0m0.016s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify5.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+5"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6081k    0 6081k    0     0   457k      0 --:--:--  0:00:13 --:--:-- 1489k

real	0m13.301s
user	0m0.023s
sys	0m0.022s


Attached queries output "bugverify.tar.gz" file

Comment 14 Arik 2021-10-12 17:49:13 UTC

(In reply to David Vaanunu from comment #12)
> Verify with RHV-4.4.9-4
> using RHV-RED-01 , ~6300VMs

Can you kindly please also check how much time it takes to query all VMs (without paging) on that env?

Comment 15 David Vaanunu 2021-10-13 10:46:33 UTC

~6500 Vms , 02:40 minutes



[root@rhev-red-01 ~]# time  curl -i -k -o /root/allvms.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 76.1M    0 76.1M    0     0   486k      0 --:--:--  0:02:40 --:--:-- 19.0M

real	2m40.377s
user	0m0.200s
sys	0m0.184s


[root@rhev-red-01 ~]# grep "<name>" allvms.log  | egrep -v "Etc|nic" | wc -l
6539

Comment 16 Arik 2021-10-13 19:39:43 UTC

(In reply to David Vaanunu from comment #15)
> ~6500 Vms , 02:40 minutes
> 
> 
> 
> [root@rhev-red-01 ~]# time  curl -i -k -o /root/allvms.log -X GET -H
> "Accept: application/xml" -H "Content-type: application/xml" -u
> admin@internal:passw0rd
> "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/
> vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2
> Cnics"
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time 
> Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 76.1M    0 76.1M    0     0   486k      0 --:--:--  0:02:40 --:--:--
> 19.0M
> 
> real	2m40.377s
> user	0m0.200s
> sys	0m0.184s
> 
> 
> [root@rhev-red-01 ~]# grep "<name>" allvms.log  | egrep -v "Etc|nic" | wc -l
> 6539

Thanks, so we have 14 pages and the last page contains only 39 VMs
Assuming it takes ~13s to query each page as shown before, I'd expect that querying all the VM using pages of 500 VMs to take a bit more than 13s*13 = 169s = 2m49s
Interestingly, querying all the VMs at once doesn't improve the performance (so I guess initiating the connection on the test environment takes less time than it took on my environment) - it remains sort of the same
That above is the not-that-good news but the good news are that with those enhancements RHV manages to report a significant amount of VMs in a reasonable time
Jeff, I think it is worth a shot to drop paging, at least as from RHV 4.4.9

Comment 17 Sandro Bonazzola 2021-10-21 07:27:14 UTC

This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 18 Red Hat Bugzilla 2023-09-15 01:13:34 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

Note You need to log in before you can comment on or make changes to this bug.