Bug 1992686 - [Scale] Slow performing vm api follow query impacting engine health
Summary: [Scale] Slow performing vm api follow query impacting engine health
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: RestAPI
Version: 4.4.7
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.4.9
: ---
Assignee: Arik
QA Contact: David Vaanunu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-08-11 14:54 UTC by mlehrer
Modified: 2023-09-15 01:13 UTC (History)
12 users (show)

Fixed In Version: ovirt-engine-4.4.9-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-21 07:27:14 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
pm-rhel: devel_ack+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43004 0 None None None 2021-08-11 14:58:37 UTC
oVirt gerrit 116656 0 master MERGED rest-api: accelerate /api/vms?follow=cdroms 2021-10-05 15:43:04 UTC
oVirt gerrit 116657 0 master MERGED rest-api: accelerate /api/vms?follow=host_devices 2021-10-06 16:49:27 UTC
oVirt gerrit 116658 0 master MERGED rest-api: accelerate /api/vms?follow=watchdogs 2021-10-06 16:49:30 UTC
oVirt gerrit 116659 0 master MERGED rest-api: accelerate /api/vms?follow=snapshots 2021-10-06 16:49:32 UTC
oVirt gerrit 116979 0 ovirt-engine-4.4 MERGED rest-api: accelerate /api/vms?follow=cdroms 2021-10-06 09:37:56 UTC
oVirt gerrit 117006 0 ovirt-engine-4.4 MERGED rest-api: accelerate /api/vms?follow=host_devices 2021-10-06 19:11:59 UTC
oVirt gerrit 117007 0 ovirt-engine-4.4 MERGED rest-api: accelerate /api/vms?follow=watchdogs 2021-10-06 19:12:02 UTC
oVirt gerrit 117008 0 ovirt-engine-4.4 MERGED rest-api: accelerate /api/vms?follow=snapshots 2021-10-06 19:12:12 UTC

Description mlehrer 2021-08-11 14:54:35 UTC
Description of problem:
RHV API VMs with follows takes too long: 60s to return 500 results.
Impacts Postgresql health and engine stability.


Example use case:
MTV connects to RHV via API to create an inventory of VMs.
A series of queries is issued each one taking around 60s in total.


Transaction name: 	GET /ovirt-engine/api/vms/
Start: 	2021-08-11 7:19:45.977 am (-05:00)
Duration: 	60,267.6 milliseconds
Request http method: 	GET
Request query string: 	follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2



Version-Release number of selected component (if applicable):
4.4.7.4-0.9

How reproducible:
very

Steps to Reproduce:
1. Load 4000 Vms + 250 hosts
2. Issue GET /ovirt-engine/api/vms/follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2
3. Repeat 4 times

Actual results:
The request takes 60s of which 58.5s is spent on 6,514 postgres queries.

Problematic query is
select * from getdisksvmguid(?, ?, ?, ?)
Which is executed 500 times taking 103ms per execution returning only 1 row.

Expected results:
API result returned in 1.5s max

Additional info:

Comment 1 mlehrer 2021-08-11 15:01:50 UTC
Since this is really a query issue I'm attaching the db dump of the environment to allow for dev to reproduce the issue.
https://drive.google.com/drive/folders/1A0uA3WNymWvwUZAvFdDY1lLEdTZRUJle?usp=sharing

Comment 2 mlehrer 2021-08-11 15:06:01 UTC
Marking severity to HIGH due to impact on postgres and thus the engine itself.

Comment 6 Michal Skrivanek 2021-08-24 12:40:46 UTC
is it possible to test without disk_attachments, and without snapshots, and without both?
We suspect the disks query to be the only one that's so heavy.

Comment 7 mlehrer 2021-08-30 11:39:19 UTC
No disk_attachment 69s
No snapshots 7.5s
No Disk & No snapshot: 7s

Comment 12 David Vaanunu 2021-10-12 15:44:06 UTC
Verify with RHV-4.4.9-4
using RHV-RED-01 , ~6300VMs



[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify1.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+1"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6368k    0 6368k    0     0   485k      0 --:--:--  0:00:13 --:--:-- 1634k

real	0m13.113s
user	0m0.028s
sys	0m0.018s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify2.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6113k    0 6113k    0     0   441k      0 --:--:--  0:00:13 --:--:-- 1318k

real	0m13.858s
user	0m0.026s
sys	0m0.018s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify3.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+3"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6120k    0 6120k    0     0   459k      0 --:--:--  0:00:13 --:--:-- 1489k

real	0m13.329s
user	0m0.023s
sys	0m0.021s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify4.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+4"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6093k    0 6093k    0     0   448k      0 --:--:--  0:00:13 --:--:-- 1388k

real	0m13.617s
user	0m0.029s
sys	0m0.016s
[root@rhev-red-01 ~]# time  curl -i -k -o /root/bugverify5.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+5"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6081k    0 6081k    0     0   457k      0 --:--:--  0:00:13 --:--:-- 1489k

real	0m13.301s
user	0m0.023s
sys	0m0.022s


Attached queries output "bugverify.tar.gz" file

Comment 14 Arik 2021-10-12 17:49:13 UTC
(In reply to David Vaanunu from comment #12)
> Verify with RHV-4.4.9-4
> using RHV-RED-01 , ~6300VMs

Can you kindly please also check how much time it takes to query all VMs (without paging) on that env?

Comment 15 David Vaanunu 2021-10-13 10:46:33 UTC
~6500 Vms , 02:40 minutes



[root@rhev-red-01 ~]# time  curl -i -k -o /root/allvms.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 76.1M    0 76.1M    0     0   486k      0 --:--:--  0:02:40 --:--:-- 19.0M

real	2m40.377s
user	0m0.200s
sys	0m0.184s


[root@rhev-red-01 ~]# grep "<name>" allvms.log  | egrep -v "Etc|nic" | wc -l
6539

Comment 16 Arik 2021-10-13 19:39:43 UTC
(In reply to David Vaanunu from comment #15)
> ~6500 Vms , 02:40 minutes
> 
> 
> 
> [root@rhev-red-01 ~]# time  curl -i -k -o /root/allvms.log -X GET -H
> "Accept: application/xml" -H "Content-type: application/xml" -u
> admin@internal:passw0rd
> "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/
> vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2
> Cnics"
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time 
> Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 76.1M    0 76.1M    0     0   486k      0 --:--:--  0:02:40 --:--:--
> 19.0M
> 
> real	2m40.377s
> user	0m0.200s
> sys	0m0.184s
> 
> 
> [root@rhev-red-01 ~]# grep "<name>" allvms.log  | egrep -v "Etc|nic" | wc -l
> 6539

Thanks, so we have 14 pages and the last page contains only 39 VMs
Assuming it takes ~13s to query each page as shown before, I'd expect that querying all the VM using pages of 500 VMs to take a bit more than 13s*13 = 169s = 2m49s
Interestingly, querying all the VMs at once doesn't improve the performance (so I guess initiating the connection on the test environment takes less time than it took on my environment) - it remains sort of the same
That above is the not-that-good news but the good news are that with those enhancements RHV manages to report a significant amount of VMs in a reasonable time
Jeff, I think it is worth a shot to drop paging, at least as from RHV 4.4.9

Comment 17 Sandro Bonazzola 2021-10-21 07:27:14 UTC
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 18 Red Hat Bugzilla 2023-09-15 01:13:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days


Note You need to log in before you can comment on or make changes to this bug.