Description of problem: RHV API VMs with follows takes too long: 60s to return 500 results. Impacts Postgresql health and engine stability. Example use case: MTV connects to RHV via API to create an inventory of VMs. A series of queries is issued each one taking around 60s in total. Transaction name: GET /ovirt-engine/api/vms/ Start: 2021-08-11 7:19:45.977 am (-05:00) Duration: 60,267.6 milliseconds Request http method: GET Request query string: follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2 Version-Release number of selected component (if applicable): 4.4.7.4-0.9 How reproducible: very Steps to Reproduce: 1. Load 4000 Vms + 250 hosts 2. Issue GET /ovirt-engine/api/vms/follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2 3. Repeat 4 times Actual results: The request takes 60s of which 58.5s is spent on 6,514 postgres queries. Problematic query is select * from getdisksvmguid(?, ?, ?, ?) Which is executed 500 times taking 103ms per execution returning only 1 row. Expected results: API result returned in 1.5s max Additional info:
Since this is really a query issue I'm attaching the db dump of the environment to allow for dev to reproduce the issue. https://drive.google.com/drive/folders/1A0uA3WNymWvwUZAvFdDY1lLEdTZRUJle?usp=sharing
Marking severity to HIGH due to impact on postgres and thus the engine itself.
is it possible to test without disk_attachments, and without snapshots, and without both? We suspect the disks query to be the only one that's so heavy.
No disk_attachment 69s No snapshots 7.5s No Disk & No snapshot: 7s
Verify with RHV-4.4.9-4 using RHV-RED-01 , ~6300VMs [root@rhev-red-01 ~]# time curl -i -k -o /root/bugverify1.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+1" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 6368k 0 6368k 0 0 485k 0 --:--:-- 0:00:13 --:--:-- 1634k real 0m13.113s user 0m0.028s sys 0m0.018s [root@rhev-red-01 ~]# time curl -i -k -o /root/bugverify2.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+2" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 6113k 0 6113k 0 0 441k 0 --:--:-- 0:00:13 --:--:-- 1318k real 0m13.858s user 0m0.026s sys 0m0.018s [root@rhev-red-01 ~]# time curl -i -k -o /root/bugverify3.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+3" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 6120k 0 6120k 0 0 459k 0 --:--:-- 0:00:13 --:--:-- 1489k real 0m13.329s user 0m0.023s sys 0m0.021s [root@rhev-red-01 ~]# time curl -i -k -o /root/bugverify4.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+4" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 6093k 0 6093k 0 0 448k 0 --:--:-- 0:00:13 --:--:-- 1388k real 0m13.617s user 0m0.029s sys 0m0.016s [root@rhev-red-01 ~]# time curl -i -k -o /root/bugverify5.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics&max=500&search=page+5" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 6081k 0 6081k 0 0 457k 0 --:--:-- 0:00:13 --:--:-- 1489k real 0m13.301s user 0m0.023s sys 0m0.022s Attached queries output "bugverify.tar.gz" file
(In reply to David Vaanunu from comment #12) > Verify with RHV-4.4.9-4 > using RHV-RED-01 , ~6300VMs Can you kindly please also check how much time it takes to query all VMs (without paging) on that env?
~6500 Vms , 02:40 minutes [root@rhev-red-01 ~]# time curl -i -k -o /root/allvms.log -X GET -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:passw0rd "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2Cnics" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 76.1M 0 76.1M 0 0 486k 0 --:--:-- 0:02:40 --:--:-- 19.0M real 2m40.377s user 0m0.200s sys 0m0.184s [root@rhev-red-01 ~]# grep "<name>" allvms.log | egrep -v "Etc|nic" | wc -l 6539
(In reply to David Vaanunu from comment #15) > ~6500 Vms , 02:40 minutes > > > > [root@rhev-red-01 ~]# time curl -i -k -o /root/allvms.log -X GET -H > "Accept: application/xml" -H "Content-type: application/xml" -u > admin@internal:passw0rd > "https://rhev-red-01.rdu2.scalelab.redhat.com/ovirt-engine/api/ > vms?follow=disk_attachments%2Chost_devices%2Csnapshots%2Cwatchdogs%2Ccdroms%2 > Cnics" > % Total % Received % Xferd Average Speed Time Time Time > Current > Dload Upload Total Spent Left Speed > 100 76.1M 0 76.1M 0 0 486k 0 --:--:-- 0:02:40 --:--:-- > 19.0M > > real 2m40.377s > user 0m0.200s > sys 0m0.184s > > > [root@rhev-red-01 ~]# grep "<name>" allvms.log | egrep -v "Etc|nic" | wc -l > 6539 Thanks, so we have 14 pages and the last page contains only 39 VMs Assuming it takes ~13s to query each page as shown before, I'd expect that querying all the VM using pages of 500 VMs to take a bit more than 13s*13 = 169s = 2m49s Interestingly, querying all the VMs at once doesn't improve the performance (so I guess initiating the connection on the test environment takes less time than it took on my environment) - it remains sort of the same That above is the not-that-good news but the good news are that with those enhancements RHV manages to report a significant amount of VMs in a reasonable time Jeff, I think it is worth a shot to drop paging, at least as from RHV 4.4.9
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days