Description of problem:
The following query takes more than 9 minutes to complete:

SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM vms_with_tags LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id WHERE ( vms.status = '1' AND ( storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_1%' ) ))) ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100

Version-Release number of selected component (if applicable):
ovirt engine 4.2.3.5-0.1.el7

Topology:
Engine / Engine DB (1 machine)
DWH DB (separate machine)
400 Hosts
RHEL 7.5
3.10.0-862.2.3.el7.x86_64
ovirt-engine-4.2.3.5-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Log in to webadmin
2. Compute -> VMs
3. In the search box: status=up and storage=iscsi_L1_vms_sd_2

Actual results:
The query takes forever to complete.

Expected results:
Shouldn't take that long.

Additional info:
The same happens when making the following API request with the same query params:
https://b01-h21-r620.rhev.openstack.engineering.redhat.com/ovirt-engine/api/vms;max=100?search=status%3Dup+and+storage%3Discsi_L1_vms_sd_1

Looking at postgres.log (configured to log queries with > 10 sec duration) we can see the following:

2018-05-23 11:48:01.340 UTC LOG: duration: 563122.424 ms execute <unnamed>: SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM vms_with_tags LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id WHERE ( vms.status = '1' AND ( storage_domains_with_hosts_view.storage_comment LIKE '%iscsi\_L1\_vms\_sd\_1%' OR storage_domains_with_hosts_view.storage_description LIKE '%iscsi\_L1\_vms\_sd\_1%' OR storage_domains_with_hosts_view.storage_name LIKE '%iscsi\_L1\_vms\_sd\_1%' OR storage_domains_with_hosts_view.storage_pool_name::text LIKE '%iscsi\_L1\_vms\_sd\_1%' ) ))) ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100
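An aside on the logged query text: the generated patterns backslash-escape the underscores because `_` is a single-character wildcard in SQL LIKE/ILIKE. A minimal sketch of the difference, using SQLite as a stand-in (note SQLite needs an explicit ESCAPE clause, whereas PostgreSQL treats backslash as the default LIKE escape character; the strings here are illustrative):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Unescaped '_' is a single-character wildcard: it matches any one character.
wildcard = cur.execute("SELECT 'iscsiXL1Xvms' LIKE 'iscsi_L1_vms'").fetchone()[0]

# Backslash-escaping '_' (as the generated query above does) makes it match
# only a literal underscore.
escaped_hit = cur.execute(
    r"SELECT 'iscsi_L1_vms_sd_1' LIKE '%iscsi\_L1\_vms\_sd\_1%' ESCAPE '\'"
).fetchone()[0]
escaped_miss = cur.execute(
    r"SELECT 'iscsiXL1Xvms' LIKE 'iscsi\_L1\_vms' ESCAPE '\'"
).fetchone()[0]

print(wildcard, escaped_hit, escaped_miss)  # 1 1 0
```

So the escaping is correct for matching the literal domain name; the slowness comes from elsewhere (see the investigation below in the thread).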
1. The title is wrong. Please change it to reflect the issue - search is taking too long.
2. Is the issue with the number of hosts? Or the number of...? For example, you have >4K VMs and I assume at least a similar number of disks across multiple domains.
3. Can you provide more information on the setup - database, storage for it, Engine CPU, memory, ...
Your database seems to be on a NetApp - is it fast? There's a single interface (10g) to it, so I assume a single active path with 2 sessions? Or are both active? I'm not sure your multipath.conf is configured for best practices for that NetApp. Is it SSD based?
(In reply to Yaniv Kaul from comment #2)
> 1. The title is wrong. Please change it to reflect the issue - search is
> taking too long.

Changed

> 2. Is the issue with the number of hosts? Or the number of... ? For example,
> you have >4K VMs and I assume at least a similar number of disks across
> multiple domains.
> 3. Can you provide more information on the setup - database, storage for it,
> Engine CPU, memory, ...
> Your database seems to be on a Netapp - is it fast? There's a single
> interface (10g) to it, so I assume a single active path with 2 sessions? Or
> are both active? I'm not sure your multipath.conf is configured for best
> practices for that Netapp. Is it SSD based?

I think the issue is really with the particular query I specified above. You can try making this REST GET call through any REST client and see that it takes forever:

https://b01-h21-r620.rhev.openstack.engineering.redhat.com/ovirt-engine/api/vms;max=10?search=status%3Dup+and+storage%3Discsi_L1_vms_sd_1

Then observe the latest pg log, in this case: /var/opt/rh/rh-postgresql95/lib/pgsql/data/pg_log/postgresql-Wed.log

It gets clogged with the query I specified above, even though nothing else is invoking it. Please note that I limit the query results to ONLY 10 VMs; lowering the max value to 1 makes the query much faster.

One more important point: in our env we have ~4K VMs, but querying them with a search param such as status=up takes ~1 sec to complete. So the issue is not retrieving thousands of results, but this particular search param called 'storage'. In other words, whenever any search param is combined with the 'storage' param, we see the duration issue.
Pass criteria for this trivial search is under 1 sec, even for 4K VMs. Similar to Bug 1578357, it is possible that this search can avoid using tags (keep me honest here that this is indeed the case).
Please attach a dump of the database for further investigation.
I'm trying to run EXPLAIN ANALYZE on the following query and it takes 100% CPU. Examining top output I don't see any I/O wait, just busy CPUs. vmstat also looks OK. Counting the objects, I see ~4K entries in vms_with_tags and 400 in vds_with_tags.

We need to reproduce the same with lower counts on other systems and see what we get.

SELECT * FROM ( SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM vms_with_tags LEFT OUTER JOIN vds_with_tags ON vms_with_tags.run_on_vds=vds_with_tags.vds_id WHERE vms.vm_name LIKE '%L1%' AND ( ( vds_with_tags.cluster_name LIKE '%b01-h13%%' OR vds_with_tags.cpu_model LIKE '%b01-h13%%' OR vds_with_tags.free_text_comment LIKE '%b01-h13%%' OR vds_with_tags.host_name LIKE '%b01-h13%%' OR vds_with_tags.software_version LIKE '%b01-h13%%' OR vds_with_tags.storage_pool_name LIKE '%b01-h13%%' OR vds_with_tags.tag_name LIKE '%b01-h13%%' OR vds_with_tags.vds_name LIKE '%b01-h13%%' ) AND vms.vm_name LIKE '%L1%' ))) ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100;
(In reply to Roy Golan from comment #9)
> I'm trying to do explain enalyze on the following query and it takes 100c%
> cpu. Examining top output I don't see any io wait, just working cpus. Also
> vmstat looks ok. Counting the objects, I see ~ 4k entries in vms_with tags,
> and 400 in vds_with_tags.
>
> need to reproduce the same with lower counts on other systems and see what
> we get

The explain was re-run and the results can be found here:
https://privatebin-it-iso.int.open.paas.redhat.com/?429636bf68822a2d#W7c7urbeuFAZOqzLfuMRCnYA7Gs28LeDpXQkboEQp6w=

> SELECT * FROM
> (
> SELECT * FROM vms WHERE
> ( vm_guid IN
> (SELECT distinct vms_with_tags.vm_guid FROM vms_with_tags
> LEFT OUTER JOIN vds_with_tags ON
> vms_with_tags.run_on_vds=vds_with_tags.vds_id
> WHERE vms.vm_name LIKE '%L1%'
> AND (
> (
> vds_with_tags.cluster_name LIKE '%b01-h13%%'
> OR
> vds_with_tags.cpu_model LIKE '%b01-h13%%'
> OR
> vds_with_tags.free_text_comment LIKE '%b01-h13%%'
> OR
> vds_with_tags.host_name LIKE '%b01-h13%%'
> OR
> vds_with_tags.software_version LIKE '%b01-h13%%'
> OR
> vds_with_tags.storage_pool_name LIKE '%b01-h13%%'
> OR
> vds_with_tags.tag_name LIKE '%b01-h13%%'
> OR
> vds_with_tags.vds_name LIKE '%b01-h13%%'
> )
> AND
> vms.vm_name LIKE '%L1%' ))) ORDER BY vm_name ASC
> ) as T1 OFFSET (1 -1) LIMIT 100;
Created attachment 1448664 [details] explain analyze output Previous link shared is no longer valid, I've attached the output of explain as its own file here.
Attaching the EXPLAIN ANALYZE command output after a full vacuum, per Roy's request: explain_analize_after_vacum.txt

The query:

-=>>date; psql -t -l -U engine -c "EXPLAIN ANALYZE SELECT * FROM(SELECT * FROM vms WHERE( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM vms_with_tags LEFT OUTER JOIN vds_with_tags ON vms_with_tags.run_on_vds=vds_with_tags.vds_id WHERE vms.vm_name LIKE '%L1%' AND ((vds_with_tags.cluster_name LIKE '%b01-h13%%' OR vds_with_tags.cpu_model LIKE '%b01-h13%%' OR vds_with_tags.free_text_comment LIKE '%b01-h13%%' OR vds_with_tags.host_name LIKE '%b01-h13%%' OR vds_with_tags.software_version LIKE '%b01-h13%%' OR vds_with_tags.storage_pool_name LIKE '%b01-h13%%' OR vds_with_tags.tag_name LIKE '%b01-h13%%' OR vds_with_tags.vds_name LIKE '%b01-h13%%') AND vms.vm_name LIKE '%L1%' ))) ORDER BY vm_name ASC) as T1 OFFSET (1 -1) LIMIT 100;" > explain_analize.txt; date

Start: Sun Jun 10 10:28:35 UTC 2018
End: Sun Jun 10 10:52:13 UTC 2018
Investigation:

Generated query for the VMs search "status=up and storage=iscsi_L1_vms_sd_2":

SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM vms_with_tags LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id WHERE (( vms.status = '1' AND storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' ) ))) ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100

The problem is that the inner query uses vms.status = '1' instead of vms_with_tags.status = '1', even though vms_with_tags is already in the FROM clause of the inner query. This correlates the subquery with the outer query, adding a huge nested join in a scaled env (<vms size> * <vms_with_tags size>) for each row of the outer query on the vms view.

This is a general problem, not specific to the status field or to the VM entity: it will occur every time we have a mix of fields that forces using the tag-supporting views in the inner query together with one or more conditions on the main entity's fields (VM in this case).

Suggested fixes:
=================
1) Force using the entity view with tag support for the nested query on the main entity fields. This requires listing all those fields explicitly to direct the search engine to handle them differently; we already have this mechanism in [1], for example.
2) Analyze the final query string, identify whether there is a nested query using tags, and force all references in the nested query to use the tag-supporting entity view.
3) Move all conditions related to the main entity out of the nested query. This way the query above becomes:

SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM vms_with_tags LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id WHERE (( storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' ) ))) AND vms.status = '1' ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100

Method #1
PROS: clean
CONS: static; new fields added to the search without this handling will still perform badly

Method #2
PROS: dynamic; fixes the query and also handles new fields when they are added
CONS:
1) More query processing (not critical since we use caching)
2) Ugly code

Method #3
PROS: actually solves the problem at its root
CONS: requires major changes in the search mechanism

Conclusion:
Method #1 is clean but not dynamic, and moderately risky.
Method #2 is not clean, and least risky.
Method #3 is the cleanest, but most risky.

We have to decide which approach to take in order to solve this kind of badly-performing query for all potential search engine queries. Martin, what do you think?

[1] https://github.com/oVirt/ovirt-engine/blob/ec7ca5babc1e5535fc8c17331e0b8975766ce1e4/backend/manager/modules/searchbackend/src/main/java/org/ovirt/engine/core/searchbackend/SearchObjectAutoCompleter.java#L57
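To make the correlation problem concrete, here is a minimal self-contained sketch in Python/SQLite, with toy tables standing in for the oVirt views (table names, columns, and data are illustrative, not the real schema). The first query reproduces the problematic shape, where the inner query references vms.status and must therefore be re-evaluated per outer row; the second hoists the main-entity condition out, as in method #3 above, and returns the same rows:

```python
import sqlite3

# Toy schema standing in for the oVirt views (illustrative only).
cur = sqlite3.connect(":memory:").cursor()
cur.executescript("""
CREATE TABLE vms (vm_guid TEXT, vm_name TEXT, status TEXT);
CREATE TABLE vms_with_tags (vm_guid TEXT, storage_id TEXT);
CREATE TABLE storage_domains (id TEXT, storage_name TEXT);
INSERT INTO vms VALUES ('g1','vm1','1'), ('g2','vm2','0'), ('g3','vm3','1');
INSERT INTO vms_with_tags VALUES ('g1','s1'), ('g2','s1'), ('g3','s2');
INSERT INTO storage_domains VALUES ('s1','iscsi_L1_vms_sd_1'), ('s2','other_sd');
""")

# Problematic shape: the inner query references vms.status, correlating the
# subquery with every row of the outer scan over vms.
correlated = cur.execute("""
SELECT vm_guid FROM vms
WHERE vm_guid IN (
    SELECT DISTINCT t.vm_guid
    FROM vms_with_tags t
    LEFT OUTER JOIN storage_domains s ON t.storage_id = s.id
    WHERE vms.status = '1'
      AND s.storage_name LIKE '%sd_1%')
ORDER BY vm_name ASC
""").fetchall()

# Fixed shape (method #3): the main-entity condition is hoisted out of the
# subquery, so the IN-list can be computed once.
hoisted = cur.execute("""
SELECT vm_guid FROM vms
WHERE vm_guid IN (
    SELECT DISTINCT t.vm_guid
    FROM vms_with_tags t
    LEFT OUTER JOIN storage_domains s ON t.storage_id = s.id
    WHERE s.storage_name LIKE '%sd_1%')
  AND status = '1'
ORDER BY vm_name ASC
""").fetchall()

# Both shapes return the same result set; only the evaluation cost differs.
assert correlated == hoisted == [('g1',)]
print(correlated)
```

At toy scale both run instantly; the point is that the correlated shape costs roughly |vms| * |vms_with_tags| work in a scaled environment, which matches the ~9-minute timings reported above.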
Discussed offline, we have chosen method #1
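A rough sketch of what method #1 amounts to, with hypothetical names (the real implementation lives in the search backend's field mappings, e.g. SearchObjectAutoCompleter linked in [1]; none of the identifiers below are actual oVirt code): statically map the main entity's listed search fields to the tag-supporting view, so conditions generated inside the nested query are qualified with vms_with_tags rather than vms:

```python
# Hypothetical sketch of method #1: a static field list directs the query
# builder to qualify main-entity fields with the tag-supporting view when
# they appear inside the tag-based nested query. Illustrative names only.
TAG_VIEW = {"VM": "vms_with_tags", "HOST": "vds_with_tags"}
PLAIN_VIEW = {"VM": "vms", "HOST": "vds"}
# Main-entity fields that must be redirected when tags are in play.
MAIN_ENTITY_FIELDS = {"VM": {"status", "vm_name"}, "HOST": {"host_name"}}

def qualify(entity: str, field: str, inside_tag_subquery: bool) -> str:
    """Return the view-qualified column reference for a search condition."""
    if inside_tag_subquery and field in MAIN_ENTITY_FIELDS[entity]:
        return f"{TAG_VIEW[entity]}.{field}"
    return f"{PLAIN_VIEW[entity]}.{field}"

# The bad query used vms.status inside the subquery; with the mapping it
# becomes vms_with_tags.status, removing the correlation.
print(qualify("VM", "status", inside_tag_subquery=True))   # vms_with_tags.status
print(qualify("VM", "status", inside_tag_subquery=False))  # vms.status
```

As noted in the CONS above, the trade-off is that any new searchable field added without an entry in such a list would regress to the correlated shape.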
Reproduced first on the following:
Engine: red01
Cluster: L0_Group_4, iSCSI LUN, 5 TB, 6 hosts
VMs count: 1500 VMs
Version: ovirt-engine-4.4.2-4-001.noarch
Query: [ status = up and Storage = L0_Group_4 ]
This operation took more than 10 min to complete.

Fix verified on the following:
Engine: red01
Cluster: L0_Group_4, iSCSI LUN, 5 TB, 6 hosts
VMs count: 1500 VMs
Version: ovirt-engine-4.4.5.8-0.9.el8ev.noarch
Query: [ status = up and Storage = L0_Group_4 ]
This operation took around 3 sec from the UI to complete successfully.
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.