Bug 1581677 - [scale] search (VMs + storage domain) is taking too long
Summary: [scale] search (VMs + storage domain) is taking too long
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.2.3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.5
Target Release: 4.4.5
Assignee: Eli Mesika
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-23 11:51 UTC by Ilan Zuckerman
Modified: 2021-03-22 12:55 UTC
CC List: 8 users

Fixed In Version: ovirt-engine-4.4.5
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 15:13:33 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.4+


Attachments
explain analyze output (45.07 KB, text/plain)
2018-06-07 09:51 UTC, mlehrer


Links
oVirt gerrit 112551 (master, MERGED): use tags views in inner queries when needed (last updated 2021-01-31 11:39:33 UTC)

Description Ilan Zuckerman 2018-05-23 11:51:09 UTC
Description of problem:

The following query takes more than 9 minutes to complete:
SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE (  vms.status = '1'  AND  (  storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_1%' )  )))  ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100

Version-Release number of selected component (if applicable):
ovirt engine 4.2.3.5-0.1.el7

#Topology
Engine / Engine DB (1 machine)
DWH DB (separate machine)
400 Hosts
RHEL 7.5 3.10.0-862.2.3.el7.x86_64
ovirt-engine-4.2.3.5-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Log in to webadmin
2. Compute -> VMs
3. In the search box, enter: status=up and storage=iscsi_L1_vms_sd_2

Actual results:
The query takes more than 9 minutes to complete.

Expected results:
The search should complete in seconds, not minutes.

Additional info:
When making the following API request with the same query params:

https://b01-h21-r620.rhev.openstack.engineering.redhat.com/ovirt-engine/api/vms;max=100?search=status%3Dup+and+storage%3Discsi_L1_vms_sd_1

Looking at postgres.log (configured to log queries with duration > 10 sec), we can see the following:

2018-05-23 11:48:01.340 UTC LOG:  duration: 563122.424 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE (  vms.status = '1'  AND  (  storage_domains_with_hosts_view.storage_comment LIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_description LIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_name LIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_pool_name::text LIKE '%iscsi\_L1\_vms\_sd\_1%' )  )))  ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100
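
For reference, slow-statement logging like the above is controlled by PostgreSQL's log_min_duration_statement setting; a minimal sketch of enabling a 10-second threshold (assuming superuser access to the engine's PostgreSQL instance):

ALTER SYSTEM SET log_min_duration_statement = '10s';
SELECT pg_reload_conf();   -- apply the change without restarting PostgreSQL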

Comment 2 Yaniv Kaul 2018-05-24 06:49:58 UTC
1. The title is wrong. Please change it to reflect the issue - search is taking too long.
2. Is the issue with the number of hosts? Or the number of... ? For example, you have >4K VMs and I assume at least a similar number of disks across multiple domains.
3. Can you provide more information on the setup - database, storage for it, Engine CPU, memory, ...
Your database seems to be on a Netapp - is it fast? There's a single interface (10g) to it, so I assume a single active path with 2 sessions? Or are both active? I'm not sure your multipath.conf is configured for best practices for that Netapp. Is it SSD based?

Comment 3 Ilan Zuckerman 2018-05-24 07:39:18 UTC
(In reply to Yaniv Kaul from comment #2)
> 1. The title is wrong. Please change it to reflect the issue - search is
> taking too long.
Changed

> 2. Is the issue with the number of hosts? Or the number of... ? For example,
> you have >4K VMs and I assume at least a similar number of disks across
> multiple domains.
> 3. Can you provide more information on the setup - database, storage for it,
> Engine CPU, memory, ...
> Your database seems to be on a Netapp - is it fast? There's a single
> interface (10g) to it, so I assume a single active path with 2 sessions? Or
> are both active? I'm not sure your multipath.conf is configured for best
> practices for that Netapp. Is it SSD based?

I think the issue is really with this particular query I specified above. You can try making this REST GET call through any REST client and see that it takes forever:
https://b01-h21-r620.rhev.openstack.engineering.redhat.com/ovirt-engine/api/vms;max=10?search=status%3Dup+and+storage%3Discsi_L1_vms_sd_1

Observe the latest pg log, in this case:
/var/opt/rh/rh-postgresql95/lib/pgsql/data/pg_log/postgresql-Wed.log
It got clogged with the query I specified above, even though nothing is invoking it.

Please note that I limit the query results to show ONLY 10 VMs.
Lowering the max value to 1 makes this query much faster.

One more important point:
In our environment we have ~4K VMs, but querying them with a search parameter like status=up takes ~1 sec to complete. So the issue here is not retrieving thousands of results, but the particular search parameter called 'storage'. In other words, whenever any of the search parameters are combined with the 'storage' parameter, we see the duration issue.

Comment 5 Roy Golan 2018-05-27 05:37:26 UTC
The pass criterion for this trivial search is under 1 second, even for 4K VMs.

Similar to Bug 1578357, it is possible that this search can avoid using tags (keep me honest here that this is indeed the case).

Comment 7 Eli Mesika 2018-05-27 09:28:33 UTC
Please attach a dump of the database for further investigation.

Comment 9 Roy Golan 2018-06-05 13:32:34 UTC
I'm trying to run EXPLAIN ANALYZE on the following query and it takes 100% CPU. Examining top output I don't see any I/O wait, just busy CPUs. vmstat also looks OK. Counting the objects, I see ~4K entries in vms_with_tags and 400 in vds_with_tags.

We need to reproduce the same with lower counts on other systems and see what we get.
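
For reference, the approximate object counts mentioned above can be confirmed with simple count queries against the views (a sketch, run against the engine database):

SELECT count(*) FROM vms_with_tags;   -- ~4000 rows on this setup
SELECT count(*) FROM vds_with_tags;   -- ~400 rows on this setup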


SELECT * FROM
(
SELECT * FROM vms WHERE
( vm_guid IN 
(SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags
   LEFT OUTER JOIN vds_with_tags ON vms_with_tags.run_on_vds=vds_with_tags.vds_id
   WHERE vms.vm_name LIKE '%L1%'
   AND (
  (
  vds_with_tags.cluster_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.cpu_model LIKE '%b01-h13%%'
        OR
  vds_with_tags.free_text_comment LIKE '%b01-h13%%'
        OR
  vds_with_tags.host_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.software_version LIKE '%b01-h13%%'
        OR
  vds_with_tags.storage_pool_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.tag_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.vds_name LIKE '%b01-h13%%'
 )
  AND
 vms.vm_name LIKE '%L1%'  )))  ORDER BY vm_name ASC


) as T1 OFFSET (1 -1) LIMIT 100;

Comment 10 mlehrer 2018-06-06 17:03:17 UTC
(In reply to Roy Golan from comment #9)
> I'm trying to do explain enalyze on the following query and it takes 100c%
> cpu. Examining top output I don't see any io wait, just working cpus. Also
> vmstat looks ok. Counting the objects, I see ~ 4k entries in vms_with tags,
> and 400 in vds_with_tags.
> 
> need to reproduce the same with lower counts on other systems and see what
> we get

The explain was re-run and the results can be found here:
 https://privatebin-it-iso.int.open.paas.redhat.com/?429636bf68822a2d#W7c7urbeuFAZOqzLfuMRCnYA7Gs28LeDpXQkboEQp6w=



> 
> 
> SELECT * FROM
> (
> SELECT * FROM vms WHERE
> ( vm_guid IN 
> (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags
>    LEFT OUTER JOIN vds_with_tags ON
> vms_with_tags.run_on_vds=vds_with_tags.vds_id
>    WHERE vms.vm_name LIKE '%L1%'
>    AND (
>   (
>   vds_with_tags.cluster_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.cpu_model LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.free_text_comment LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.host_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.software_version LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.storage_pool_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.tag_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.vds_name LIKE '%b01-h13%%'
>  )
>   AND
>  vms.vm_name LIKE '%L1%'  )))  ORDER BY vm_name ASC
> 
> 
> ) as T1 OFFSET (1 -1) LIMIT 100;

Comment 11 mlehrer 2018-06-07 09:51:47 UTC
Created attachment 1448664 [details]
explain analyze output

The previously shared link is no longer valid; I've attached the explain analyze output as its own file here.

Comment 12 Ilan Zuckerman 2018-06-10 11:51:16 UTC
Attaching the EXPLAIN ANALYZE command output after a full vacuum, per Roy's request:
explain_analize_after_vacum.txt
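
For reference, the full vacuum mentioned above corresponds to a statement along these lines (a sketch; the exact form used may have differed):

VACUUM (FULL, ANALYZE, VERBOSE);   -- rewrites all tables and refreshes planner statistics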

The query:

-=>>date; psql -t -l -U engine -c "EXPLAIN ANALYZE SELECT * FROM(SELECT * FROM vms WHERE( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags LEFT OUTER JOIN vds_with_tags ON vms_with_tags.run_on_vds=vds_with_tags.vds_id WHERE vms.vm_name LIKE '%L1%' AND ((vds_with_tags.cluster_name LIKE '%b01-h13%%' OR vds_with_tags.cpu_model LIKE '%b01-h13%%' OR vds_with_tags.free_text_comment LIKE '%b01-h13%%' OR vds_with_tags.host_name LIKE '%b01-h13%%' OR vds_with_tags.software_version LIKE '%b01-h13%%' OR vds_with_tags.storage_pool_name LIKE '%b01-h13%%' OR vds_with_tags.tag_name LIKE '%b01-h13%%' OR vds_with_tags.vds_name LIKE '%b01-h13%%') AND vms.vm_name LIKE '%L1%'  ))) ORDER BY vm_name ASC) as T1 OFFSET (1 -1) LIMIT 100;" > explain_analize.txt; date

Start: Sun Jun 10 10:28:35 UTC 2018
End: Sun Jun 10 10:52:13 UTC 2018

Comment 14 Eli Mesika 2020-12-02 01:32:56 UTC
Investigation:

Generated query for the search:

VMS: status=up and storage=iscsi_L1_vms_sd_2

SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE ((  vms.status = '1' AND storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' )  ))) ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100

The problem here is that the inner query uses vms.status = '1' instead of vms_with_tags.status = '1', even though vms_with_tags is already in the FROM clause of the inner query. This correlated reference to the outer vms view adds a huge nested join in a scaled environment (<vms size> * <vms_with_tags size>) for each row of the outer query on the vms view.

This is a general problem that is not specific to the status field or to the VM entity; it will occur every time we have a mix of fields that forces the inner query to use the tag-supporting views together with one or more conditions on the main entity's fields (VM in this case).
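
For illustration, rewriting the inner condition to reference the tags view rather than the outer vms view removes the correlated reference. A sketch of what the corrected inner query would look like (not necessarily the exact SQL the fix generates):

SELECT DISTINCT vms_with_tags.vm_guid
FROM vms_with_tags
LEFT OUTER JOIN storage_domains_with_hosts_view
  ON vms_with_tags.storage_id = storage_domains_with_hosts_view.id
WHERE vms_with_tags.status = '1'   -- condition now on the tags view, not on the outer vms view
  AND ( storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%'
     OR storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%'
     OR storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%'
     OR storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' )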


Suggested fixes:
================= 

1) Force using the entity view with tag support for the nested query on the main entity's fields. This requires listing all those fields explicitly to direct the search engine to handle them differently; we already have this mechanism in [1], for example.

2) Analyze the final query string, identify whether there is a nested query using tags, and force all references in the nested query to use the tag-supporting entity view.

3) Move all conditions related to the main entity out of the nested query; this way the query above becomes:

SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE ((  storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' )  ))) AND vms.status = '1' ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100


Method #1

PROS: clean
CONS: static; new fields added to the search without this handling will still perform badly

Method #2

PROS: dynamic; will fix the query and also handle new fields when they are added
CONS: 1) More query processing (not critical since we use caching)
      2) Ugly code
      
Method #3

PROS: actually solves the problem at its root
CONS: requires major changes in the search mechanism


Conclusion:


Method #1 is clean but not dynamic, and moderately risky
Method #2 is not clean, and the least risky
Method #3 is the cleanest, but the most risky

We have to decide which approach to take in order to solve this kind of badly performing query for all potential search engine queries.

Martin, what do you think?


[1] https://github.com/oVirt/ovirt-engine/blob/ec7ca5babc1e5535fc8c17331e0b8975766ce1e4/backend/manager/modules/searchbackend/src/main/java/org/ovirt/engine/core/searchbackend/SearchObjectAutoCompleter.java#L57

Comment 15 Martin Perina 2020-12-02 10:23:27 UTC
Discussed offline, we have chosen method #1

Comment 16 Tzahi Ashkenazi 2021-03-07 14:58:26 UTC
Reproduced first on the following:
Engine: red01
Cluster: L0_Group_4, iSCSI LUN, 5 TB, 6 hosts, VM count: 1500 VMs
Version: ovirt-engine-4.4.2-4-001.noarch
Query: [ status = up and Storage = L0_Group_4 ]
This operation took more than 10 minutes to complete.


Fix verification on the following:
Engine: red01
Cluster: L0_Group_4, iSCSI LUN, 5 TB, 6 hosts, VM count: 1500 VMs
Version: ovirt-engine-4.4.5.8-0.9.el8ev.noarch
Query: [ status = up and Storage = L0_Group_4 ]
This operation took around 3 seconds from the UI to complete successfully.

Comment 17 Sandro Bonazzola 2021-03-18 15:13:33 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 18 Sandro Bonazzola 2021-03-22 12:55:35 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

