Bug 1581677 - [scale] search (VMs + storage domain) is taking too long
Summary: [scale] search (VMs + storage domain) is taking too long
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.2.3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.5
Target Release: 4.4.5
Assignee: Eli Mesika
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-23 11:51 UTC by Ilan Zuckerman
Modified: 2021-03-22 12:55 UTC
CC List: 8 users

Fixed In Version: ovirt-engine-4.4.5
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 15:13:33 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.4+


Attachments
explain analyze output (45.07 KB, text/plain)
2018-06-07 09:51 UTC, mlehrer


Links
oVirt gerrit 112551 (master, MERGED): use tags views in inner queries when needed (last updated 2021-01-31 11:39:33 UTC)

Description Ilan Zuckerman 2018-05-23 11:51:09 UTC
Description of problem:

The following query takes more than 9 minutes to complete:
SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE (  vms.status = '1'  AND  (  storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_1%' )  )))  ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100

Version-Release number of selected component (if applicable):
ovirt engine 4.2.3.5-0.1.el7

#Topology
Engine / Engine DB (1 machine)
DWH DB (separate machine)
400 Hosts
RHEL 7.5 3.10.0-862.2.3.el7.x86_64
ovirt-engine-4.2.3.5-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Log in to webadmin
2. Compute -> VMs
3. In the search box, enter: status=up and storage=iscsi_L1_vms_sd_2

Actual results:
The query takes more than 9 minutes to complete.

Expected results:
The search should complete in seconds, not minutes.

Additional info:
When making the following API request with the same query params:

https://b01-h21-r620.rhev.openstack.engineering.redhat.com/ovirt-engine/api/vms;max=100?search=status%3Dup+and+storage%3Discsi_L1_vms_sd_1

Looking at postgres.log (configured to log queries with duration > 10 sec), we can see the following:

2018-05-23 11:48:01.340 UTC LOG:  duration: 563122.424 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE (  vms.status = '1'  AND  (  storage_domains_with_hosts_view.storage_comment LIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_description LIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_name LIKE '%iscsi\_L1\_vms\_sd\_1%' OR  storage_domains_with_hosts_view.storage_pool_name::text LIKE '%iscsi\_L1\_vms\_sd\_1%' )  )))  ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100
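
For reference, slow-statement logging like the above is controlled by PostgreSQL's log_min_duration_statement setting; a minimal sketch of enabling a 10-second threshold (assuming superuser access to the engine's PostgreSQL instance):

ALTER SYSTEM SET log_min_duration_statement = '10s';
SELECT pg_reload_conf();   -- apply the change without restarting PostgreSQL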

Comment 2 Yaniv Kaul 2018-05-24 06:49:58 UTC
1. The title is wrong. Please change it to reflect the issue - search is taking too long.
2. Is the issue with the number of hosts? Or the number of... ? For example, you have >4K VMs and I assume at least a similar number of disks across multiple domains.
3. Can you provide more information on the setup - database, storage for it, Engine CPU, memory, ...
Your database seems to be on a Netapp - is it fast? There's a single interface (10g) to it, so I assume a single active path with 2 sessions? Or are both active? I'm not sure your multipath.conf is configured for best practices for that Netapp. Is it SSD based?

Comment 3 Ilan Zuckerman 2018-05-24 07:39:18 UTC
(In reply to Yaniv Kaul from comment #2)
> 1. The title is wrong. Please change it to reflect the issue - search is
> taking too long.
Changed

> 2. Is the issue with the number of hosts? Or the number of... ? For example,
> you have >4K VMs and I assume at least a similar number of disks across
> multiple domains.
> 3. Can you provide more information on the setup - database, storage for it,
> Engine CPU, memory, ...
> Your database seems to be on a Netapp - is it fast? There's a single
> interface (10g) to it, so I assume a single active path with 2 sessions? Or
> are both active? I'm not sure your multipath.conf is configured for best
> practices for that Netapp. Is it SSD based?

I think the issue is really with this particular query I specified above. You can try making this REST GET call through any REST client and see that it takes forever:
https://b01-h21-r620.rhev.openstack.engineering.redhat.com/ovirt-engine/api/vms;max=10?search=status%3Dup+and+storage%3Discsi_L1_vms_sd_1

Observe the latest pg log, in this case:
/var/opt/rh/rh-postgresql95/lib/pgsql/data/pg_log/postgresql-Wed.log
It got clogged with the query I specified above, even though nothing is invoking it.

Please note that I limit the query results to show ONLY 10 VMs.
Lowering the max value to 1 makes this query much faster.

One more important point:
In our environment we have ~4K VMs, but querying them with a search parameter like status=up takes ~1 sec to complete. So the issue here is not retrieving thousands of results, but the particular search parameter called 'storage'. In other words, whenever any of the search parameters are combined with the 'storage' parameter, we see the duration issue.

Comment 5 Roy Golan 2018-05-27 05:37:26 UTC
The pass criterion for this trivial search is under 1 second, even for 4K VMs.

Similar to Bug 1578357, it is possible that this search can avoid using tags (keep me honest here that this is indeed the case).

Comment 7 Eli Mesika 2018-05-27 09:28:33 UTC
Please attach a dump of the database for further investigation.

Comment 9 Roy Golan 2018-06-05 13:32:34 UTC
I'm trying to run EXPLAIN ANALYZE on the following query and it takes 100% CPU. Examining top output I don't see any I/O wait, just busy CPUs. vmstat also looks OK. Counting the objects, I see ~4K entries in vms_with_tags and 400 in vds_with_tags.

We need to reproduce the same with lower counts on other systems and see what we get.
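
For reference, the approximate object counts mentioned above can be confirmed with simple count queries against the views (a sketch, run against the engine database):

SELECT count(*) FROM vms_with_tags;   -- ~4000 rows on this setup
SELECT count(*) FROM vds_with_tags;   -- ~400 rows on this setup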


SELECT * FROM
(
SELECT * FROM vms WHERE
( vm_guid IN 
(SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags
   LEFT OUTER JOIN vds_with_tags ON vms_with_tags.run_on_vds=vds_with_tags.vds_id
   WHERE vms.vm_name LIKE '%L1%'
   AND (
  (
  vds_with_tags.cluster_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.cpu_model LIKE '%b01-h13%%'
        OR
  vds_with_tags.free_text_comment LIKE '%b01-h13%%'
        OR
  vds_with_tags.host_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.software_version LIKE '%b01-h13%%'
        OR
  vds_with_tags.storage_pool_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.tag_name LIKE '%b01-h13%%'
        OR
  vds_with_tags.vds_name LIKE '%b01-h13%%'
 )
  AND
 vms.vm_name LIKE '%L1%'  )))  ORDER BY vm_name ASC


) as T1 OFFSET (1 -1) LIMIT 100;

Comment 10 mlehrer 2018-06-06 17:03:17 UTC
(In reply to Roy Golan from comment #9)
> I'm trying to do explain enalyze on the following query and it takes 100c%
> cpu. Examining top output I don't see any io wait, just working cpus. Also
> vmstat looks ok. Counting the objects, I see ~ 4k entries in vms_with tags,
> and 400 in vds_with_tags.
> 
> need to reproduce the same with lower counts on other systems and see what
> we get

The explain was re-run and the results can be found here:
 https://privatebin-it-iso.int.open.paas.redhat.com/?429636bf68822a2d#W7c7urbeuFAZOqzLfuMRCnYA7Gs28LeDpXQkboEQp6w=



> 
> 
> SELECT * FROM
> (
> SELECT * FROM vms WHERE
> ( vm_guid IN 
> (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags
>    LEFT OUTER JOIN vds_with_tags ON
> vms_with_tags.run_on_vds=vds_with_tags.vds_id
>    WHERE vms.vm_name LIKE '%L1%'
>    AND (
>   (
>   vds_with_tags.cluster_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.cpu_model LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.free_text_comment LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.host_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.software_version LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.storage_pool_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.tag_name LIKE '%b01-h13%%'
>         OR
>   vds_with_tags.vds_name LIKE '%b01-h13%%'
>  )
>   AND
>  vms.vm_name LIKE '%L1%'  )))  ORDER BY vm_name ASC
> 
> 
> ) as T1 OFFSET (1 -1) LIMIT 100;

Comment 11 mlehrer 2018-06-07 09:51:47 UTC
Created attachment 1448664 [details]
explain analyze output

The previously shared link is no longer valid; I've attached the explain analyze output as its own file here.

Comment 12 Ilan Zuckerman 2018-06-10 11:51:16 UTC
Attaching the EXPLAIN ANALYZE command output after a full vacuum, per Roy's request:
explain_analize_after_vacum.txt
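
For reference, the full vacuum mentioned above corresponds to a statement along these lines (a sketch; the exact form used may have differed):

VACUUM (FULL, ANALYZE, VERBOSE);   -- rewrites all tables and refreshes planner statistics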

The query:

-=>>date; psql -t -l -U engine -c "EXPLAIN ANALYZE SELECT * FROM(SELECT * FROM vms WHERE( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags LEFT OUTER JOIN vds_with_tags ON vms_with_tags.run_on_vds=vds_with_tags.vds_id WHERE vms.vm_name LIKE '%L1%' AND ((vds_with_tags.cluster_name LIKE '%b01-h13%%' OR vds_with_tags.cpu_model LIKE '%b01-h13%%' OR vds_with_tags.free_text_comment LIKE '%b01-h13%%' OR vds_with_tags.host_name LIKE '%b01-h13%%' OR vds_with_tags.software_version LIKE '%b01-h13%%' OR vds_with_tags.storage_pool_name LIKE '%b01-h13%%' OR vds_with_tags.tag_name LIKE '%b01-h13%%' OR vds_with_tags.vds_name LIKE '%b01-h13%%') AND vms.vm_name LIKE '%L1%'  ))) ORDER BY vm_name ASC) as T1 OFFSET (1 -1) LIMIT 100;" > explain_analize.txt; date

Start: Sun Jun 10 10:28:35 UTC 2018
End: Sun Jun 10 10:52:13 UTC 2018

Comment 14 Eli Mesika 2020-12-02 01:32:56 UTC
Investigation:

Generated query for the search:

VMS: status=up and storage=iscsi_L1_vms_sd_2

SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE ((  vms.status = '1' AND storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' )  ))) ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100

The problem here is that the inner query uses vms.status = '1' instead of vms_with_tags.status = '1', even though vms_with_tags is already in the FROM clause of the inner query. This correlated reference to the outer vms view adds a huge nested join in a scaled environment (<vms size> * <vms_with_tags size>) for each row of the outer query on the vms view.

This is a general problem that is not specific to the status field or to the VM entity; it will occur every time we have a mix of fields that forces the inner query to use the tag-supporting views together with one or more conditions on the main entity's fields (VM in this case).
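
For illustration, rewriting the inner condition to reference the tags view rather than the outer vms view removes the correlated reference. A sketch of what the corrected inner query would look like (not necessarily the exact SQL the fix generates):

SELECT DISTINCT vms_with_tags.vm_guid
FROM vms_with_tags
LEFT OUTER JOIN storage_domains_with_hosts_view
  ON vms_with_tags.storage_id = storage_domains_with_hosts_view.id
WHERE vms_with_tags.status = '1'   -- condition now on the tags view, not on the outer vms view
  AND ( storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%'
     OR storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%'
     OR storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%'
     OR storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' )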


Suggested fixes:
================= 

1) Force using the entity view with tag support for the nested query on the main entity's fields. This requires listing all those fields explicitly to direct the search engine to handle them differently; we already have this mechanism in [1], for example.

2) Analyze the final query string, identify whether there is a nested query using tags, and force all references in the nested query to use the tag-supporting entity view.

3) Move all conditions related to the main entity out of the nested query; this way the query above becomes:

SELECT * FROM (SELECT * FROM vms WHERE ( vm_guid IN (SELECT distinct vms_with_tags.vm_guid FROM  vms_with_tags   LEFT OUTER JOIN storage_domains_with_hosts_view ON vms_with_tags.storage_id=storage_domains_with_hosts_view.id    WHERE ((  storage_domains_with_hosts_view.storage_comment ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_description ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_name ILIKE '%iscsi\_L1\_vms\_sd\_2%' OR  storage_domains_with_hosts_view.storage_pool_name::text ILIKE '%iscsi\_L1\_vms\_sd\_2%' )  ))) AND vms.status = '1' ORDER BY vm_name ASC ) as T1 OFFSET (1 -1) LIMIT 100


Method #1

PROS: clean
CONS: static; new fields added to the search without this handling will still perform badly

Method #2

PROS: dynamic; will fix the query and also handle new fields when they are added
CONS: 1) More query processing (not critical since we use caching)
      2) Ugly code
      
Method #3

PROS: actually solves the problem at its root
CONS: requires major changes in the search mechanism


Conclusion:


Method #1 is clean but not dynamic, and moderately risky
Method #2 is not clean, and the least risky
Method #3 is the cleanest, but the most risky

We have to decide which approach to take in order to solve this kind of badly performing query for all potential search engine queries.

Martin, what do you think?


[1] https://github.com/oVirt/ovirt-engine/blob/ec7ca5babc1e5535fc8c17331e0b8975766ce1e4/backend/manager/modules/searchbackend/src/main/java/org/ovirt/engine/core/searchbackend/SearchObjectAutoCompleter.java#L57

Comment 15 Martin Perina 2020-12-02 10:23:27 UTC
Discussed offline, we have chosen method #1

Comment 16 Tzahi Ashkenazi 2021-03-07 14:58:26 UTC
Reproduced first on the following:
Engine: red01
Cluster: L0_Group_4, iSCSI LUN, 5 TB, 6 hosts, VM count: 1500 VMs
Version: ovirt-engine-4.4.2-4-001.noarch
Query: [ status = up and Storage = L0_Group_4 ]
This operation took more than 10 minutes to complete.


Fix verification on the following:
Engine: red01
Cluster: L0_Group_4, iSCSI LUN, 5 TB, 6 hosts, VM count: 1500 VMs
Version: ovirt-engine-4.4.5.8-0.9.el8ev.noarch
Query: [ status = up and Storage = L0_Group_4 ]
This operation took around 3 seconds from the UI to complete successfully.

Comment 17 Sandro Bonazzola 2021-03-18 15:13:33 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 18 Sandro Bonazzola 2021-03-22 12:55:35 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

