Bug 1583619

Summary:	[downstream clone - 4.2.4] [SCALE] Listing users in Users tab overloads the postgresql DB (CPU)
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	RHV bug bot <rhv-bugzilla-bot>
Component:	ovirt-engine	Assignee:	Roy Golan <rgolan>
Status:	CLOSED ERRATA	QA Contact:	mlehrer
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	4.1.10	CC:	bgraveno, dagur, emesika, gveitmic, joboyer, knakai, lsurette, lveyde, mgoldboi, mlehrer, mperina, pstehlik, rbalakri, rgolan, Rhev-m-bugs, rhodain, srevivo, ykaul
Target Milestone:	ovirt-4.2.4	Keywords:	Performance, ZStream
Target Release:	---	Flags:	lsvaty: testing_plan_complete-
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	ovirt-engine-4.2.4.1	Doc Type:	Bug Fix
Doc Text:	This update fixes an inefficient query that is generated when users click on the 'Users' tab in the Administration Portal. The fix ensures that the tab loads quicker.	Story Points:	---
Clone Of:	1578357	Environment:
Last Closed:	2018-06-27 10:02:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Infra	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1578357
Bug Blocks:

Description RHV bug bot 2018-05-29 11:40:50 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1578357 +++
======================================================================

Description of problem:
When a high number of users are created in the environment and these users are assigned to different entities the system can get overloaded because of the Postgresql CPU load.

Version-Release number of selected component (if applicable):
4.1.10

How reproducible:
100%

Steps to Reproduce:
1.Crete 100 users
2.Assign these users to 100VM (disk operator)
3.Go to tab Users or execute sql:
SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;


Actual results:
each thread handling the execution of this SQL query takes 100% of a CPU core.

Expected results:
The sql is executed fast.

Additional info:
This issue is caused by the where condition within the subquery that forces the PostgreSQL to use inefficient execution plan. The performance of the query increases by simply moving the where condition outside the subquery.

(Originally by Roman Hodain)

Comment 1 RHV bug bot 2018-05-29 11:40:58 UTC

Created attachment 1436781 [details]
execution plan (condition outside of the subquery)

explain analyze SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags )) and  vdc_users.user_group = 'user'  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;

(Originally by Roman Hodain)

Comment 2 RHV bug bot 2018-05-29 11:41:04 UTC

Created attachment 1436782 [details]
execution plan (condition inside of the subquery)

The sql is not analyzed as the execution doe sot finish in reasonable time.

explain SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;

(Originally by Roman Hodain)

Comment 3 RHV bug bot 2018-05-29 11:41:09 UTC

# select count(*) from vdc_users_with_tags;
 count  
--------
 711945
(1 row)

# select count(*) from users;
 count 
-------
  9353
(1 row)

I have done the same tests with Postgres 9.5. The unmodified sql takes ~16sec and the modified query takes almost nothing.

As this is generated sql. Can we modify the code to move the condition outside of the subquery?

(Originally by Roman Hodain)

Comment 4 RHV bug bot 2018-05-29 11:41:14 UTC

Bug 1510432 which is merged to 4.2 will possibly solve that, I just need to verity the fix is doing that for tag searches like here. It handles 2 things - remove the generated 'distinct', thus choosing simpler query plans for most searches. Specifically it won't do the heaviest part here which is the huge sort, and choose disk space to do that. 

Bug 1532802 might be related as well, didn't dive deep into the details.

Roman your suggestion may be good, I suspect it may miss results though.

(Originally by Roy Golan)

Comment 6 RHV bug bot 2018-05-29 11:41:26 UTC

Regarding the work_mem. The execution plan does not use external sorts. Everything seems to fit into memory. I have also done some tests with much higher work mem than we introduced (256MB). The result is the same.

I have also checked with RHV 4.2.2.6 and the results are the same. The sql query is generated in the same way as before.

(Originally by Roman Hodain)

Comment 8 RHV bug bot 2018-05-29 11:41:42 UTC

(In reply to Roman Hodain from comment #6)
> Regarding the work_mem. The execution plan does not use external sorts.
> Everything seems to fit into memory.

You mentioned the analyze didn't complete in reasonable time, so we don't see the whole plan. Also in the modified query analysis I clearly see external sort -  
  "Sort Method: external merge  Disk: 91384kB"

Is it possible to wait for the analysis to be done?

(Originally by Roy Golan)

Comment 11 RHV bug bot 2018-05-29 11:41:59 UTC

(In reply to Roy Golan from comment #8)
> (In reply to Roman Hodain from comment #6)
> > Regarding the work_mem. The execution plan does not use external sorts.
> > Everything seems to fit into memory.
> 
> You mentioned the analyze didn't complete in reasonable time, so we don't
> see the whole plan. Also in the modified query analysis I clearly see
> external sort -  
>   "Sort Method: external merge  Disk: 91384kB"
> 
> Is it possible to wait for the analysis to be done?

regarding the external sort, yes it is in the modified query I missed that. Anyway, it is 90MB which may be too much for increasing the work_mem to this amount.

I have tried to let it finish, but it took more than one hour and it was still running. I have increased the work_mem to 1GB just for the curiosity. I will let you know if it finishes.

(Originally by Roman Hodain)

Comment 17 RHV bug bot 2018-05-29 11:42:35 UTC

Roman, I agree the WHERE condition should be outside, instead of generating a huge inner dataset to pass on (that's completely foolish thing to do inside the inner query on a the right table... like always, generated queries eventually hit you).

We can either fix that by doing so and another option is to completly remove the join on the tag table, which I seriously doubt is needed.

Eli do you know why the join on vdc_users_with_tags is needed at all?

(Originally by Roy Golan)

Comment 18 RHV bug bot 2018-05-29 11:42:41 UTC

(In reply to Roy Golan from comment #17)

> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.
> 
> Eli do you know why the join on vdc_users_with_tags is needed at all?

This join should be used only when the search includes tags, if it is generated in other cases then this is a bug that we should fix

(Originally by Eli Mesika)

Comment 19 RHV bug bot 2018-05-29 11:42:47 UTC

(In reply to Roy Golan from comment #17)
> Roman, I agree the WHERE condition should be outside, instead of generating
> a huge inner dataset to pass on (that's completely foolish thing to do
> inside the inner query on a the right table... like always, generated
> queries eventually hit you).
> 
> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.
> 
> Eli do you know why the join on vdc_users_with_tags is needed at all?

Hi Roy,

would it be worth to go both ways? We can remove the tags, but moving the condition outside could help in other cases. Sooner or later we may hit a similar issue in a different environment in a similar query.

(Originally by Roman Hodain)

Comment 22 RHV bug bot 2018-05-29 11:43:04 UTC

(In reply to Roy Golan from comment #17)
> Roman, I agree the WHERE condition should be outside, instead of generating
> a huge inner dataset to pass on (that's completely foolish thing to do
> inside the inner query on a the right table... like always, generated
> queries eventually hit you).
> 
> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.

Let's begin with the WHERE outside. Roman, does that help? 
> 
> Eli do you know why the join on vdc_users_with_tags is needed at all?

Roman, are they using tags?

(Originally by Yaniv Kaul)

Comment 24 RHV bug bot 2018-05-29 11:43:16 UTC

(In reply to Yaniv Kaul from comment #22)
> (In reply to Roy Golan from comment #17)
> > Roman, I agree the WHERE condition should be outside, instead of generating
> > a huge inner dataset to pass on (that's completely foolish thing to do
> > inside the inner query on a the right table... like always, generated
> > queries eventually hit you).
> > 
> > We can either fix that by doing so and another option is to completly remove
> > the join on the tag table, which I seriously doubt is needed.
> 
> Let's begin with the WHERE outside. Roman, does that help? 
Yes it helps significantly.
> > 
> > Eli do you know why the join on vdc_users_with_tags is needed at all?
> 
> Roman, are they using tags?
No, they currently suffer from this issue when simply clicking on the tab or calling RestAPI on /users

(Originally by Roman Hodain)

Comment 25 RHV bug bot 2018-05-29 11:43:21 UTC

(In reply to Roman Hodain from comment #24)
> (In reply to Yaniv Kaul from comment #22)
> > (In reply to Roy Golan from comment #17)
> > > Roman, I agree the WHERE condition should be outside, instead of generating
> > > a huge inner dataset to pass on (that's completely foolish thing to do
> > > inside the inner query on a the right table... like always, generated
> > > queries eventually hit you).
> > > 
> > > We can either fix that by doing so and another option is to completly remove
> > > the join on the tag table, which I seriously doubt is needed.
> > 
> > Let's begin with the WHERE outside. Roman, does that help? 
> Yes it helps significantly.
> > > 
> > > Eli do you know why the join on vdc_users_with_tags is needed at all?
> > 
> > Roman, are they using tags?
> No, they currently suffer from this issue when simply clicking on the tab or
> calling RestAPI on /users

So the above sounds like our hotfix.

(Originally by Yaniv Kaul)

Comment 26 RHV bug bot 2018-05-29 11:43:26 UTC

(In reply to Yaniv Kaul from comment #25)
> (In reply to Roman Hodain from comment #24)
> > (In reply to Yaniv Kaul from comment #22)
> > > (In reply to Roy Golan from comment #17)
> > > > Roman, I agree the WHERE condition should be outside, instead of generating
> > > > a huge inner dataset to pass on (that's completely foolish thing to do
> > > > inside the inner query on a the right table... like always, generated
> > > > queries eventually hit you).
> > > > 
> > > > We can either fix that by doing so and another option is to completly remove
> > > > the join on the tag table, which I seriously doubt is needed.
> > > 
> > > Let's begin with the WHERE outside. Roman, does that help? 
> > Yes it helps significantly.
> > > > 
> > > > Eli do you know why the join on vdc_users_with_tags is needed at all?
> > > 
> > > Roman, are they using tags?
> > No, they currently suffer from this issue when simply clicking on the tab or
> > calling RestAPI on /users
> 
> So the above sounds like our hotfix.

So far there is no hotfix, the SQL is generated automatically by SearchEngine, there is no way how to use modified query without modifying SearchEngine SQL generation. And this is very delicate and it will probably take a few more days to have a patch which don't break other flows.

(Originally by Martin Perina)

Comment 27 RHV bug bot 2018-05-29 11:43:31 UTC

> --------
>  711945
> (1 row)
> 
> # select count(*) from users;
>  count 
> -------
>   9353
> (1 row)

Another small observation:
The view 'vdc_users_with_tags' have 700,000 rows while there are only 10,000 users. are there 70 tags for each user? This is possibly a bug in the view.

Can we get a dump to look into it? if there is a bug there we might be able to tweak the view, it's way easier and less risky.

More thoughts on this:
- the users with tags should not be crossed join if the search query isn't looking for tags at all.
- I think anyhow Roman's suggestion of moving the WHERE at the outer query is correct and should be included.

(Originally by Roy Golan)

Comment 28 RHV bug bot 2018-05-29 11:43:37 UTC

Roman 

Can you specify please the search string as it is passed to the search engine ?
AFAIK , tag views are not involved as long as you are not searching for tags.

Thanks

(Originally by Eli Mesika)

Comment 29 RHV bug bot 2018-05-29 11:43:43 UTC

Apart for the initial fix I propose to revisit the vdc_users_with_tags view to contain only records with tag that is not null. This should narrow down the view table dramatically. moreover, it will serve as an easy hot fix

currently select user_id,tag_id from vdc_users_with_tags:

              user_id                | tag_id 
--------------------------------------+--------
 c4335808-3cdb-11e8-867d-0242ac130002 | 
...

And the row number is >= the number of VMs this user has permission on

(Originally by Roy Golan)

Comment 30 RHV bug bot 2018-05-29 11:43:49 UTC

Extending my last comment - the change to the view will not be enough since the search is producing where on the user_id in the tag table. So the default (wrongly generated) search will return zero results.

(Originally by Roy Golan)

Comment 31 RHV bug bot 2018-05-29 11:43:54 UTC

I think we should do both, the patches to fix SearchEngine and another patch to remove null tags from users_with_tags view.

(Originally by Martin Perina)

Comment 32 RHV bug bot 2018-05-29 11:44:00 UTC

(In reply to Martin Perina from comment #31)
> I think we should do both, the patches to fix SearchEngine and another patch
> to remove null tags from users_with_tags view.

Worth a different BZ IMO

(Originally by Eli Mesika)

Comment 33 RHV bug bot 2018-05-29 11:44:07 UTC

I think we can't do it, because this search is meant for a general search on User/groups + VMs + tags. So this would break:

   Users: vm=vm1

It will not yield results if we make the view free of null tags oin case the user doesn't have tag, even though the user have permissions on that VM.

The name of the view is confusing and the implementation of the search is poor because it performs sub-optimal query, to say the least, just to aggregate info on Users with vms or tags

(Originally by Roy Golan)

Comment 34 RHV bug bot 2018-05-29 11:44:12 UTC

That is probably why the code was always including the vdc_users_with_tags in the search. So this obsoletes my first fix.

(Originally by Roy Golan)

Comment 39 RHV bug bot 2018-05-29 11:44:38 UTC

We can validate the bug in 4.2.4 version.

Could you please provide  validation instructions/notes for this fix? 

Should we just follow reproduction steps descried in the problem description:
1.Crete 100 users
2.Assign these users to 100VM (disk operator)  - [Daniel] WHAT is "100VM"?
3.Go to tab Users or execute sql:
SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;


Any significance for the distribution of users to specific role to asset?

(Originally by Daniel Gur)

Comment 40 Roy Golan 2018-05-31 07:57:14 UTC

Tips for verification:

- Verify the setup is made of thousands of users and > 100k records
  in vdc_uesrs_with_tags - to achieve this follow [1]

- Verify that the search from UI tab for 'Users' and 'Users type=user'
  performs under 500ms

- Verify the generated query for both 'Users' and 'Users: type=user'
  is not including a join on vdc_users_with_tags (you can consult the
  unit test patch[2])

- To quickly output the generated query from search into the engine.log:
    log-control.sh org.ovirt.engine.core.bll.SearchQuery DEBUG



[1] env setup 
   - create a diskless VM, make a template out of it
   - create 5 pools from that template with say 100 vms
   - make 100 users to be the admin of each pool.
     That would translate into inherited permission of
     on each of the VMs and will create 100 * 500 permission records in
     vdc_uesrs_with_tags table

[2] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/searchbackend/src/test/java/org/ovirt/engine/core/searchbackend/SyntaxContainerTest.java

Comment 42 mlehrer 2018-06-07 09:30:40 UTC

(In reply to Roy Golan from comment #40)
> Tips for verification:
> 
> - Verify the setup is made of thousands of users and > 100k records
>   in vdc_uesrs_with_tags - to achieve this follow [1]
> 
> - Verify that the search from UI tab for 'Users' and 'Users type=user'
>   performs under 500ms
> 
> - Verify the generated query for both 'Users' and 'Users: type=user'
>   is not including a join on vdc_users_with_tags (you can consult the
>   unit test patch[2])
> 
> - To quickly output the generated query from search into the engine.log:
>     log-control.sh org.ovirt.engine.core.bll.SearchQuery DEBUG
> 
> 
> 
> [1] env setup 
>    - create a diskless VM, make a template out of it
>    - create 5 pools from that template with say 100 vms
>    - make 100 users to be the admin of each pool.
>      That would translate into inherited permission of
>      on each of the VMs and will create 100 * 500 permission records in
>      vdc_uesrs_with_tags table
> 
> [2]
> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/
> searchbackend/src/test/java/org/ovirt/engine/core/searchbackend/
> SyntaxContainerTest.java


#What was done:
created 5 pools of 100 Vms each.
create 380 users in internal domain via ovirt-aaa-jdbc-tool user add
then used ovirt_users module and ovirt_permissions to update permissions.


Click "users" tab (very slow, paging between users is slow)

Postgres shows:
2018-06-02 20:02:59.712 UTC LOG:  duration: 60818.421 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100                                                                                  
2018-06-02 20:03:05.418 UTC LOG:  duration: 65654.598 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100                                                                                  
2018-06-02 20:04:23.450 UTC LOG:  duration: 78011.641 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100  


#Upgraded to 4.2.4

From UI clicking Users tab snappy and fast, no queries listed over 500 ms duration via postgres duration logging.
Checked browsing Users tabs, paginating of Users tab result pages - both of which previously were very difficult to use.

Additionally checked these searches and they were fine.

"Users:",
"Users: type=user",
"Users: type=group",
"Hosts:",
"Vms:",
"Datacenters:",
"Clusters:"

engine=# select count(*) from vdc_users_with_tags;
 count
-------
 14221
(1 row)

Comment 44 errata-xmlrpc 2018-06-27 10:02:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2071

Comment 45 Franta Kust 2019-05-16 13:06:55 UTC

BZ<2>Jira Resync

Comment 46 Daniel Gur 2019-08-28 13:13:44 UTC

sync2jira

Comment 47 Daniel Gur 2019-08-28 13:17:58 UTC

sync2jira