Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1578357

Summary:

[SCALE] Listing users in Users tab overloads the postgresql DB (CPU)

Product:

Red Hat Enterprise Virtualization Manager

Reporter:

Roman Hodain <rhodain>

Component:

ovirt-engine

Assignee:

Roy Golan <rgolan>

Status:

CLOSED ERRATA

QA Contact:

mlehrer

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

4.1.10

CC:

emesika, gveitmic, joboyer, knakai, lsurette, lsvaty, mgoldboi, mlehrer, mperina, pstehlik, rgolan, Rhev-m-bugs, rhodain, srevivo

Target Milestone:

ovirt-4.3.0

Keywords:

Performance, ZStream

Target Release:

4.3.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

ovirt-engine-4.3.0_alpha

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1583619 (view as bug list)

Environment:

Last Closed:

2019-05-08 12:37:35 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Scale

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1583619

Attachments:

Description	Flags
execution plan (condition outside of the subquery)	none
execution plan (condition inside of the subquery)	none

Description Roman Hodain 2018-05-15 11:48:36 UTC

Description of problem:
When a high number of users are created in the environment and these users are assigned to different entities the system can get overloaded because of the Postgresql CPU load.

Version-Release number of selected component (if applicable):
4.1.10

How reproducible:
100%

Steps to Reproduce:
1.Crete 100 users
2.Assign these users to 100VM (disk operator)
3.Go to tab Users or execute sql:
SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;


Actual results:
each thread handling the execution of this SQL query takes 100% of a CPU core.

Expected results:
The sql is executed fast.

Additional info:
This issue is caused by the where condition within the subquery that forces the PostgreSQL to use inefficient execution plan. The performance of the query increases by simply moving the where condition outside the subquery.

Comment 1 Roman Hodain 2018-05-15 11:56:07 UTC

Created attachment 1436781 [details]
execution plan (condition outside of the subquery)

explain analyze SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags )) and  vdc_users.user_group = 'user'  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;

Comment 2 Roman Hodain 2018-05-15 11:57:22 UTC

Created attachment 1436782 [details]
execution plan (condition inside of the subquery)

The sql is not analyzed as the execution doe sot finish in reasonable time.

explain SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;

Comment 3 Roman Hodain 2018-05-15 12:00:41 UTC

# select count(*) from vdc_users_with_tags;
 count  
--------
 711945
(1 row)

# select count(*) from users;
 count 
-------
  9353
(1 row)

I have done the same tests with Postgres 9.5. The unmodified sql takes ~16sec and the modified query takes almost nothing.

As this is generated sql. Can we modify the code to move the condition outside of the subquery?

Comment 4 Roy Golan 2018-05-15 20:05:19 UTC

Bug 1510432 which is merged to 4.2 will possibly solve that, I just need to verity the fix is doing that for tag searches like here. It handles 2 things - remove the generated 'distinct', thus choosing simpler query plans for most searches. Specifically it won't do the heaviest part here which is the huge sort, and choose disk space to do that. 

Bug 1532802 might be related as well, didn't dive deep into the details.

Roman your suggestion may be good, I suspect it may miss results though.

Comment 6 Roman Hodain 2018-05-16 05:51:00 UTC

Regarding the work_mem. The execution plan does not use external sorts. Everything seems to fit into memory. I have also done some tests with much higher work mem than we introduced (256MB). The result is the same.

I have also checked with RHV 4.2.2.6 and the results are the same. The sql query is generated in the same way as before.

Comment 8 Roy Golan 2018-05-16 07:15:11 UTC

(In reply to Roman Hodain from comment #6)
> Regarding the work_mem. The execution plan does not use external sorts.
> Everything seems to fit into memory.

You mentioned the analyze didn't complete in reasonable time, so we don't see the whole plan. Also in the modified query analysis I clearly see external sort -  
  "Sort Method: external merge  Disk: 91384kB"

Is it possible to wait for the analysis to be done?

Comment 11 Roman Hodain 2018-05-16 08:02:12 UTC

(In reply to Roy Golan from comment #8)
> (In reply to Roman Hodain from comment #6)
> > Regarding the work_mem. The execution plan does not use external sorts.
> > Everything seems to fit into memory.
> 
> You mentioned the analyze didn't complete in reasonable time, so we don't
> see the whole plan. Also in the modified query analysis I clearly see
> external sort -  
>   "Sort Method: external merge  Disk: 91384kB"
> 
> Is it possible to wait for the analysis to be done?

regarding the external sort, yes it is in the modified query I missed that. Anyway, it is 90MB which may be too much for increasing the work_mem to this amount.

I have tried to let it finish, but it took more than one hour and it was still running. I have increased the work_mem to 1GB just for the curiosity. I will let you know if it finishes.

Comment 17 Roy Golan 2018-05-17 08:24:36 UTC

Roman, I agree the WHERE condition should be outside, instead of generating a huge inner dataset to pass on (that's completely foolish thing to do inside the inner query on a the right table... like always, generated queries eventually hit you).

We can either fix that by doing so and another option is to completly remove the join on the tag table, which I seriously doubt is needed.

Eli do you know why the join on vdc_users_with_tags is needed at all?

Comment 18 Eli Mesika 2018-05-17 11:05:23 UTC

(In reply to Roy Golan from comment #17)

> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.
> 
> Eli do you know why the join on vdc_users_with_tags is needed at all?

This join should be used only when the search includes tags, if it is generated in other cases then this is a bug that we should fix

Comment 19 Roman Hodain 2018-05-17 11:21:47 UTC

(In reply to Roy Golan from comment #17)
> Roman, I agree the WHERE condition should be outside, instead of generating
> a huge inner dataset to pass on (that's completely foolish thing to do
> inside the inner query on a the right table... like always, generated
> queries eventually hit you).
> 
> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.
> 
> Eli do you know why the join on vdc_users_with_tags is needed at all?

Hi Roy,

would it be worth to go both ways? We can remove the tags, but moving the condition outside could help in other cases. Sooner or later we may hit a similar issue in a different environment in a similar query.

Comment 22 Yaniv Kaul 2018-05-18 06:39:42 UTC

(In reply to Roy Golan from comment #17)
> Roman, I agree the WHERE condition should be outside, instead of generating
> a huge inner dataset to pass on (that's completely foolish thing to do
> inside the inner query on a the right table... like always, generated
> queries eventually hit you).
> 
> We can either fix that by doing so and another option is to completly remove
> the join on the tag table, which I seriously doubt is needed.

Let's begin with the WHERE outside. Roman, does that help? 
> 
> Eli do you know why the join on vdc_users_with_tags is needed at all?

Roman, are they using tags?

Comment 24 Roman Hodain 2018-05-18 13:58:16 UTC

(In reply to Yaniv Kaul from comment #22)
> (In reply to Roy Golan from comment #17)
> > Roman, I agree the WHERE condition should be outside, instead of generating
> > a huge inner dataset to pass on (that's completely foolish thing to do
> > inside the inner query on a the right table... like always, generated
> > queries eventually hit you).
> > 
> > We can either fix that by doing so and another option is to completly remove
> > the join on the tag table, which I seriously doubt is needed.
> 
> Let's begin with the WHERE outside. Roman, does that help? 
Yes it helps significantly.
> > 
> > Eli do you know why the join on vdc_users_with_tags is needed at all?
> 
> Roman, are they using tags?
No, they currently suffer from this issue when simply clicking on the tab or calling RestAPI on /users

Comment 25 Yaniv Kaul 2018-05-18 14:07:00 UTC

(In reply to Roman Hodain from comment #24)
> (In reply to Yaniv Kaul from comment #22)
> > (In reply to Roy Golan from comment #17)
> > > Roman, I agree the WHERE condition should be outside, instead of generating
> > > a huge inner dataset to pass on (that's completely foolish thing to do
> > > inside the inner query on a the right table... like always, generated
> > > queries eventually hit you).
> > > 
> > > We can either fix that by doing so and another option is to completly remove
> > > the join on the tag table, which I seriously doubt is needed.
> > 
> > Let's begin with the WHERE outside. Roman, does that help? 
> Yes it helps significantly.
> > > 
> > > Eli do you know why the join on vdc_users_with_tags is needed at all?
> > 
> > Roman, are they using tags?
> No, they currently suffer from this issue when simply clicking on the tab or
> calling RestAPI on /users

So the above sounds like our hotfix.

Comment 26 Martin Perina 2018-05-18 14:38:55 UTC

(In reply to Yaniv Kaul from comment #25)
> (In reply to Roman Hodain from comment #24)
> > (In reply to Yaniv Kaul from comment #22)
> > > (In reply to Roy Golan from comment #17)
> > > > Roman, I agree the WHERE condition should be outside, instead of generating
> > > > a huge inner dataset to pass on (that's completely foolish thing to do
> > > > inside the inner query on a the right table... like always, generated
> > > > queries eventually hit you).
> > > > 
> > > > We can either fix that by doing so and another option is to completly remove
> > > > the join on the tag table, which I seriously doubt is needed.
> > > 
> > > Let's begin with the WHERE outside. Roman, does that help? 
> > Yes it helps significantly.
> > > > 
> > > > Eli do you know why the join on vdc_users_with_tags is needed at all?
> > > 
> > > Roman, are they using tags?
> > No, they currently suffer from this issue when simply clicking on the tab or
> > calling RestAPI on /users
> 
> So the above sounds like our hotfix.

So far there is no hotfix, the SQL is generated automatically by SearchEngine, there is no way how to use modified query without modifying SearchEngine SQL generation. And this is very delicate and it will probably take a few more days to have a patch which don't break other flows.

Comment 27 Roy Golan 2018-05-21 08:22:38 UTC

> --------
>  711945
> (1 row)
> 
> # select count(*) from users;
>  count 
> -------
>   9353
> (1 row)

Another small observation:
The view 'vdc_users_with_tags' have 700,000 rows while there are only 10,000 users. are there 70 tags for each user? This is possibly a bug in the view.

Can we get a dump to look into it? if there is a bug there we might be able to tweak the view, it's way easier and less risky.

More thoughts on this:
- the users with tags should not be crossed join if the search query isn't looking for tags at all.
- I think anyhow Roman's suggestion of moving the WHERE at the outer query is correct and should be included.

Comment 28 Eli Mesika 2018-05-21 14:24:10 UTC

Roman 

Can you specify please the search string as it is passed to the search engine ?
AFAIK , tag views are not involved as long as you are not searching for tags.

Thanks

Comment 29 Roy Golan 2018-05-22 08:32:51 UTC

Apart for the initial fix I propose to revisit the vdc_users_with_tags view to contain only records with tag that is not null. This should narrow down the view table dramatically. moreover, it will serve as an easy hot fix

currently select user_id,tag_id from vdc_users_with_tags:

              user_id                | tag_id 
--------------------------------------+--------
 c4335808-3cdb-11e8-867d-0242ac130002 | 
...

And the row number is >= the number of VMs this user has permission on

Comment 30 Roy Golan 2018-05-22 08:38:53 UTC

Extending my last comment - the change to the view will not be enough since the search is producing where on the user_id in the tag table. So the default (wrongly generated) search will return zero results.

Comment 31 Martin Perina 2018-05-22 09:03:04 UTC

I think we should do both, the patches to fix SearchEngine and another patch to remove null tags from users_with_tags view.

Comment 32 Eli Mesika 2018-05-22 09:56:37 UTC

(In reply to Martin Perina from comment #31)
> I think we should do both, the patches to fix SearchEngine and another patch
> to remove null tags from users_with_tags view.

Worth a different BZ IMO

Comment 33 Roy Golan 2018-05-22 10:43:57 UTC

I think we can't do it, because this search is meant for a general search on User/groups + VMs + tags. So this would break:

   Users: vm=vm1

It will not yield results if we make the view free of null tags oin case the user doesn't have tag, even though the user have permissions on that VM.

The name of the view is confusing and the implementation of the search is poor because it performs sub-optimal query, to say the least, just to aggregate info on Users with vms or tags

Comment 34 Roy Golan 2018-05-22 10:47:51 UTC

That is probably why the code was always including the vdc_users_with_tags in the search. So this obsoletes my first fix.

Comment 39 Daniel Gur 2018-05-29 11:03:43 UTC

We can validate the bug in 4.2.4 version.

Could you please provide  validation instructions/notes for this fix? 

Should we just follow reproduction steps descried in the problem description:
1.Crete 100 users
2.Assign these users to 100VM (disk operator)  - [Daniel] WHAT is "100VM"?
3.Go to tab Users or execute sql:
SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100;


Any significance for the distribution of users to specific role to asset?

Comment 41 Martin Perina 2018-05-29 11:47:00 UTC

(In reply to Daniel Gur from comment #39)
> We can validate the bug in 4.2.4 version.
> 
> Could you please provide  validation instructions/notes for this fix? 
> 
> Should we just follow reproduction steps descried in the problem description:
> 1.Crete 100 users
> 2.Assign these users to 100VM (disk operator)  - [Daniel] WHAT is "100VM"?

Each user should have a DiskOperator (and UserRole probably) permission on single VM

> 3.Go to tab Users or execute sql:
> SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct
> vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE 
> vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1)
> LIMIT 100;

If you click on Users tab, you should observe following:

1. It's much faster than in previous version
2. You should verify that SQL execuited to fetch data into the view is not using joing to vdc_users_with_tags table unless specific VM or tag is added to the search.

Comment 42 mlehrer 2018-06-04 10:05:37 UTC

(In reply to Martin Perina from comment #41)
> (In reply to Daniel Gur from comment #39)
> > We can validate the bug in 4.2.4 version.
> > 
> > Could you please provide  validation instructions/notes for this fix? 
> > 
> > Should we just follow reproduction steps descried in the problem description:
> > 1.Crete 100 users
> > 2.Assign these users to 100VM (disk operator)  - [Daniel] WHAT is "100VM"?
> 
> Each user should have a DiskOperator (and UserRole probably) permission on
> single VM
> 
> > 3.Go to tab Users or execute sql:
> > SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct
> > vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE 
> > vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1)
> > LIMIT 100;
> 
> If you click on Users tab, you should observe following:
> 
> 1. It's much faster than in previous version
> 2. You should verify that SQL execuited to fetch data into the view is not
> using joing to vdc_users_with_tags table unless specific VM or tag is added
> to the search.


#What was done:
created 5 pools of 100 Vms each.
create 380 users in internal domain via ovirt-aaa-jdbc-tool user add
then used ovirt_users module and ovirt_permissions to update permissions.


Click "users" tab (very slow, paging between users is slow)

Postgres shows:
2018-06-02 20:02:59.712 UTC LOG:  duration: 60818.421 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100                                                                                  
2018-06-02 20:03:05.418 UTC LOG:  duration: 65654.598 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100                                                                                  
2018-06-02 20:04:23.450 UTC LOG:  duration: 78011.641 ms  execute <unnamed>: SELECT * FROM (SELECT * FROM vdc_users WHERE ( user_id IN (SELECT distinct vdc_users_with_tags.user_id FROM  vdc_users_with_tags   WHERE  vdc_users.user_group = 'user' ))  ORDER BY name ASC ) as T1 OFFSET (1 -1) LIMIT 100  


#Upgraded to 4.2.4

From UI clicking Users tab snappy and fast, no queries listed over 500 ms duration via postgres duration logging.
Checked browsing Users tabs, paginating of Users tab result pages - both of which previously were very difficult to use.

Additionally checked these searches and they were fine.

"Users:",
"Users: type=user",
"Users: type=group",
"Hosts:",
"Vms:",
"Datacenters:",
"Clusters:"

engine=# select count(*) from vdc_users_with_tags;
 count
-------
 14221
(1 row)

Comment 44 mlehrer 2019-02-24 12:18:48 UTC

#What was done:
created 5 pools of 100 Vms each.
create 500 users in internal domain via ovirt-aaa-jdbc-tool user add
then used ovirt_users module and ovirt_permissions to update permissions.

engine=# select count(*) from vdc_users_with_tags;
 count 
-------
 18590
(1 row)

engine=# 


Running ovirt-engine-4.3.0.4-0.1.el7.noarch & vdsm-4.30.9-1.el7ev.x86_64
Checked browsing Users tabs, paginating of Users tab result pages.
Additionally checked these searches1[1] and they were fine did see:
db=engine,user=engine,app=PostgreSQL JDBC Driver,client=127.0.0.1 LOG:  duration: 593.434 ms  execute <unnamed>: select * from  getpermissionsbyadelementid($1, $2, $3, $4)
but this database is running both ovirt_engine and ovirt_engine_history so seems fair moving to verified.

[1] Searches
"Users:",
"Users: type=user",
"Users: type=group",
"Hosts:",
"Vms:",
"Datacenters:",
"Clusters:"

Comment 46 errata-xmlrpc 2019-05-08 12:37:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 47 Daniel Gur 2019-08-28 13:15:01 UTC

sync2jira

Comment 48 Daniel Gur 2019-08-28 13:20:04 UTC

sync2jira